Prepare data

The third data operation of this tutorial consists of preparing data within BigQuery tables.

🗺️ Overview

The objective of this step is to create new BigQuery tables into which we will load and reorganize the contents of the tables created in the previous step. As in most cases, this will happen within one BigQuery dataset. For this, we will need (the resulting folder layout is sketched after this list):

  • a JSON file to configure the data operation,

  • a JSON file to trigger the workflow,

  • a JSON file for each table creation,

  • and a SQL file for each transfer of data into our new tables.
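If you follow the file names used in the steps below, the 3-Prepare-data folder will end up looking like this:

tailer-demo/
└── 3-Prepare-data/
    ├── 000099-tailer-demo-prepare-data.json   (data operation configuration)
    ├── load_sales.sql                         (SQL query loading the sales table)
    └── workflow.json                          (workflow trigger configuration)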

📄 Create your files

Create the JSON file that configures the data operation

  1. Access your tailer-demo folder.

  2. Inside, create a folder named 3-Prepare-data for this new step.

  3. In this folder, create a JSON file named 000099-tailer-demo-prepare-data.json for your data operation.

  4. The data operation will load a temporary table and then, if the query runs correctly, copy it into the target table. Copy the following contents into your file:

    {
      "$schema": "http://jsonschema.tailer.ai/schema/table-to-table-veditor",
      "configuration_type": "table-to-table",
      "configuration_id": "000099-load_my_gbq_dataset_my_table",
      "short_description": "Prepare data for the Tailer demo",
      "account": "000099",
      "environment": "DEV",
      "activated": true,
      "archived": false,
      "start_date": "2023, 1, 23",
      "catchup": false,
      "schedule_interval": "None",
      "default_gcp_project_id": "my-gcp-project",
      "default_bq_dataset": "my_gbq_dataset",
      "default_write_disposition": "WRITE_TRUNCATE",
      "task_dependencies": [
        "load_temp_sales >> swap_sales_tables"
      ],
      "workflow": [
        {
          "task_type": "sql",
          "id": "load_temp_sales",
          "short_description": "Load temp sales table",
          "bq_dataset": "temp",
          "table_name": "sales",
          "sql_file": "load_sales.sql"
        },
        {
          "id": "swap_sales_tables",
          "task_type": "copy_gbq_table",
          "source_gcp_project_id": "my-gcp-project",
          "source_bq_dataset": "temp",
          "source_bq_table": "sales",
          "bq_dataset": "my-gcp-project",
          "destination_bq_table": "sales"
        }
      ]
    }
  5. Edit the following values:
     ◾ Replace my-gcp-project with the ID of the GCP project containing your BigQuery dataset in the default_gcp_project_id and source_gcp_project_id parameters.
     ◾ Replace my_gbq_dataset with the name of your working dataset.
     ◾ Also replace the dataset in the configuration_id.

Create a SQL file

Create a SQL file in the same directory and name it load_sales.sql. It must contain a query that will load the sales table.
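
Here is a minimal sketch of what load_sales.sql could contain. The source table my-gcp-project.my_gbq_dataset.sales_raw and its columns are placeholders: adapt the query to the tables you created in the previous step. The query result is written to the temp.sales destination defined in the load_temp_sales task, so the file only needs to contain a SELECT statement.

-- Minimal example: reorganize the raw sales data loaded in the previous step.
-- The source table and column names below are placeholders; adapt them to your own schema.
SELECT
  sale_id,
  sale_date,
  product_id,
  quantity,
  unit_price,
  quantity * unit_price AS total_amount
FROM `my-gcp-project.my_gbq_dataset.sales_raw`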

Create the JSON file that triggers the workflow

Inside the 3-Prepare-data folder, create a file named workflow.json.

Copy the following contents into your file:

{
    "$schema": "http://jsonschema.tailer.ai/schema/workflow-veditor",
    "configuration_type": "workflow",
    "version":"2",
    "configuration_id": "000099-trigger_load_my_gbq_dataset_my_table_DEV",
    "short_description": "Launch the Tailer demo data load",
    "account": "000099",
    "environment": "PROD",
    "activated": true,
    "archived": false,
    "gcp_project_id": "my-project",
    "schedule_interval_reset": "None",
    "authorized_job_ids": [
      "storage-to-tables|000099|000099-tailer-demo-load-files-YOUR-NAME|DEV|sales_{{FD_BLOB_8}}-{{FD_DATE}}.csv"
    ],
    "target_dag": {
        "configuration_type":"table-to-table",
        "configuration_id":"000099-load_my_gbq_dataset_my_table_DEV"
    }
}

This workflow will trigger our Table-to-Table operation, which loads the sales table, each time a sales file is ingested by our Storage-to-Tables operation.

The authorized_job_ids parameter defines the job that triggers the target job. We need to insert here the job_id of the Storage-to-Tables operation. You can find it in Tailer Studio: navigate to the Storage-to-Tables runs section, find the last run for a sales file, and open the Run Details tab. Search for the job_id and copy it into the authorized_job_ids section.

Replace the configuration_id in the target_dag section with your configuration's configuration_id, concatenated with "_DEV" (since its environment is DEV): for example, 000099-load_my_gbq_dataset_my_table becomes 000099-load_my_gbq_dataset_my_table_DEV. Also replace my-gcp-project in the gcp_project_id parameter with the ID of your GCP project.

▶️ Deploy the data operation

Once your files are ready, you can deploy the data operation:

  1. Access your working folder by running the following command:

    cd "[path to your tailer folder]\tailer-demo\3-Prepare-data"
  2. To deploy the data operation, run the following command:

    tailer deploy configuration 000099-tailer-demo-prepare-data.json
  3. To deploy the workflow configuration, run the following command:

    tailer deploy configuration workflow.json

You may be asked to select a context (see this page for more information). If you haven't deployed any context, then choose "no context". You can also use the flag --context to specify the context of your choice, or NO_CONTEXT if that's what you want:

tailer deploy configuration 000099-tailer-demo-prepare-data.json --context NO_CONTEXT
tailer deploy configuration workflow.json --context NO_CONTEXT

Your data operation is now deployed: its status is visible in Tailer Studio, and the new tables will be created and loaded with data as soon as the workflow is triggered.

Check the data operation status in Tailer Studio

  1. Access Tailer Studio again.

  2. In the left navigation menu, select Tables-to-tables.

  3. In the Configurations tab, search for your data operation. You can see that its status is Activated, and in the top right corner that it has been deployed by you a few seconds ago.

  4. Check your workflow configuration as well. It should also be activated and deployed by you a few seconds ago.

  5. Copy a sales file into your Storage-to-Tables source bucket. It should automatically trigger a Storage-to-Tables run. If it's successful, it should in turn trigger the workflow and create a Table-to-Table run.

  6. When the run has completed successfully, you can check in BigQuery that your target table is loaded.

🚀 Further steps

We've seen here a very basic example of a Table-to-Table data operation that loads a temporary table and then copies it to the target destination when no error occurs.

We could go further, add different steps, and use other Tailer features, for instance (a possible extended task chain is sketched after this list):

  • create the temporary table using a task of type create_gbq_table. This way, you can specify a DDL for this table, add column descriptions, column types, and define partitioning and clustering fields.

  • add a task that performs tests using expectations or custom asserts,

  • add several SQL tasks.
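
As an illustration, such an extended data operation could chain its steps through task_dependencies as sketched below. The task ids create_temp_sales and check_sales are hypothetical names for the additional tasks you would declare in the workflow array; only the >> dependency syntax comes from the configuration shown earlier.

"task_dependencies": [
  "create_temp_sales >> load_temp_sales >> check_sales >> swap_sales_tables"
]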
