Tables to Tables script file

This is the description of the JSON configuration file of a Tables to Tables data operation.

The configuration file is in JSON format. It contains the following sections:

Global parameters: General information about the data operation.
Workflow task parameters: Information about the different tasks of the workflow.

📹 Tables to Tables script video

Here is an example of a TTT configuration file:

{
	"configuration_type": "table-to-table",
	"configuration_id": "000099_iowa_liquor_agg_store",
	"short_description": "Create_aggregation",
	"doc_md": "000099_iowa_liquor_agg_store.md",
	"account": "000099",
	"environment": "DEV",
	"activated": true,
	"archived": false,
	"direct_execution": true,
	"start_date": "2019, 1, 1",
	"catchup": false,
	"schedule_interval": "0 8 * * *",
	"max_active_runs": 1,
	"task_concurrency": 3,
	"default_gcp_project_id": "fd-io-jarvis-demo-dlk",
	"default_bq_dataset": "dlk_demo_iowa_liquor_bda_cluster",
	"default_write_disposition": "WRITE_TRUNCATE",
	"task_dependencies": [
		"create_table >> run_query"
	],
	"workflow": [
        {
            "id": "create_table",
            "short_description": "Create an empty table with the proper name and descriptions.",
            "doc_md": "iowa_liquor_agg_store.md",
            "write_disposition": "WRITE_TRUNCATE",
            "task_type": "create_gbq_table",
            "force_delete": true,
            "bq_table": "iowa_liquor_agg_store_2019",
            "bq_dataset": "dlk_demo_iowa_liquor_bda_cluster",
            "ddl_file": "iowa_liquor_agg_store_DDL.json"
        },
        {
            "id": "run_query",
            "short_description": "Run the query and store the results in a table.",
			"doc_md": "iowa_liquor_agg_store.md",
            "write_disposition": "WRITE_APPEND",
            "task_type": "sql",
            "sql_file": "iowa_liquor_agg_store.sql",
            "table_name": "iowa_liquor_agg_store_2019",
            "bq_dataset": "dlk_demo_iowa_liquor_bda_cluster"
        }
	]
}

🌐 Global parameters

General information about the data operation.

Parameter

Description

configuration_type

type: string

mandatory

Type of data operation.

For a TTT data operation, the value is always "table-to-table".

configuration_id

type: string

mandatory

ID of the data operation.

You can pick any name you want, but is has to be unique for this data operation type.

Note that in case of conflict, the newly deployed data operation will overwrite the previous one. To guarantee its uniqueness, the best practice is to name your data operation by concatenating:

your account ID,
the word "load",
and the target dataset or table.

short_description

type: string

optional

Short description of the context of the data operation.

doc_md

type: string

optional

Path to a file containing a detailed description of the data operation. The file must be in Markdown format.

environment

type: string

mandatory

Deployment context.

Values: PROD, PREPROD, STAGING, DEV.

account

type: string

mandatory

Your account ID is a 6-digit number assigned to you by your Tailer Platform administrator.

activated

type: boolean

optional

Flag used to enable/disable the execution of the data operation.

Default value: true

archived

type: boolean

optional

Flag used to enable/disable the visibility of the data operation's configuration and runs in Jarvis Studio.

Default value: false

start_date

type: string

mandatory

Start date of the data operation.

The format must be:

"YYYY, MM, DD"

Where:

YYYY >= 1970
MM = [1, 12]
DD = [1, 31]

schedule_interval

type: string

mandatory

A Tables to Tables data operation can be launched in two different ways:

If schedule_interval is set to "None", the data operation will need to be started with a Workflow, when a given condition is met. (This solution is recommended.)
If you want the data operation to start at regular intervals, you can define this in the schedule_interval parameter with a Cron expression.

Example

For the data operation to start everyday at 7:00, you need to set it as follows:

"schedule_interval": "0 7 * * *",

You can find online tools to help you edit your Cron expression (for example, crontab.guru).

max_active_runs

type: integer

optional

This parameter limits the number of concurrent runs for this data operation. As most data operations are run once daily, there is usually no need to set a value higher than 1 here.

Default value: 1

task_concurrency

type: integer

optional

This parameter limits the number of tasks that might run concurrently.

As a great volume of data might be handled by each task, it is important to make sure to avoid consuming too many resources for one data operation.

Make sure also that the value you set here is high enough for concurrent tasks set in the task_dependencies parameter to run properly.

Default value: 5

catchup

type: boolean

optional

This parameter allows you to specify if you want to execute the data operation runs that were supposed to happen between start_date and the actual deployment date.

Example

If you start receiving data from September 1st, but you only finish writing your code on September 7th, you might want to run your data operation from a date in the past: September 1st.

The catchup parameter can have two values:

If it is set to "true" AND a scheduling_interval is set AND start_date is set in the past, Composer/Airflow will execute every run of the data operation scheduled from the start date until the current date.
If it is set to "false", the data operation will only be executed starting from the current date.

⚠️ If the data operation is scheduled to happen frequently and/or the missed execution period is long, the amount of runs might be important. Make sure you have enough resources to handle all the executions when deploying a data operation with catchup set to "true".

⚠️ If you plan to store a version of your configuration file for further modifications (in a Git repository for example), it's recommended to store it with a catchup set to "false" to avoid to accidentally create a lot of runs when you will deploy and execute your next version in the future.

Default value: false

task_dependencies

type: array of strings

mandatory

The task_dependencies parameter allows you to create dependencies between the different tasks specified in the workflow parameter (see below). It will define in which order the workflow tasks will run, some of them running concurrently, others sequentially.

Usually workflow tasks will be run in the following order:

create_gbq_table
sql
copy_gbq_table

Syntax

The double chevron >> means that the first task needs to be completed before the next one can start.
The comma , means that the tasks will run concurrently.
The square brackets [ and ] allow you to define a set of tasks that will run together.

For detailed information about the syntax, refer to the Airflow documentation.

⚠️ Criticality and tasks chaining: If a warning or a transparent task has a child linked critical task, the job is rejected to prevent inconsistent dependencies or criticalities. See "criticality" parameter below for more details.

ℹ️ You can create conditional tasks that can break a dependency chain depending on predefined conditions without raising an error with the task parameter "criticality" set to "break". See "criticality" parameter below or this article for more information.

Example 1

We have the following tasks that we want to run sequentially: taskA (create_gbq_table), taskB (sql) and taskC (copy_gbq_table). The task_dependencies parameter will be as follows: "task_dependencies": [" taskA >> taskB >> taskC "],

Example 2

We have the following tasks that we want to run concurrently: taskA, taskB and taskC.

The task_dependencies parameter will be as follows: "task_dependencies": [" taskA, taskB, taskC "],

Example 3

We have the following 9 tasks we want to order: taskA, taskD, taskG (create_gbq_table), taskB, taskE, taskH (sql), taskC, taskF, taskI (copy_gbq_table). The task_dependencies parameter will be as follows:"task_dependencies": [" [taskA, taskD, taskG] >> [taskB, taskE, taskH] >> [taskC, taskF, taskI] "],

Example 4

In the example above, we want taskH to run before taskE so we can use its result for taskE.

The task_dependencies parameter will be as follows:

"task_dependencies": [" [taskA, taskD, taskG] >> taskH >> [taskB, taskE] >> [taskC, taskF, taskI] "],

workflow

type: array of maps

mandatory

List of tasks the data operations will execute.

Check the section below for detailed information on their parameters.

default_gcp_project_id

type: string

mandatory

Default GCP Project ID.

This parameter can be set for each workflow task sub-object, and will be overridden by that value if it is different.

default_bq_dataset

type: string

mandatory

Default BigQuery dataset ID.

This parameter can be set for each workflow task sub-object, and will be overridden by that value if it is different.

default_write_disposition

type: string

mandatory

Action that occurs if the destination table already exists (see Google BigQuery documentation).

Possible values:

"WRITE_TRUNCATE" (default): The run will write table data from the beginning. If the table already contained lines, they will all be deleted and replaced by the new lines. This option is used most of the time for daily runs to avoid duplicates.
"WRITE_APPEND": The run will append new lines to the table. When using this option, make sure not to run the data operation several times.
"WRITE_EMPTY": This option only allows adding data to an empty table. If the table already contains data, it returns an error. It is hardly ever used as data operations are usually run periodically, so they will always contain data after the first run.

This parameter can be set for each workflow task sub-object, and will be overridden by that value if it is different.

direct_execution

type: boolean

optional

Tailer's execution engine has been rewritten to switch from Airflow/Composer to a Kubernetes severless architecture, improving its speed, stability, security and scalability.

To use the new execution mode, the direct_execution parameter must be set to "true".

Default value: true (from January 5, 2021)

➿ Workflow task parameters

A Tables to Tables workflow can include for types of tasks:

create_gbq_table: This type of task allows you to create the skeleton of a table based on a DDL file. You could compare it to the baking pan of your table.
sql: This type of task allows you to fill that baking pan using a SQL query or a SQL script from a file.
copy_gbq_table: This type of task allows you to duplicate a table named X into another table named Y.
expectation: This type of task allows you to perform data quality tests. See expectations for more details.

For each workflow sub-object, parameters will differ depending on the task type.

Refer to this page to know how to create the DDL and SQL files corresponding to these tasks.

SQL task parameters

Parameter

Description

task_type

type: string

mandatory

The value has to be set to "sql" for this task type.

As "sql" is the default type, this parameter can be omitted for this task type.

type: string

mandatory

ID of the task. It must be unique within the data operation.

short_description

type: string

optional

Short description of what the task does.

doc_md

type: string

optional

Path to a file containing a detailed description of the task. The file must be in Markdown format.

sql_file

type: string

mandatory

Path to the file containing the actual SQL query or the SQL script (can be assertions or expectations). This file is going to be read and its content uploaded to Firestore upon deployment of the data operation.

gcp_project_id

type: string

optional

GCP Project ID.

Overrides default_gcp_project_id.

bq_dataset

type: string

optional

Name of the BigQuery destination dataset.

Overrides default_bq_dataset.

write_disposition

type: string

optional

Action that occurs if the destination table already exists (see Google BigQuery documentation). Possible values:

"WRITE_TRUNCATE" (default): The run will write table data from the beginning. If the table already contained lines, they will all be deleted and replaced by the new lines. This option is used most of the time for daily runs to avoid duplicates.
"WRITE_APPEND": The run will append new lines to the table. When using this option, make sure not to run the data operation several times.
"WRITE_EMPTY": This option only allows adding data to an empty table. If the table already contains data, it returns an error. It is hardly ever used as data operations are usually run periodically, so they will always contain data after the first run.

Overrides default_write_disposition.

table_name

type: string

mandatory if the SQL file contains a query, optional for a SQL script

Target table used upon SQL query execution. This parameter is optional if the SQL file contains a script, i.e. one or more statements separated by a ; (for ex. assertions or expectations). Otherwise, this parameter is mandatory.

criticality

type: string

optional

Task criticality which defines the impact of an error on this task on the data operation global status and behavior. It allows to create tasks dedicated to warning tests (see expectations) that won't break the execution if it fails or conditional tasks that can lead to different tasks depending on predefined conditions (see this article).

Criticality can be:

critical: if a critical task fails, the run stops and have the global status "failed".
warning: if a warning task fails, the run proceeds to the next task, executing the following dependent task if any, and the global status of the run will be "warning" (if no critical task fails afterward). The workflow manager regards warning as success, so any job chained by a workflow configuration will be triggered.
transparent: if a transparent task fails, the run proceeds to the next task, executing the following dependent task if any, and the global status of the run will be "success" (if no critical or warning task fails). Any job chained by a workflow configuration will be triggered.
break: if a break task fails, the run skips all the dependent tasks and proceeds to the independent tasks if any. The global status of the run will be "success" (if no critical or warning task fails). Any job chained by a workflow configuration will be triggered. You can use this criticality to create conditional tasks that lead to different tasks depending on predefined conditions. See this article for more information.

⚠️ Tasks chaining: We enforce rules at the validation of the configuration (at deployment time) to prevent inconsistent dependencies or criticalities in a chain of tasks. If a warning or a transparent task has a child linked critical task, the job is rejected. Indeed, if you have 3 tasks A: critical >> B: warning >> C: critical, then you have either an inconsistent dependency (B is not necessary for the task C and therefore should be parallelized with C) or an inconsistent criticality (B is necessary for the critical task C and therefore should be critical itself).

Default value: "critical"

temporary_table

type: boolean

optional

If set to "true", this parameter will force the deletion of the table at the end of the data operation. If the TTT data operation failed before deleting the temporary table, the table will be deleted automatically after 24 hours.

⚠️Please note that if this flag is set to "true", then the target table is preliminarily deleted if it exists. Description, partitionning and clustering are lost.

Default value: false

sql_query_template

type: string

optional

If you want to use variables in your SQL query or script, you need to set this parameter to "TEMPLATE_CURRENT_DATE" (only supported value). This variable will be set to the execution date of the data operation (and not today's date).

For example, if you want to retrieve data corresponding to the execution date, you can use the following instruction:

WHERE sale_date = DATE('{{TEMPLATE_CURRENT_DATE}}')

Table Creation task parameters

Parameter

Description

task_type

type: string

mandatory

In this case, the value has to be set to "create_gbq_table".

type: string

mandatory

ID of the task. It must be unique within the data operation.

short_description

type: string

optional

Short description of what the task does.

doc_md

type: string

optional

Path to a file containing a detailed description of the task. The file must be in Markdown format.

gcp_project_id

type: string

optional

GCP Project ID.

Overrides default_gcp_project_id.

bq_dataset

type: string

optional

Name of the BigQuery destination dataset.

Overrides default_bq_dataset.

bq_table

type: string

mandatory

Name of the BigQuery table name.

ddl_file

type: string

mandatory

Path to the JSON file containing DDL information to create the table.

force_delete

type: boolean

optional

If set to "true", this parameter will force the deletion of the table prior to its creation.

criticality

type: string

optional

Task criticality which defines the impact of an error on this task on the data operation global status and behavior.

Criticality can be: critical, warning, transparent or break.

See "criticality" description in the SQL task above for more details

Default value: "critical"

Table copy task parameters

Parameter

Description

task_type

type: string

mandatory

In this case, the value has to be set to "copy_gbq_table".

type: string

mandatory

ID of the task. It must be unique within the data operation.

short_description

type: string

optional

Short description of what the task does.

doc_md

type: string

optional

Path to a file containing a detailed description. The file must be in Markdown format.

source_gcp_project_id

type: string

mandatory

GCP project ID for the source BigQuery table.

source_bq_dataset

type: string

mandatory

BigQuery dataset for the source table.

source_bq_table

type: string

mandatory

Name of the source table.

gcp_project_id

type: string

optional

GCP project ID for the destination BigQuery table.

Overrides default_gcp_project_id.

bq_dataset

type: string

optional

Name of the BigQuery destination dataset.

Overrides default_bq_dataset.

destination_bq_table

type: string

mandatory

Name of the BigQuery destination table.

destination_bq_table_date_suffix

type: boolean

mandatory

Enables/Disables the date suffix to allow date partitioning in BigQuery.

destination_bq_table_date_suffix_format

type: string

mandatory if destination_bq_table_date_suffix is set to true

Date format for the table suffix.

As of now, the only possible value is "%Y%m%d".

criticality

type: string

optional

Task criticality which defines the impact of an error on this task on the data operation global status and behavior.

Criticality can be: critical, warning, transparent or break.

See "criticality" description in the SQL task above for more details

Default value: "critical"

PreviousDDL script file NextLaunch configuration and furthermore

Last updated 2 years ago

Was this helpful?