Tables to Tables configuration file
This is the description of the JSON configuration file of a Tables to Tables (TTT) data operation.
The configuration file contains the following sections:
Global parameters: General information about the data operation.
Workflow task parameters: Information about the different tasks of the workflow.
Example
Here is an example of a TTT configuration file:
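The example below is a minimal illustrative sketch assembled from the parameters documented on this page; all account, project, dataset, table and file names (000099, my-gcp-project, my_dataset, load_daily_sales.sql, etc.) are placeholders, not values from a real deployment.

```json
{
  "configuration_type": "table-to-table",
  "configuration_id": "000099_load_my_dataset",
  "short_description": "Load daily sales data into the datamart",
  "doc_md": "readme.md",
  "environment": "DEV",
  "account": "000099",
  "activated": true,
  "archived": false,
  "start_date": "2023, 1, 1",
  "schedule_interval": "None",
  "max_active_runs": 1,
  "task_concurrency": 3,
  "catchup": false,
  "default_gcp_project_id": "my-gcp-project",
  "default_bq_dataset": "my_dataset",
  "default_write_disposition": "WRITE_TRUNCATE",
  "task_dependencies": [
    "create_daily_sales_table >> load_daily_sales >> copy_daily_sales_table"
  ],
  "workflow": [
    {
      "task_type": "create_gbq_table",
      "id": "create_daily_sales_table",
      "bq_table": "daily_sales",
      "ddl_file": "ddl/daily_sales.json",
      "force_delete": true
    },
    {
      "task_type": "sql",
      "id": "load_daily_sales",
      "sql_file": "sql/load_daily_sales.sql",
      "table_name": "daily_sales",
      "write_disposition": "WRITE_TRUNCATE"
    },
    {
      "task_type": "copy_gbq_table",
      "id": "copy_daily_sales_table",
      "source_gcp_project_id": "my-gcp-project",
      "source_bq_dataset": "my_dataset",
      "source_bq_table": "daily_sales",
      "destination_bq_table": "daily_sales_copy"
    }
  ]
}
```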
Global parameters
General information about the data operation.
$schema
type: string
optional
The URL of the JSON schema that contains the properties your configuration must satisfy. Most code editors can use it to validate your configuration, display help boxes, and highlight issues.
configuration_type
type: string
mandatory
Type of data operation.
For a TTT data operation, the value is always "table-to-table".
configuration_id
type: string
mandatory
ID of the data operation.
You can pick any name you want, but it has to be unique for this data operation type.
Note that in case of conflict, the newly deployed data operation will overwrite the previous one. To guarantee its uniqueness, the best practice is to name your data operation by concatenating:
your account ID,
the word "load",
and the target dataset or table.
short_description
type: string
optional
Short description of the context of the data operation.
doc_md
type: string
optional
Path to a file containing a detailed description of the data operation. The file must be in Markdown format.
environment
type: string
mandatory
Deployment context.
Values: PROD, PREPROD, STAGING, DEV.
account
type: string
mandatory
Your account ID is a 6-digit number assigned to you by your Tailer Platform administrator.
activated
type: boolean
optional
Flag used to enable/disable the execution of the data operation.
Default value: true
archived
type: boolean
optional
Flag used to enable/disable the visibility of the data operation's configuration and runs in Jarvis Studio.
Default value: false
start_date
type: string
mandatory
Start date of the data operation.
The format must be:
"YYYY, MM, DD"
Where:
YYYY >= 1970
MM = [1, 12]
DD = [1, 31]
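For example, a data operation meant to start on September 1st, 2023 (an illustrative date) would declare:

```json
"start_date": "2023, 9, 1"
```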
schedule_interval
type: string
mandatory
A Tables to Tables data operation can be launched in two different ways:
If schedule_interval is set to "None", the data operation will need to be started with a Workflow, when a given condition is met. (This solution is recommended.)
If you want the data operation to start at regular intervals, you can define this in the schedule_interval parameter with a Cron expression.
Example
For the data operation to start every day at 7:00, you need to set the parameter as follows:
"schedule_interval": "0 7 * * *",
You can find online tools to help you edit your Cron expression (for example, crontab.guru).
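Conversely, for a data operation started by a Workflow (the recommended approach described above), the parameter is simply set to the string "None":

```json
"schedule_interval": "None"
```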
max_active_runs
type: integer
optional
This parameter limits the number of concurrent runs for this data operation. As most data operations are run once daily, there is usually no need to set a value higher than 1 here.
Default value: 1
task_concurrency
type: integer
optional
This parameter limits the number of tasks that might run concurrently.
As each task might handle a large volume of data, it is important to avoid consuming too many resources for one data operation.
Also make sure the value you set here is high enough for the concurrent tasks defined in the task_dependencies parameter to run properly.
Default value: 5
catchup
type: boolean
optional
This parameter allows you to specify if you want to execute the data operation runs that were supposed to happen between start_date and the actual deployment date.
Example
If you start receiving data from September 1st, but you only finish writing your code on September 7th, you might want to run your data operation from a date in the past: September 1st.
The catchup parameter can have two values:
If it is set to "true" AND a schedule_interval is set AND start_date is set in the past, Composer/Airflow will execute every run of the data operation scheduled from the start date until the current date.
If it is set to "false", the data operation will only be executed starting from the current date.
⚠️ If the data operation is scheduled to happen frequently and/or the missed execution period is long, the number of runs might be significant. Make sure you have enough resources to handle all the executions when deploying a data operation with catchup set to "true".
⚠️ If you plan to store a version of your configuration file for further modifications (in a Git repository for example), it is recommended to store it with catchup set to "false" to avoid accidentally creating a lot of runs when you deploy and execute your next version.
Default value: false
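Taking the example above (dates and schedule are illustrative), a deployment on September 7th that should replay the runs since September 1st would combine:

```json
"start_date": "2023, 9, 1",
"schedule_interval": "0 7 * * *",
"catchup": true
```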
task_dependencies
type: array of strings
mandatory
The task_dependencies parameter allows you to create dependencies between the different tasks specified in the workflow parameter (see below). It will define in which order the workflow tasks will run, some of them running concurrently, others sequentially.
Usually workflow tasks will be run in the following order:
create_gbq_table
sql
copy_gbq_table
Syntax
The double chevron >> means that the first task needs to be completed before the next one can start.
The comma , means that the tasks will run concurrently.
The square brackets [ and ] allow you to define a set of tasks that will run together.
For detailed information about the syntax, refer to the Airflow documentation.
⚠️ Criticality and task chaining: If a warning or a transparent task has a critical child task, the job is rejected to prevent inconsistent dependencies or criticalities. See the "criticality" parameter below for more details.
ℹ️ You can create conditional tasks that break a dependency chain depending on predefined conditions, without raising an error, by setting the task parameter "criticality" to "break". See the "criticality" parameter below or this article for more information.
Example 1
We have the following tasks that we want to run sequentially: taskA (create_gbq_table), taskB (sql) and taskC (copy_gbq_table).
The task_dependencies parameter will be as follows: "task_dependencies": [" taskA >> taskB >> taskC "],
Example 2
We have the following tasks that we want to run concurrently: taskA, taskB and taskC.
The task_dependencies parameter will be as follows: "task_dependencies": [" taskA, taskB, taskC "],
Example 3
We have the following 9 tasks we want to order: taskA, taskD, taskG (create_gbq_table), taskB, taskE, taskH (sql), taskC, taskF, taskI (copy_gbq_table).
The task_dependencies parameter will be as follows:
"task_dependencies": [" [taskA, taskD, taskG] >> [taskB, taskE, taskH] >> [taskC, taskF, taskI] "],
Example 4
In the example above, we want taskH to run before taskE so we can use its result for taskE.
The task_dependencies parameter will be as follows:
"task_dependencies": [" [taskA, taskD, taskG] >> taskH >> [taskB, taskE] >> [taskC, taskF, taskI] "],
workflow
type: array of maps
mandatory
List of tasks the data operation will execute.
Check the section below for detailed information on their parameters.
default_gcp_project_id
type: string
mandatory
Default GCP Project ID.
This default can be overridden for each workflow task by setting the gcp_project_id parameter in the corresponding sub-object.
default_bq_dataset
type: string
mandatory
Default BigQuery dataset ID.
This default can be overridden for each workflow task by setting the bq_dataset parameter in the corresponding sub-object.
default_bq_data_location
type: string
optional
The default BigQuery data location.
Default value: EU
default_write_disposition
type: string
mandatory
Action that occurs if the destination table already exists (see Google BigQuery documentation).
Possible values:
"WRITE_TRUNCATE" (default): The run will write table data from the beginning. If the table already contained lines, they will all be deleted and replaced by the new lines. This option is used most of the time for daily runs to avoid duplicates.
"WRITE_APPEND": The run will append new lines to the table. When using this option, make sure not to run the data operation several times.
"WRITE_EMPTY": This option only allows adding data to an empty table. If the table already contains data, it returns an error. It is hardly ever used as data operations are usually run periodically, so they will always contain data after the first run.
This default can be overridden for each workflow task by setting the write_disposition parameter in the corresponding sub-object.
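As a sketch of how the global defaults interact with task-level overrides (project, dataset, file and table names are placeholders), a configuration may declare defaults and override one of them in a single task:

```json
{
  "default_gcp_project_id": "my-gcp-project",
  "default_bq_dataset": "my_dataset",
  "default_write_disposition": "WRITE_TRUNCATE",
  "workflow": [
    {
      "id": "append_daily_sales",
      "task_type": "sql",
      "sql_file": "sql/append_daily_sales.sql",
      "table_name": "daily_sales",
      "write_disposition": "WRITE_APPEND"
    }
  ]
}
```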
Workflow task parameters
A Tables to Tables workflow can include five types of tasks:
create_gbq_table: This type of task allows you to create the skeleton of a table based on a DDL file. You could compare it to the baking pan of your table.
sql: This type of task allows you to fill that baking pan using a SQL query or a SQL script from a file.
copy_gbq_table: This type of task allows you to duplicate a table named X into another table named Y.
expectation: This type of task allows you to perform data quality tests. See expectations for more details.
delete_gbq_table: This type of task allows you to delete a table or a partition of a table.
For each workflow sub-object, parameters will differ depending on the task type.
Refer to this page to know how to create the DDL and SQL files corresponding to these tasks.
SQL task parameters
task_type
type: string
optional
The value has to be set to "sql" for this task type.
As "sql" is the default type, this parameter can be omitted for this task type.
id
type: string
mandatory
ID of the task. It must be unique within the data operation.
short_description
type: string
optional
Short description of what the task does.
doc_md
type: string
optional
Path to a file containing a detailed description of the task. The file must be in Markdown format.
sql_file
type: string
mandatory
Path to the file containing the actual SQL query or the SQL script (can be assertions or expectations). This file is going to be read and its content uploaded to Firestore upon deployment of the data operation.
gcp_project_id
type: string
optional
GCP Project ID.
Overrides default_gcp_project_id.
bq_dataset
type: string
optional
Name of the BigQuery destination dataset.
Overrides default_bq_dataset.
write_disposition
type: string
optional
Action that occurs if the destination table already exists (see Google BigQuery documentation). Possible values:
"WRITE_TRUNCATE" (default): The run will write table data from the beginning. If the table already contained lines, they will all be deleted and replaced by the new lines. This option is used most of the time for daily runs to avoid duplicates.
"WRITE_APPEND": The run will append new lines to the table. When using this option, make sure not to run the data operation several times.
"WRITE_EMPTY": This option only allows adding data to an empty table. If the table already contains data, it returns an error. It is hardly ever used as data operations are usually run periodically, so they will always contain data after the first run.
Overrides default_write_disposition.
table_name
type: string
mandatory if the SQL file contains a query, optional for a SQL script
Target table used upon SQL query execution.
This parameter is optional if the SQL file contains a script, i.e. one or more statements separated by a ";" (for example assertions or expectations). Otherwise, this parameter is mandatory.
Note that you can use date templates such as FD_DATE. Example: my_table${{FD_DATE}} will be rendered with the TTT logical date, e.g. my_table$20240223.
criticality
type: string
optional
Task criticality, which defines the impact of an error on this task on the data operation's global status and behavior. It allows you to create tasks dedicated to warning tests (see expectations) that won't break the execution if they fail, or conditional tasks that can lead to different tasks depending on predefined conditions (see this article).
Criticality can be:
critical: if a critical task fails, the run stops and has the global status "failed".
warning: if a warning task fails, the run proceeds to the next task, executing the following dependent task if any, and the global status of the run will be "warning" (if no critical task fails afterward). The workflow manager regards warning as success, so any job chained by a workflow configuration will be triggered.
transparent: if a transparent task fails, the run proceeds to the next task, executing the following dependent task if any, and the global status of the run will be "success" (if no critical or warning task fails). Any job chained by a workflow configuration will be triggered.
break: if a break task fails, the run skips all the dependent tasks and proceeds to the independent tasks if any. The global status of the run will be "success" (if no critical or warning task fails). Any job chained by a workflow configuration will be triggered. You can use this criticality to create conditional tasks that lead to different tasks depending on predefined conditions. See this article for more information.
stop: if a stop task fails, the run skips all the dependent tasks and proceeds to the independent tasks if any. The global status of the run will be "success" (if no critical or warning task fails). Any job chained by a workflow configuration will not be triggered. You can use this criticality to create a conditional job that checks things regularly. As long as the check is false, then the chained jobs won't be triggered, and you won't have any error alert. When the check is true, then the chained jobs are triggered.
⚠️ Task chaining: We enforce rules when validating the configuration (at deployment time) to prevent inconsistent dependencies or criticalities in a chain of tasks. If a warning or a transparent task has a critical child task, the job is rejected. Indeed, if you have 3 tasks A: critical >> B: warning >> C: critical, then you have either an inconsistent dependency (B is not necessary for task C and therefore should be parallelized with C) or an inconsistent criticality (B is necessary for the critical task C and therefore should be critical itself).
Default value: "critical"
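For instance, a non-blocking data quality check (the task ID and SQL file below are hypothetical) could be declared as a warning task so that a failed test does not stop the run:

```json
{
  "task_type": "sql",
  "id": "check_sales_not_empty",
  "sql_file": "sql/check_sales_not_empty.sql",
  "criticality": "warning"
}
```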
temporary_table
type: boolean
optional
If set to "true", this parameter will force the deletion of the table at the end of the data operation. If the TTT data operation fails before deleting the temporary table, the table will be deleted automatically after 24 hours.
⚠️ Please note that if this flag is set to "true", the target table is first deleted if it already exists. Its description, partitioning and clustering settings are lost.
Default value: false
sql_query_template
type: string
optional
If you want to use variables in your SQL query or script, you need to set this parameter to "TEMPLATE_CURRENT_DATE" (the only supported value). This variable will be set to the execution date of the data operation (and not today's date).
For example, if you want to retrieve data corresponding to the execution date, you can use the following instruction:
WHERE sale_date = DATE('{{TEMPLATE_CURRENT_DATE}}')
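Putting the above parameters together, a SQL task entry of the workflow array could look like the following sketch (dataset, table and file names are placeholders):

```json
{
  "task_type": "sql",
  "id": "load_daily_sales",
  "short_description": "Build the daily sales table",
  "sql_file": "sql/load_daily_sales.sql",
  "gcp_project_id": "my-gcp-project",
  "bq_dataset": "my_dataset",
  "write_disposition": "WRITE_TRUNCATE",
  "table_name": "daily_sales",
  "criticality": "critical",
  "sql_query_template": "TEMPLATE_CURRENT_DATE"
}
```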
Table Creation task parameters
task_type
type: string
mandatory
In this case, the value has to be set to "create_gbq_table".
id
type: string
mandatory
ID of the task. It must be unique within the data operation.
short_description
type: string
optional
Short description of what the task does.
doc_md
type: string
optional
Path to a file containing a detailed description of the task. The file must be in Markdown format.
gcp_project_id
type: string
optional
GCP Project ID.
Overrides default_gcp_project_id.
bq_dataset
type: string
optional
Name of the BigQuery destination dataset.
Overrides default_bq_dataset.
bq_table
type: string
mandatory
Name of the BigQuery table.
ddl_file
type: string
mandatory
Path to the JSON file containing DDL information to create the table.
force_delete
type: boolean
optional
If set to "true", this parameter will force the deletion of the table prior to its creation.
criticality
type: string
optional
Task criticality, which defines the impact of an error on this task on the data operation's global status and behavior.
Criticality can be: critical, warning, transparent or break.
See the "criticality" description in the SQL task section above for more details.
Default value: "critical"
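As an illustration (dataset, table and file names are placeholders), a table creation task could be declared as follows:

```json
{
  "task_type": "create_gbq_table",
  "id": "create_daily_sales_table",
  "short_description": "Create the destination table for daily sales",
  "bq_dataset": "my_dataset",
  "bq_table": "daily_sales",
  "ddl_file": "ddl/daily_sales.json",
  "force_delete": true,
  "criticality": "critical"
}
```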
Table copy task parameters
task_type
type: string
mandatory
In this case, the value has to be set to "copy_gbq_table".
id
type: string
mandatory
ID of the task. It must be unique within the data operation.
short_description
type: string
optional
Short description of what the task does.
doc_md
type: string
optional
Path to a file containing a detailed description. The file must be in Markdown format.
source_gcp_project_id
type: string
mandatory
GCP project ID for the source BigQuery table.
source_bq_dataset
type: string
mandatory
BigQuery dataset for the source table.
source_bq_table
type: string
mandatory
Name of the source table.
gcp_project_id
type: string
optional
GCP project ID for the destination BigQuery table.
Overrides default_gcp_project_id.
bq_dataset
type: string
optional
Name of the BigQuery destination dataset.
Overrides default_bq_dataset.
destination_bq_table
type: string
mandatory
Name of the BigQuery destination table.
destination_bq_table_date_suffix
type: boolean
optional
Enables/disables the date suffix to allow date partitioning in BigQuery.
Default value: false
destination_bq_table_date_suffix_format
type: string
mandatory if destination_bq_table_date_suffix is set to true
Date format for the table suffix.
As of now, the only possible value is "%Y%m%d".
criticality
type: string
optional
Task criticality, which defines the impact of an error on this task on the data operation's global status and behavior.
Criticality can be: critical, warning, transparent or break.
See the "criticality" description in the SQL task section above for more details.
Default value: "critical"
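As an illustration (project, dataset and table names are placeholders), a table copy task with a date-suffixed destination could be declared as follows:

```json
{
  "task_type": "copy_gbq_table",
  "id": "copy_daily_sales_to_archive",
  "short_description": "Copy the daily sales table to the archive dataset",
  "source_gcp_project_id": "my-gcp-project",
  "source_bq_dataset": "my_dataset",
  "source_bq_table": "daily_sales",
  "gcp_project_id": "my-gcp-project",
  "bq_dataset": "my_archive_dataset",
  "destination_bq_table": "daily_sales_archive",
  "destination_bq_table_date_suffix": true,
  "destination_bq_table_date_suffix_format": "%Y%m%d"
}
```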
Table Deletion task parameters
task_type
type: string
mandatory
In this case, the value has to be set to "delete_gbq_table".
id
type: string
mandatory
ID of the task. It must be unique within the data operation.
short_description
type: string
optional
Short description of what the task does.
doc_md
type: string
optional
Path to a file containing a detailed description. The file must be in Markdown format.
bq_dataset
type: string
mandatory
BigQuery dataset for the table.
bq_table
type: string
mandatory
Name of the table. Note that you can use date templates such as FD_DATE. Example: my_table${{FD_DATE}} will be rendered with the TTT logical date, e.g. my_table$20240223.
default_bq_data_location
type: string
optional
The default BigQuery data location.
Default value: EU
criticality
type: string
optional
Task criticality, which defines the impact of an error on this task on the data operation's global status and behavior.
Criticality can be: critical, warning, transparent or break.
See the "criticality" description in the SQL task section above for more details.
Default value: "critical"
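As an illustration (dataset and table names are placeholders), a table deletion task using the FD_DATE template could be declared as follows:

```json
{
  "task_type": "delete_gbq_table",
  "id": "delete_temporary_sales_table",
  "short_description": "Clean up the temporary daily sales table",
  "bq_dataset": "my_dataset",
  "bq_table": "tmp_daily_sales${{FD_DATE}}"
}
```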