Storage to Tables configuration file
This page describes the configuration file of a Storage to Tables (STT) data operation.
The configuration file is in JSON format. It contains the following sections:
Global parameters: General information about the data operation.
Source parameters: Information related to the data source provider.
Destination parameters: Information about input file templates and destination tables. The "destinations" section will refer to DDL files, which contain the schema of the destination tables.
👁️🗨️ Example
Here is an example of an STT configuration file for a GCS to BigQuery transfer.
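The sketch below gives an idea of the overall structure. All identifiers (account, project, bucket, dataset, table and file names) are placeholders, the name of the source block and the date/time placeholder syntax used in filename_template are assumptions, and the encrypted gcp_credentials_secret payload is elided. Each parameter is detailed in the sections that follow.

```json
{
  "configuration_type": "storage-to-tables",
  "configuration_id": "000099_load_sales_gcs_to_gbq",
  "version": "2",
  "environment": "DEV",
  "account": "000099",
  "activated": true,
  "archived": false,
  "short_description": "Load daily sales files from GCS into BigQuery",
  "source": {
    "type": "gcs",
    "gcp_project_id": "my-gcp-project",
    "gcs_source_bucket": "my-source-bucket",
    "gcs_source_prefix": "input/sales",
    "gcs_archive_prefix": "archive/sales",
    "gcp_credentials_secret": {}
  },
  "destinations": [
    {
      "type": "bigquery",
      "gcp_project_id": "my-gcp-project",
      "gbq_dataset": "my_dataset",
      "source_format": "CSV",
      "create_disposition": "CREATE_IF_NEEDED",
      "write_disposition": "WRITE_TRUNCATE",
      "skip_leading_rows": 1,
      "field_delimiter": ";",
      "tables": [
        {
          "table_name": "sales_details",
          "short_description": "Daily sales details",
          "filename_template": "sales_{{FD_DATE}}_{{FD_TIME}}.csv",
          "ddl_mode": "file",
          "ddl_file": "ddl/sales_details.json"
        }
      ]
    }
  ]
}
```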
🌐 Global parameters
General information about the data operation
Parameter | Description |
---|---|
$schema type: string optional | The URL of the JSON schema that describes the properties your configuration must satisfy. Most code editors can use it to validate your configuration, display help boxes and highlight issues. |
configuration_type type: string mandatory | Type of data operation. For an STT data operation, the value is always "storage-to-tables". |
configuration_id type: string mandatory | ID of the data operation. You can pick any name you want, but it has to be unique for this data operation type. Note that in case of conflict, the newly deployed data operation will overwrite the previous one. To guarantee its uniqueness, the best practice is to build the ID by concatenating several identifying elements of the data operation. |
version type: string mandatory | Use version 2 only; version 1 is deprecated. |
environment type: string mandatory | Deployment context. Values: PROD, PREPROD, STAGING, DEV. |
account type: string mandatory | Your account ID is a 6-digit number assigned to you by your Tailer Platform administrator. |
activated type: boolean optional | Flag used to enable/disable the execution of the data operation. If not specified, the default value will be "true". |
archived type: boolean optional | Flag used to enable/disable the visibility of the data operation's configuration and runs in Tailer Studio. If not specified, the default value will be "false". |
max_active_runs type: integer optional | This parameter limits the number of concurrent runs for this data operation. If not set, the default value is 1. |
short_description type: string optional | Short description of the context of the configuration. |
doc_md type: string optional | Path to a file containing a detailed description. The file must be in Markdown format. |
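As an illustration, the global parameters above could be filled in as follows; only the global part of the configuration is shown, and every value (IDs, description, file path) is a placeholder.

```json
{
  "configuration_type": "storage-to-tables",
  "configuration_id": "000099_load_sales_gcs_to_gbq",
  "version": "2",
  "environment": "PROD",
  "account": "000099",
  "activated": true,
  "archived": false,
  "max_active_runs": 1,
  "short_description": "Load daily sales files from GCS into BigQuery",
  "doc_md": "doc/stt_sales.md"
}
```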
⬇️ Source parameters (GCS)
The source section contains all the information related to the data source provider.
Parameter | Description |
---|---|
type type: string mandatory | Source type. The only supported source type for now is "gcs". |
gcp_project_id type: string mandatory | ID of the Google Cloud Platform project where the data operation and its associated cloud functions will be deployed. If not set, the user will be prompted to choose a project. |
gcs_source_bucket type: string mandatory | Name of the GCS bucket where the source files are deposited. |
gcs_source_prefix type: string mandatory | Path prefix within the source bucket where the input files are expected. |
gcs_archive_prefix type: string optional | Path where the source files will be archived. If present and populated, the STT data operation will archive the source files in the location specified, in the GCS source bucket. If not present or empty, there will be no archiving. |
gcp_credentials_secret type: dict mandatory | Encrypted credentials needed to read/move data from the source bucket. You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to this page. |
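A sketch of a GCS source block combining these parameters is shown below; the bucket, prefix and project names are placeholders and the encrypted gcp_credentials_secret content is elided. With gcs_archive_prefix set as shown, processed files would be archived under "archive/sales" within the same source bucket.

```json
{
  "type": "gcs",
  "gcp_project_id": "my-gcp-project",
  "gcs_source_bucket": "my-source-bucket",
  "gcs_source_prefix": "input/sales",
  "gcs_archive_prefix": "archive/sales",
  "gcp_credentials_secret": {}
}
```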
⬆️ Destination parameters (BigQuery)
The destination section contains all the information related to the data destinations.
The destinations parameter is an array of maps. Each map specifies one destination type, destination-level default parameters, and a "tables" list describing the actual destination tables.
Example:
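For instance, a destinations array holding a single BigQuery destination with two tables could look like the following sketch. Project, dataset, table and file names are placeholders, and the placeholder syntax shown in filename_template is illustrative.

```json
{
  "destinations": [
    {
      "type": "bigquery",
      "gcp_project_id": "my-gcp-project",
      "gbq_dataset": "my_dataset",
      "source_format": "CSV",
      "tables": [
        {
          "table_name": "stores",
          "filename_template": "stores_{{FD_DATE}}_{{FD_TIME}}.txt",
          "ddl_file": "ddl/stores.json"
        },
        {
          "table_name": "visits",
          "filename_template": "visits_{{FD_DATE}}_{{FD_TIME}}.txt",
          "ddl_file": "ddl/visits.json"
        }
      ]
    }
  ]
}
```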
Global destination parameters
Parameter | Description |
---|---|
type type: string mandatory | Type of destination. The only supported destination type for now is "bigquery". |
gcp_project_id type: string optional | Default GCP Project ID. This parameter can be set for each table sub-object, and will be overridden by that value if it is different. |
gbq_dataset type: string optional | Default BigQuery Dataset. This parameter can be set for each table sub-object, and will be overridden by that value if it is different. |
gcp_credentials_secret type: object optional | Encrypted credentials needed to interact with Storage and BigQuery. You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to this page. |
source_format type: string optional | Default source format for input files. Possible values (case sensitive) are the source formats supported by BigQuery load jobs, such as "CSV" (see Google BigQuery documentation). This parameter can be set for each table sub-object, and will be overridden by that value if it is different. |
create_disposition type: string optional | Specifies behavior for creating tables (see Google BigQuery documentation). Possible values: "CREATE_IF_NEEDED", "CREATE_NEVER". This parameter can be set for each table sub-object, and will be overridden by that value if it is different. |
write_disposition type: string optional | Action that occurs if the destination table already exists (see Google BigQuery documentation). Possible values: "WRITE_TRUNCATE", "WRITE_APPEND", "WRITE_EMPTY". |
skip_leading_rows type: integer optional | Number of rows to skip when reading data, CSV only. This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 1 |
field_delimiter type: string optional | Separator for fields in a CSV file, e.g. ";". Note: For Tab separator, set to "\t". This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: |
quote_character type: string optional | Character used to quote data sections, CSV only (see Google BigQuery documentation). Note: for a single quote or a double quote, set to "'" or "\"" respectively. This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: "" |
null_marker type: string optional | Represents a null value, CSV only (see Google BigQuery documentation). This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: "" |
bq_load_job_ignore_unknown_values type: boolean optional | Ignore extra values not represented in the table schema (see Google BigQuery documentation). This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false |
bq_load_job_max_bad_records type: integer optional | Number of invalid rows to ignore (see Google BigQuery documentation). This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 0 |
bq_load_job_schema_update_options type: array optional | Specifies updates to the destination table schema to allow as a side effect of the load job (see Google BigQuery documentation). This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: [] |
bq_load_job_allow_quoted_newlines type: boolean optional | Allows quoted data containing newline characters, CSV only (see Google BigQuery documentation). This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false |
bq_load_job_allow_jagged_rows type: boolean optional | Allows missing trailing optional columns, CSV only (see Google BigQuery documentation). This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false |
add_tailer_metadata type: boolean optional | [NEW] Enables the automatic metadata feature, which adds columns related to the input source during the ingestion process. This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false |
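To illustrate how destination-level defaults interact with table-level overrides, the sketch below sets CSV defaults once and overrides the field delimiter for a single table; all names, paths and placeholder syntax are illustrative.

```json
{
  "destinations": [
    {
      "type": "bigquery",
      "gbq_dataset": "my_dataset",
      "source_format": "CSV",
      "field_delimiter": ";",
      "skip_leading_rows": 1,
      "add_tailer_metadata": true,
      "tables": [
        {
          "table_name": "sales",
          "filename_template": "sales_{{FD_DATE}}.csv",
          "ddl_file": "ddl/sales.json"
        },
        {
          "table_name": "stock",
          "filename_template": "stock_{{FD_DATE}}.csv",
          "ddl_file": "ddl/stock.json",
          "field_delimiter": "\t"
        }
      ]
    }
  ]
}
```

Here the "stock" table would be loaded with a tab delimiter, while "sales" keeps the destination-level ";".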
Table sub-object parameters
The "table" object contains the definition of expected input files and their BigQuery target.
Parameter | Description |
---|---|
table_name type: string mandatory | Name of the destination BigQuery table. |
short_description type: string optional | Short description of the destination BigQuery table. |
filename_template type: string mandatory | Template for the files to be processed, built from fixed text and the supported placeholders. Example 1: a template combining fixed text with date and time placeholders allows you to process files such as "stores_20201116_124213.txt". Example 2: a similar template allows you to process files such as "20201116_12397_fixedvalue_12312378934.gz". Example 3: placeholders matched in filename_template can be reused in table_name; a file named "20201116_124523_fixedvalue_stores.csv" will then be loaded into a table named "table_stores_20191205", and a file named "20190212_063412_fixedvalue_visits.csv" into a table named "table_visits_20190212". |
ddl_mode type: string optional | This parameter allows you to specify how the schema of the destination table will be obtained. Default value: file (the schema is read from the DDL file specified in ddl_file). |
ddl_file type: string mandatory if ddl_mode is set to "file" | Path to the DDL file where the destination schema is described. |
doc_md type: string optional | Path to the Markdown file containing detailed information about the destination table. |
add_tailer_metadata type: boolean optional | [NEW] Enables the automatic metadata feature, which adds columns related to the input source during the ingestion process. |
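Putting these parameters together, a table sub-object and the DDL file it points to might look as follows. The placeholder syntax in filename_template, the file paths and field names, and the layout of the DDL file (sketched here as a "schema" array of BigQuery-style field definitions) are assumptions for illustration only.

```json
{
  "table_name": "stores",
  "short_description": "Store referential",
  "filename_template": "stores_{{FD_DATE}}_{{FD_TIME}}.txt",
  "ddl_mode": "file",
  "ddl_file": "ddl/stores.json",
  "doc_md": "doc/stores.md",
  "add_tailer_metadata": true
}
```

A possible content for "ddl/stores.json":

```json
{
  "schema": [
    { "name": "store_id", "type": "STRING", "description": "Store identifier" },
    { "name": "store_name", "type": "STRING", "description": "Store name" },
    { "name": "opening_date", "type": "DATE", "description": "Opening date" }
  ]
}
```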