Storage to Tables configuration file
This is the description of the configuration file of a Storage to Tables data operation.
The configuration file is in JSON format. It contains the following sections:
Global parameters: General information about the data operation.
Source parameters: Information related to the data source provider.
Destination parameters: Information about input file templates and destination tables. The "destinations" section will refer to DDL files, which contain the schema of the destination tables.
👁️🗨️ Example
Here is an example of STT configuration file for a GCS to BigQuery transfer:
🌐 Global parameters
General information about the data operation
$schema
type: string
optional
The url of the json-schema that contains the properties that your configuration must verify. Most Code Editor can use that to validate your configuration, display help boxes and enlighten issues.
configuration_type
type: string
mandatory
Type of data operation. For an STT data operation, the value is always "storage-to-tables".
configuration_id
type: string
mandatory
ID of the data operation.
You can pick any name you want, but is has to be unique for this data operation type.
Note that in case of conflict, the newly deployed data operation will overwrite the previous one. To guarantee its uniqueness, the best practice is to name your data operation by concatenating:
your account ID,
the source bucket name,
and the source directory name.
version type: string mandatory
Use only version 2, version 1 is depreciated.
environment
type: string
mandatory
Deployment context. Values: PROD, PREPROD, STAGING, DEV.
account
type: string
mandatory
Your account ID is a 6-digit number assigned to you by your Tailer Platform administrator.
activated
type: boolean
optional
Flag used to enable/disable the execution of the data operation. If not specified, the default value will be "true".
archived
type: boolean
optional
Flag used to enable/disable the visibility of the data operation's configuration and runs in Tailer Studio. If not specified, the default value will be "false".
max_active_runs
type: integer
optional
This parameter limits the number of concurrent runs for this data operation.
If not set, the default value is 50.
short_description
type: string
optional
Short description of the context of the configuration.
doc_md
type: string
optional
Path to a file containing a detailed description. The file must be in Markdown format.
⬇️ Source parameters (GCS)
The destination section contains all information related to the data source provider.
type
type: string
mandatory
Source type.
The only supported source type for now is "gcs".
gcp_project_id
type: string
mandatory
Specify the Google Cloud Platform project where to deploy the data operation and its associated cloud functions.
If not set, the user will be prompted to choose a project.
gcs_source_bucket
type: string
mandatory
Name of the source bucket. ⚠️ See restrictions below
gcs_source_prefix
type: string
mandatory
Path where the files will be found, e.g. "some/sub/dir". ⚠️ See restrictions below
gcs_archive_prefix
type: string optional
Path where the source files will be archived.
If present and populated, the STT data operation will archive the source files in the location specified, in the GCS source bucket.
If not present or empty, there will be no archiving.
gcp_credentials_secret
type: dict
mandatory
Encrypted credentials needed to read/move data from the source bucket.
You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to this page.
⬆️ Destination parameters (BigQuery)
The destination section contains all the information related to the data destinations.
The destinations parameter is an array containing maps. Each map can contain a type of destination and many actual "tables" as ultimate destination.
Example:
Global destination parameters
type
type: string
mandatory
Type of destination.
The only supported destination type for now is "bigquery".
gcp_project_id
type: string
optional
Default GCP Project ID.
This parameter can be set for each table sub-object, and will be overridden by that value if it is different.
gbq_dataset
type: string
optional
Default BigQuery Dataset.
This parameter can be set for each table sub-object, and will be overridden by that value if it is different.
gcp_credentials_secret
type: object
optional
Encrypted credentials needed to interact with Storage and BigQuery.
You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to this page.
source_format
type: string
optional
Default source format for input files.
Possible values (case sensitive):
"CSV" (default)
"JSON"
"PARQUET" (see BigQuery doc for more info)
"AVRO" (see BigQuery doc for more info)
This parameter can be set for each table sub-object, and will be overridden by that value if it is different.
create_disposition
type: string
optional
Specifies behavior for creating tables (see Google BigQuery documentation).
Possible values:
"CREATE_IF_NEEDED" (default): If the table does not exist, BigQuery creates the table.
"CREATE_NEVER": The table must already exist.
This parameter can be set for each table sub-object, and will be overridden by that value if it is different.
write_disposition
type: string
optional
Action that occurs if the destination table already exists (see Google BigQuery documentation).
Possible values:
"WRITE_TRUNCATE" (default): The run will write table data from the beginning. If the table already contained lines, they will all be deleted and replaced by the new lines. This option is used most of the time for daily runs to avoid duplicates.
"WRITE_APPEND": The run will append new lines to the table. When using this option, make sure not to run the data operation several times.
"WRITE_EMPTY": This option only allows adding data to an empty table. If the table already contains data, it returns an error. It is hardly ever used as data operations are usually run periodically, so they will always contain data after the first run. This parameter can be set for each table sub-object, and will be overridden by that value if it is different.
skip_leading_rows
type: integer
optional
Number of rows to skip when reading data, CSV only.
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 1
field_delimiter
type: string
optional
Separator for fields in a CSV file, e.g. ";".
Note: For Tab separator, set to "\t". This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value:
quote_character
type: string
optional
Character used to quote data sections, CSV only (see Google BigQuery documentation).
Note: For quote and double quotes, set to "'" and """ respectively.
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: ""
null_marker
type: string
optional
Represents a null value, CSV only (see Google BigQuery documentation).
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: ""
bq_load_job_ignore_unknown_values
type: boolean
optional
Ignore extra values not represented in the table schema (see Google BigQuery documentation).
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false
bq_load_job_max_bad_records
type: integer
optional
Number of invalid rows to ignore (see Google BigQuery documentation).
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 0
bq_load_job_schema_update_options
type: array
optional
Specifies updates to the destination table schema to allow as a side effect of the load job (see Google BigQuery documentation).
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: []
bq_load_job_allow_quoted_newlines
type: boolean
optional
Allows quoted data containing newline characters, CSV only (see Google BigQuery documentation).
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false
bq_load_job_allow_jagged_rows
type : boolean
optional
Allows missing trailing optional columns, CSV only (see Google BigQuery documentation).
This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false
add_tailer_metadata
type : boolean
optional
[NEW] Allows automatic metadata feature that add specific columns during the ingestion process related to the input source.
The added columns are:
tlr_ingestion_timestamp_utc
(TIMESTAMP) tlr_input_file_source_type
(STRING)
tlr_input_file_name
(STRING) tlr_input_file_full_resource_name
(STRING)
This parameter can be set for each table sub-object, and will be overridden by that value if it is different.
Default value : false
Table sub-object parameters
The "table" object contains the definition of expected input files and their BigQuery target.
Parameter
Description
table_name
type: string
mandatory
Name of the destination BigQuery table.
short_description
type: string
optional
Short description of the destination BigQuery table.
filename_template
type: string
mandatory
Template for the files to be processed. The following placeholders are currently supported:
"FD_DATE" looks for an 8-digit date (e.g. "20191015").
"FD_DATE_YEAR_4" looks for 4-digit year (e.g "2021").
"FD_DATE_YEAR_2" looks for 2-digit year (e.g "21").
"FD_DATE_MONTH" looks for 2-digit month (e.g "05").
"FD_DATE_DAY" looks for 2-digit day (e.g "12").
"FD_TIME" looks for a 6-digit time (e.g. "124213").
"FD_BLOB_n", where "n" is a non-zero positive integer, looks for a string of characters of "n" length.
FD_TABLE_NAME: This is a special template used when you have to process a large number of different files sharing the same destination schema. This template has to be used in conjunction with the table_name parameter.
Information:
if "FD_DATE" is specified, it will have priority upon "FD_DATE_YEAR_X".
if "FD_DATE_YEAR_4" or "FD_DATE_YEAR_2" is specified, the final date will be concatenated with "FD_DATE_MONTH" and "FD_DATE_DAY".
if "FD_DATE_YEAR_2" is specified, it will be prefixed by "20".
if "FD_DATE_YEAR_4" or "FD_DATE_YEAR_2" is specified only "FD_DATE_MONTH" and "FD_DATE_DAY" will be set to "01".
Example 1
This template:
"stores_{{FD_DATE}}
{{FD_TIME}}.txt"
will allow you to process this type of files:
"stores_20201116_124213.txt"
Example 2
This template:
"{{FD_DATE}}
{{FD_BLOB_5}}
fixedvalue
{{FD_BLOB_11}}.gz"
will allow you to process this type of files:
"20201116_12397_fixedvalue_12312378934.gz"
Example 3
If table_name is set to: "table_{{FD_TABLE_NAME}}"
and filename_template to: "{{FD_DATE}}_{{FD_TIME}}
fixedvalue
{{FD_TABLE_NAME}}.csv"
A file named "20201116_124523_fixedvalue_stores.csv" will be loaded into a table named: "table_stores_20191205"
A file named "20190212_063412_fixedvalue_visits.csv" will be loaded into a table named: "table_visits_20190212"
ddl_mode
type: string
optional
This parameter allows you to specify how the schema of the table will be obtained.
Possible values:
"file": Legacy mode. The table schema is described in the DDL file specified in the ddl_file parameter.
" file_template": The table schema is described in a DDL file provided in the source directory together with the source file. It must have the same filename as the source file, with the ".ddl.json" suffix.
"header" (CSV file only): The columns of the CSV file first line are automatically used as columns for the database table. All the columns are given the STRING type. No DDL file needs to be provided.
"autodetect" (not recommended): Google’s default mode. The schema is automatically detected from the source file. This mode doesn’t work well with CSV files, but gives good results with structured formats such as JSON. (see Google BigQuery documentation).
Default value: file
ddl_file
type: string
mandatory if ddl_mode is set to "file"
Path to the DDL file where the destination schema is described.
doc_md type: string optional
Path to the Markdown file containing detailed information about the destination table.
add_tailer_metadata
type : boolean
optional
[NEW] Allows automatic metadata feature that add specific columns during the ingestion process related to the input source.
The added columns are:
tlr_ingestion_timestamp_utc
(TIMESTAMP) tlr_input_file_source_type
(STRING)
tlr_input_file_name
(STRING) tlr_input_file_full_resource_name
(STRING)
Default value : false
Last updated