Storage to Tables configuration file

This is the description of the configuration file of a Storage to Tables data operation.


The configuration file is in JSON format. It contains the following sections:

  • Global parameters: General information about the data operation.

  • Source parameters: Information related to the data source provider.

  • Destination parameters: Information about input file templates and destination tables. The "destinations" section will refer to DDL files, which contain the schema of the destination tables.

Example

Here is an example of an STT configuration file for a GCS to BigQuery transfer:

{
  "$schema": "http://jsonschema.tailer.ai/schema/storage-to-tables-veditor",
  "configuration_type": "storage-to-tables",
  "configuration_id": "Load_sales_files_from_it",
  "version": "2",
  "environment": "DEV",
  "account": "000099",
  "activated": true,
  "archived": false,
  "max_active_runs" : 5,
  "short_description": "This Job load sales files into the Persistent Staging Area",
  "doc_md": "Load_sales_files_from_it.md",
  "source": {
    "type": "gcs",
    "gcp_project_id": "dlk_demo",
    "gcs_source_bucket": "mirror-fd-io-exc-demo-wbd--n-in",
    "gcs_source_prefix": "testjul",
    "gcs_archive_prefix": "archive",
    "gcp_credentials_secret": {
      "cipher_aes": "223xxx",
      "tag": "8ddxxx",
      "ciphertext": "4c7xxx",
      "enc_session_key": "830xxx"
    }
  },
  "destinations": [
    {
      "type": "bigquery",
      "gcp_project_id": "my-project",
      "gbq_dataset": "dlk_demo_wbd_psa",
      "source_format": "CSV",
      "create_disposition": "CREATE_IF_NEEDED",
      "write_disposition": "WRITE_TRUNCATE",
      "bq_load_job_ignore_unknown_values": true,
      "skip_leading_rows": 1,
      "field_delimiter": "|",
      "add_tailer_metadata": true,
      "gcp_credentials_secret": {
        "cipher_aes": "223xxx",
        "tag": "8ddxxx",
        "ciphertext": "4c7xxx",
        "enc_session_key": "830xxx"
      },
      "tables": [
        {
          "table_name": "sales_details_test",
          "short_description": "Daily sales with tickets and tickets lines",
          "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-sales_details.csv",
          "ddl_mode": "file",
          "ddl_file": "ddl/sales_details.json",
          "doc_md": "ddl/sales_details.md",
          "add_tailer_metadata": true,
          "skip_leading_rows": 0,
          "write_disposition": "WRITE_APPEND"
        },
        {
          "table_name": "stores",
          "short_description": "Full Stores referential",
          "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-stores.csv",
          "ddl_mode": "file",
          "ddl_file": "ddl/stores.json",
          "doc_md": "ddl/stores.md",
          "add_tailer_metadata": false
        }
      ]
    }
  ]
}

Global parameters

General information about the data operation.

Parameter
Description

$schema

type: string

optional

The URL of the JSON schema describing the properties that your configuration must verify. Most code editors can use it to validate your configuration, display help boxes and highlight issues.

configuration_type

type: string

mandatory

Type of data operation. For an STT data operation, the value is always "storage-to-tables".

configuration_id

type: string

mandatory

ID of the data operation.

You can pick any name you want, but it has to be unique for this data operation type.

Note that in case of conflict, the newly deployed data operation will overwrite the previous one. To guarantee its uniqueness, the best practice is to name your data operation by concatenating:

  • your account ID,

  • the source bucket name,

  • and the source directory name.
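For instance, with the values used in the example above (account "000099", source bucket "mirror-fd-io-exc-demo-wbd--n-in", source prefix "testjul"), that convention would give an ID such as:

"configuration_id": "000099_mirror-fd-io-exc-demo-wbd--n-in_testjul"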

version

type: string

mandatory

Use only version 2; version 1 is deprecated.

environment

type: string

mandatory

Deployment context. Values: PROD, PREPROD, STAGING, DEV.

account

type: string

mandatory

Your account ID is a 6-digit number assigned to you by your Tailer Platform administrator.

activated

type: boolean

optional

Flag used to enable/disable the execution of the data operation. If not specified, the default value will be "true".

archived

type: boolean

optional

Flag used to enable/disable the visibility of the data operation's configuration and runs in Tailer Studio. If not specified, the default value will be "false".

max_active_runs

type: integer

optional

This parameter limits the number of concurrent runs for this data operation.

If not set, the default value is 50.

short_description

type: string

optional

Short description of the context of the configuration.

doc_md

type: string

optional

Path to a file containing a detailed description. The file must be in Markdown format.

Source parameters (GCS)

The source section contains all the information related to the data source provider.

"source": {
  "type": "gcs",
  "gcp_project_id": "my-project",
  "gcs_source_bucket": "mirror-fd-io-exc-demo-wbd--n-in",
  "gcs_source_prefix": "testjul",
  "gcs_archive_prefix": "archive",
  "gcp_credentials_secret": {
    "cipher_aes": "223xxx",
    "tag": "8ddxxx",
    "ciphertext": "4c7xxx",
    "enc_session_key": "830xxx"
  }
}
Parameter
Description

type

type: string

mandatory

Source type.

The only supported source type for now is "gcs".

gcp_project_id

type: string

mandatory

The Google Cloud Platform project where the data operation and its associated cloud functions will be deployed.

If not set, the user will be prompted to choose a project.

gcs_source_bucket

type: string

mandatory

Name of the source bucket.

gcs_source_prefix

type: string

mandatory

Path where the files will be found, e.g. "some/sub/dir".

gcs_archive_prefix

type: string optional

Path where the source files will be archived.

If present and populated, the STT data operation will archive the source files in the location specified, in the GCS source bucket.

If not present or empty, there will be no archiving.

gcp_credentials_secret

type: dict

mandatory

Encrypted credentials needed to read/move data from the source bucket.

You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to the Encrypt your credentials page.

Destination parameters (BigQuery)

The destinations section contains all the information related to the data destinations.

The destinations parameter is an array of maps. Each map specifies a destination type and a list of "tables" objects as the actual destinations.

Example:

"destinations": [
{
  "type": "bigquery",
  "gcp_project_id": "my-project",
  "gbq_dataset": "dlk_demo_wbd_psa",
  "source_format": "CSV",
  "create_disposition": "CREATE_IF_NEEDED",
  "write_disposition": "WRITE_TRUNCATE",
  "skip_leading_rows": 1,
  "field_delimiter": "|",
  "add_tailer_metadata": true,
  "gcp_credentials_secret": {
    "cipher_aes": "223008256918c292ff3ec1axxx",
    "tag": "8dd3db4c71bfb963d475d1bbd1xxx",
    "ciphertext": "4c74df268d5a7541f8264c2e7a282fxxx",
    "enc_session_key": "830606cbc4f9401a29c7977d364398xxx"
    },
      "tables": [
      {
        "table_name": "sales_details_test",
        "short_description": "Daily detailed Sales with tickets and tickets lines",
        "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-ORS-ventes.csv",
        "ddl_mode": "file",
        "ddl_file": "ddl/sales_details.json",
        "doc_md": "ddl/sales_details.md",
        "add_tailer_metadata": true,
        "skip_leading_rows": 0,
        "write_disposition": "WRITE_APPEND"
      },
      {
        "table_name": "stores",
        "short_description": "Full Stores referential",
        "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-ORS-magasins.csv",
        "ddl_mode": "file",
        "ddl_file": "ddl/stores.json",
        "doc_md": "ddl/stores.md",
        "add_tailer_metadata": false
      }
    ]
  }
]

Global destination parameters

Parameter
Description

type

type: string

mandatory

Type of destination.

The only supported destination type for now is "bigquery".

gcp_project_id

type: string

optional

Default GCP Project ID.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

gbq_dataset

type: string

optional

Default BigQuery Dataset.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

gcp_credentials_secret

type: object

optional

Encrypted credentials needed to interact with Storage and BigQuery.

You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to the Encrypt your credentials page.

source_format

type: string

optional

Default source format for input files.

Possible values (case sensitive):

  • "CSV" (default)

  • "JSON"

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

create_disposition

type: string

optional

Specifies behavior for creating tables (see the Google BigQuery documentation).

Possible values:

  • "CREATE_IF_NEEDED" (default): If the table does not exist, BigQuery creates the table.

  • "CREATE_NEVER": The table must already exist.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

write_disposition

type: string

optional

Action that occurs if the destination table already exists (see the Google BigQuery documentation).

Possible values:

  • "WRITE_TRUNCATE" (default): The run will write table data from the beginning. If the table already contained lines, they will all be deleted and replaced by the new lines. This option is used most of the time for daily runs to avoid duplicates.

  • "WRITE_APPEND": The run will append new lines to the table. When using this option, make sure not to run the data operation several times.

  • "WRITE_EMPTY": This option only allows adding data to an empty table. If the table already contains data, it returns an error. It is hardly ever used, as data operations are usually run periodically, so the table will always contain data after the first run.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

skip_leading_rows

type: integer

optional

Number of rows to skip when reading data, CSV only.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 1

field_delimiter

type: string

optional

Separator for fields in a CSV file, e.g. ";".

Note: For Tab separator, set to "\t". This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value:

quote_character

type: string

optional

Character used to quote data sections, CSV only (see the Google BigQuery documentation).

Note: For quote and double quotes, set to "'" and """ respectively.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: ""

null_marker

type: string

optional

Represents a null value, CSV only (see the Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: ""

bq_load_job_ignore_unknown_values

type: boolean

optional

Ignore extra values not represented in the table schema (see the Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false

bq_load_job_max_bad_records

type: integer

optional

Number of invalid rows to ignore (see the Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 0

bq_load_job_schema_update_options

type: array

optional

Specifies updates to the destination table schema to allow as a side effect of the load job (see the Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: []

bq_load_job_allow_quoted_newlines

type: boolean

optional

Allows quoted data containing newline characters, CSV only (see the Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false

bq_load_job_allow_jagged_rows

type: boolean

optional

Allows missing trailing optional columns, CSV only (see the Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false

add_tailer_metadata

type: boolean

optional

[NEW] Enables the automatic metadata feature, which adds columns related to the input source during the ingestion process.

The added columns are:

  • tlr_ingestion_timestamp_utc (TIMESTAMP)

  • tlr_input_file_source_type (STRING)

  • tlr_input_file_name (STRING)

  • tlr_input_file_full_resource_name (STRING)

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

Default value: false
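To show how these global parameters combine with table-level overrides, here is a minimal, hypothetical destination block (project, dataset, table and file names are placeholders) that loads JSON files and overrides write_disposition for one table:

"destinations": [
  {
    "type": "bigquery",
    "gcp_project_id": "my-project",
    "gbq_dataset": "my_dataset",
    "source_format": "JSON",
    "create_disposition": "CREATE_IF_NEEDED",
    "write_disposition": "WRITE_TRUNCATE",
    "bq_load_job_ignore_unknown_values": true,
    "add_tailer_metadata": true,
    "tables": [
      {
        "table_name": "events",
        "filename_template": "events_{{FD_DATE}}_{{FD_TIME}}.json",
        "ddl_mode": "file",
        "ddl_file": "ddl/events.json"
      },
      {
        "table_name": "events_history",
        "filename_template": "events_history_{{FD_DATE}}.json",
        "ddl_mode": "file",
        "ddl_file": "ddl/events_history.json",
        "write_disposition": "WRITE_APPEND"
      }
    ]
  }
]

In this sketch, the second table inherits every global setting except write_disposition, which it overrides with "WRITE_APPEND".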

Table sub-object parameters

The "table" object contains the definition of expected input files and their BigQuery target.

Parameter

Description

table_name

type: string

mandatory

Name of the destination BigQuery table.

short_description

type: string

optional

Short description of the destination BigQuery table.

filename_template

type: string

mandatory

Template for the files to be processed. The following placeholders are currently supported:

  • "FD_DATE" looks for an 8-digit date (e.g. "20191015").

  • "FD_DATE_YEAR_4" looks for 4-digit year (e.g "2021").

  • "FD_DATE_YEAR_2" looks for 2-digit year (e.g "21").

  • "FD_DATE_MONTH" looks for 2-digit month (e.g "05").

  • "FD_DATE_DAY" looks for 2-digit day (e.g "12").

  • "FD_TIME" looks for a 6-digit time (e.g. "124213").

  • "FD_BLOB_n", where "n" is a non-zero positive integer, looks for a string of characters of "n" length.

  • FD_TABLE_NAME: This is a special template used when you have to process a large number of different files sharing the same destination schema. This template has to be used in conjunction with the table_name parameter.

Information:

  • if "FD_DATE" is specified, it will have priority upon "FD_DATE_YEAR_X".

  • if "FD_DATE_YEAR_4" or "FD_DATE_YEAR_2" is specified, the final date will be concatenated with "FD_DATE_MONTH" and "FD_DATE_DAY".

  • if "FD_DATE_YEAR_2" is specified, it will be prefixed by "20".

  • if "FD_DATE_YEAR_4" or "FD_DATE_YEAR_2" is specified only "FD_DATE_MONTH" and "FD_DATE_DAY" will be set to "01".

Example 1

This template:

"stores_{{FD_DATE}}{{FD_TIME}}.txt"

will allow you to process this type of files:

"stores_20201116_124213.txt"

Example 2

This template:

"{{FD_DATE}}{{FD_BLOB_5}}fixedvalue{{FD_BLOB_11}}.gz"

will allow you to process this type of files:

"20201116_12397_fixedvalue_12312378934.gz"

Example 3

If table_name is set to: "table_{{FD_TABLE_NAME}}"

and filename_template to: "{{FD_DATE}}_{{FD_TIME}}fixedvalue{{FD_TABLE_NAME}}.csv"

A file named "20201116_124523_fixedvalue_stores.csv" will be loaded into a table named: "table_stores_20191205"

A file named "20190212_063412_fixedvalue_visits.csv" will be loaded into a table named: "table_visits_20190212"

ddl_mode

type: string

optional

This parameter allows you to specify how the schema of the table will be obtained.

Possible values:

  • "file": Legacy mode. The table schema is described in the DDL file specified in the ddl_file parameter.

  • " file_template": The table schema is described in a DDL file provided in the source directory together with the source file. It must have the same filename as the source file, with the ".ddl.json" suffix.

  • "header" (CSV file only): The columns of the CSV file first line are automatically used as columns for the database table. All the columns are given the STRING type. No DDL file needs to be provided.

Default value: file
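For example, a table whose schema should simply be taken from the CSV header row could be declared without any DDL file (table and file names below are illustrative):

{
  "table_name": "stores",
  "filename_template": "stores_{{FD_DATE}}_{{FD_TIME}}.csv",
  "ddl_mode": "header"
}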

ddl_file

type: string

mandatory if ddl_mode is set to "file"

Path to the DDL file where the destination schema is described.

doc_md

type: string

optional

Path to the Markdown file containing detailed information about the destination table.

add_tailer_metadata

type: boolean

optional

[NEW] Enables the automatic metadata feature, which adds columns related to the input source during the ingestion process.

The added columns are:

  • tlr_ingestion_timestamp_utc (TIMESTAMP)

  • tlr_input_file_source_type (STRING)

  • tlr_input_file_name (STRING)

  • tlr_input_file_full_resource_name (STRING)

Default value: false
