Storage to Tables configuration file

This is the description of the configuration file of a Storage to Tables data operation.

The configuration file is in JSON format. It contains the following sections:

  • Global parameters: General information about the data operation.

  • Source parameters: Information related to the data source provider.

  • Destination parameters: Information about input file templates and destination tables. The "destinations" section will refer to DDL files, which contain the schema of the destination tables.

👁️‍🗨️ Example

Here is an example of STT configuration file for a GCS to BigQuery transfer:

{
  "$schema": "http://jsonschema.tailer.ai/schema/storage-to-tables-veditor",
  "configuration_type": "storage-to-tables",
  "configuration_id": "Load_sales_files_from_it",
  "version": "2",
  "environment": "DEV",
  "account": "000099",
  "activated": true,
  "archived": false,
  "max_active_runs" : 5,
  "short_description": "This Job load sales files into the Persistent Staging Area",
  "doc_md": "Load_sales_files_from_it.md",
  "source": {
    "type": "gcs",
    "gcp_project_id": "dlk_demo",
    "gcs_source_bucket": "mirror-fd-io-exc-demo-wbd--n-in",
    "gcs_source_prefix": "testjul",
    "gcs_archive_prefix": "archive",
    "gcp_credentials_secret": {
      "cipher_aes": "223xxx",
      "tag": "8ddxxx",
      "ciphertext": "4c7xxx",
      "enc_session_key": "830xxx"
    }
  },
  "destinations": [
    {
      "type": "bigquery",
      "gcp_project_id": "my-project",
      "gbq_dataset": "dlk_demo_wbd_psa",
      "source_format": "CSV",
      "create_disposition": "CREATE_IF_NEEDED",
      "write_disposition": "WRITE_TRUNCATE",
      "bq_load_job_ignore_unknown_values": true,
      "skip_leading_rows": 1,
      "field_delimiter": "|",
      "add_tailer_metadata": true,
      "gcp_credentials_secret": {
        "cipher_aes": "223xxx",
        "tag": "8ddxxx",
        "ciphertext": "4c7xxx",
        "enc_session_key": "830xxx"
      },
      "tables": [
        {
          "table_name": "sales_details_test",
          "short_description": "Daily sales with tickets and tickets lines",
          "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-sales_details.csv",
          "ddl_mode": "file",
          "ddl_file": "ddl/sales_details.json",
          "doc_md": "ddl/sales_details.md",
          "add_tailer_metadata": true,
          "skip_leading_rows": 0,
          "write_disposition": "WRITE_APPEND"
        },
        {
          "table_name": "stores",
          "short_description": "Full Stores referential",
          "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-stores.csv",
          "ddl_mode": "file",
          "ddl_file": "ddl/stores.json",
          "doc_md": "ddl/stores.md",
          "add_tailer_metadata": false
        }
      ]
    }
  ]
}

🌐 Global parameters

General information about the data operation

Parameter
Description

$schema

type: string

optional

The url of the json-schema that contains the properties that your configuration must verify. Most Code Editor can use that to validate your configuration, display help boxes and enlighten issues.

configuration_type

type: string

mandatory

Type of data operation. For an STT data operation, the value is always "storage-to-tables".

configuration_id

type: string

mandatory

ID of the data operation.

You can pick any name you want, but is has to be unique for this data operation type.

Note that in case of conflict, the newly deployed data operation will overwrite the previous one. To guarantee its uniqueness, the best practice is to name your data operation by concatenating:

  • your account ID,

  • the source bucket name,

  • and the source directory name.

version type: string mandatory

Use only version 2, version 1 is depreciated.

environment

type: string

mandatory

Deployment context. Values: PROD, PREPROD, STAGING, DEV.

account

type: string

mandatory

Your account ID is a 6-digit number assigned to you by your Tailer Platform administrator.

activated

type: boolean

optional

Flag used to enable/disable the execution of the data operation. If not specified, the default value will be "true".

archived

type: boolean

optional

Flag used to enable/disable the visibility of the data operation's configuration and runs in Tailer Studio. If not specified, the default value will be "false".

max_active_runs

type: integer

optional

This parameter limits the number of concurrent runs for this data operation.

If not set, the default value is 50.

short_description

type: string

optional

Short description of the context of the configuration.

doc_md

type: string

optional

Path to a file containing a detailed description. The file must be in Markdown format.

⬇️ Source parameters (GCS)

The destination section contains all information related to the data source provider.

"source": {
  "type": "gcs",
  "gcp_project_id": "my-project",
  "gcs_source_bucket": "mirror-fd-io-exc-demo-wbd--n-in",
  "gcs_source_prefix": "testjul",
  "gcs_archive_prefix": "archive",
  "gcp_credentials_secret": {
    "cipher_aes": "223xxx",
    "tag": "8ddxxx",
    "ciphertext": "4c7xxx",
    "enc_session_key": "830xxx"
  }
}
Parameter
Description

type

type: string

mandatory

Source type.

The only supported source type for now is "gcs".

gcp_project_id

type: string

mandatory

Specify the Google Cloud Platform project where to deploy the data operation and its associated cloud functions.

If not set, the user will be prompted to choose a project.

gcs_source_bucket

type: string

mandatory

Name of the source bucket. ⚠️ See restrictions below

gcs_source_prefix

type: string

mandatory

Path where the files will be found, e.g. "some/sub/dir". ⚠️ See restrictions below

gcs_archive_prefix

type: string optional

Path where the source files will be archived.

If present and populated, the STT data operation will archive the source files in the location specified, in the GCS source bucket.

If not present or empty, there will be no archiving.

gcp_credentials_secret

type: dict

mandatory

Encrypted credentials needed to read/move data from the source bucket.

You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to this page.

⬆️ Destination parameters (BigQuery)

The destination section contains all the information related to the data destinations.

The destinations parameter is an array containing maps. Each map can contain a type of destination and many actual "tables" as ultimate destination.

Example:

"destinations": [
{
  "type": "bigquery",
  "gcp_project_id": "my-project",
  "gbq_dataset": "dlk_demo_wbd_psa",
  "source_format": "CSV",
  "create_disposition": "CREATE_IF_NEEDED",
  "write_disposition": "WRITE_TRUNCATE",
  "skip_leading_rows": 1,
  "field_delimiter": "|",
  "add_tailer_metadata": true,
  "gcp_credentials_secret": {
    "cipher_aes": "223008256918c292ff3ec1axxx",
    "tag": "8dd3db4c71bfb963d475d1bbd1xxx",
    "ciphertext": "4c74df268d5a7541f8264c2e7a282fxxx",
    "enc_session_key": "830606cbc4f9401a29c7977d364398xxx"
    },
      "tables": [
      {
        "table_name": "sales_details_test",
        "short_description": "Daily detailed Sales with tickets and tickets lines",
        "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-ORS-ventes.csv",
        "ddl_mode": "file",
        "ddl_file": "ddl/sales_details.json",
        "doc_md": "ddl/sales_details.md",
        "add_tailer_metadata": true,
        "skip_leading_rows": 0,
        "write_disposition": "WRITE_APPEND"
      },
      {
        "table_name": "stores",
        "short_description": "Full Stores referential",
        "filename_template": "input_{{FD_DATE}}-{{FD_TIME}}-ORS-magasins.csv",
        "ddl_mode": "file",
        "ddl_file": "ddl/stores.json",
        "doc_md": "ddl/stores.md",
        "add_tailer_metadata": false
      }
    ]
  }
]

Global destination parameters

Parameter
Description

type

type: string

mandatory

Type of destination.

The only supported destination type for now is "bigquery".

gcp_project_id

type: string

optional

Default GCP Project ID.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

gbq_dataset

type: string

optional

Default BigQuery Dataset.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

gcp_credentials_secret

type: object

optional

Encrypted credentials needed to interact with Storage and BigQuery.

You should have generated credentials when setting up GCP. To learn how to encrypt them, refer to this page.

source_format

type: string

optional

Default source format for input files.

Possible values (case sensitive):

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

create_disposition

type: string

optional

Specifies behavior for creating tables (see Google BigQuery documentation).

Possible values:

  • "CREATE_IF_NEEDED" (default): If the table does not exist, BigQuery creates the table.

  • "CREATE_NEVER": The table must already exist.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

write_disposition

type: string

optional

Action that occurs if the destination table already exists (see Google BigQuery documentation).

Possible values:

  • "WRITE_TRUNCATE" (default): The run will write table data from the beginning. If the table already contained lines, they will all be deleted and replaced by the new lines. This option is used most of the time for daily runs to avoid duplicates.

  • "WRITE_APPEND": The run will append new lines to the table. When using this option, make sure not to run the data operation several times.

  • "WRITE_EMPTY": This option only allows adding data to an empty table. If the table already contains data, it returns an error. It is hardly ever used as data operations are usually run periodically, so they will always contain data after the first run. This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

skip_leading_rows

type: integer

optional

Number of rows to skip when reading data, CSV only.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 1

field_delimiter

type: string

optional

Separator for fields in a CSV file, e.g. ";".

Note: For Tab separator, set to "\t". This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value:

quote_character

type: string

optional

Character used to quote data sections, CSV only (see Google BigQuery documentation).

Note: For quote and double quotes, set to "'" and """ respectively.

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: ""

null_marker

type: string

optional

Represents a null value, CSV only (see Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: ""

bq_load_job_ignore_unknown_values

type: boolean

optional

Ignore extra values not represented in the table schema (see Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false

bq_load_job_max_bad_records

type: integer

optional

Number of invalid rows to ignore (see Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: 0

bq_load_job_schema_update_options

type: array

optional

Specifies updates to the destination table schema to allow as a side effect of the load job (see Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: []

bq_load_job_allow_quoted_newlines

type: boolean

optional

Allows quoted data containing newline characters, CSV only (see Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false

bq_load_job_allow_jagged_rows

type : boolean

optional

Allows missing trailing optional columns, CSV only (see Google BigQuery documentation).

This parameter can be set for each table sub-object, and will be overridden by that value if it is different. Default value: false

add_tailer_metadata

type : boolean

optional

[NEW] Allows automatic metadata feature that add specific columns during the ingestion process related to the input source.

The added columns are: tlr_ingestion_timestamp_utc (TIMESTAMP) tlr_input_file_source_type (STRING) tlr_input_file_name (STRING) tlr_input_file_full_resource_name (STRING)

This parameter can be set for each table sub-object, and will be overridden by that value if it is different.

Default value : false

Table sub-object parameters

The "table" object contains the definition of expected input files and their BigQuery target.

Parameter

Description

table_name

type: string

mandatory

Name of the destination BigQuery table.

short_description

type: string

optional

Short description of the destination BigQuery table.

filename_template

type: string

mandatory

Template for the files to be processed. The following placeholders are currently supported:

  • "FD_DATE" looks for an 8-digit date (e.g. "20191015").

  • "FD_DATE_YEAR_4" looks for 4-digit year (e.g "2021").

  • "FD_DATE_YEAR_2" looks for 2-digit year (e.g "21").

  • "FD_DATE_MONTH" looks for 2-digit month (e.g "05").

  • "FD_DATE_DAY" looks for 2-digit day (e.g "12").

  • "FD_TIME" looks for a 6-digit time (e.g. "124213").

  • "FD_BLOB_n", where "n" is a non-zero positive integer, looks for a string of characters of "n" length.

  • FD_TABLE_NAME: This is a special template used when you have to process a large number of different files sharing the same destination schema. This template has to be used in conjunction with the table_name parameter.

Information:

  • if "FD_DATE" is specified, it will have priority upon "FD_DATE_YEAR_X".

  • if "FD_DATE_YEAR_4" or "FD_DATE_YEAR_2" is specified, the final date will be concatenated with "FD_DATE_MONTH" and "FD_DATE_DAY".

  • if "FD_DATE_YEAR_2" is specified, it will be prefixed by "20".

  • if "FD_DATE_YEAR_4" or "FD_DATE_YEAR_2" is specified only "FD_DATE_MONTH" and "FD_DATE_DAY" will be set to "01".

Example 1

This template:

"stores_{{FD_DATE}}{{FD_TIME}}.txt"

will allow you to process this type of files:

"stores_20201116_124213.txt"

Example 2

This template:

"{{FD_DATE}}{{FD_BLOB_5}}fixedvalue{{FD_BLOB_11}}.gz"

will allow you to process this type of files:

"20201116_12397_fixedvalue_12312378934.gz"

Example 3

If table_name is set to: "table_{{FD_TABLE_NAME}}"

and filename_template to: "{{FD_DATE}}_{{FD_TIME}}fixedvalue{{FD_TABLE_NAME}}.csv"

A file named "20201116_124523_fixedvalue_stores.csv" will be loaded into a table named: "table_stores_20191205"

A file named "20190212_063412_fixedvalue_visits.csv" will be loaded into a table named: "table_visits_20190212"

ddl_mode

type: string

optional

This parameter allows you to specify how the schema of the table will be obtained.

Possible values:

  • "file": Legacy mode. The table schema is described in the DDL file specified in the ddl_file parameter.

  • " file_template": The table schema is described in a DDL file provided in the source directory together with the source file. It must have the same filename as the source file, with the ".ddl.json" suffix.

  • "header" (CSV file only): The columns of the CSV file first line are automatically used as columns for the database table. All the columns are given the STRING type. No DDL file needs to be provided.

  • "autodetect" (not recommended): Google’s default mode. The schema is automatically detected from the source file. This mode doesn’t work well with CSV files, but gives good results with structured formats such as JSON. (see Google BigQuery documentation).

Default value: file

ddl_file

type: string

mandatory if ddl_mode is set to "file"

Path to the DDL file where the destination schema is described.

doc_md type: string optional

Path to the Markdown file containing detailed information about the destination table.

add_tailer_metadata

type : boolean

optional

[NEW] Allows automatic metadata feature that add specific columns during the ingestion process related to the input source.

The added columns are: tlr_ingestion_timestamp_utc (TIMESTAMP) tlr_input_file_source_type (STRING) tlr_input_file_name (STRING) tlr_input_file_full_resource_name (STRING) Default value : false

Last updated