Load files into BigQuery tables
The second data operation of this tutorial will consist in loading the contents of our files into BigQuery tables.
Last updated
The second data operation of this tutorial will consist in loading the contents of our files into BigQuery tables.
Last updated
Access your tailer-demo folder.
Inside, create a folder named 2-Load_files for this new step.
In this folder, create a JSON file named 000099-tailer-demo-load-files.json for your data operation.
Copy the following contents into your file:
Edit the following values:
โพ In the source section, replace my-gcp-project with the ID of the GCP project containing the source bucket.
โพ In the source section, replace my-bucket with the name of the GCS bucket containing the input files (output files from the previous step).
โพIn the source section, replace the value of the gcp_credentials_secret parameter with the service account credentials for the GCP project containing the source bucket.
โพ In the destinations section, replace my-gcp-project with the ID of the GCP project containing the target dataset. It can be the same as in the source section or a different one.
โพ In the destinations section, replace my-gbq-dataset-YOUR-NAME with the name of the dataset that will contain the tables. You need to create this dataset in your destination project beforehand. The storage-to-tables data operation won't create an empty dataset if needed.
โพ In the destinations section, replace the value of the gcp_credentials_secret parameter with the service account credentials for the GCP project containing the target dataset. โพ If you share the demo project with other developers, then in the configuration_id, replace YOUR-NAME by a personal value, like your name. This way, you won't overwrite a configuration deployed by someone else. You should also add your name in the source's gcs_source_prefix and archive_prefix, and in the destinations' gbq_dataset to avoid any interferences with another developer's data operation.
Create a Markdown file named tailer-demo-stt.md. You can use it freely to describe the data operation.
Inside the 2-Load_files folder, create a folder named ddl. It will contain the table schema files.
Inside the ddl folder, create four files: โพ stores.json โพ products.json โพ sales.json โพ sales_daily.json
Copy the following contents into the stores.json file:
Copy the following contents into the products.json file.
Copy the following contents into the sales.json file.
Copy the following contents into the sales_daily.json file.
Create Markdown files for each DDL file. You can use them freely to describe the table schemas.
By default, in the DDL, all the database fields created have the "string" type. This will be modified during the next data pipeline operation if necessary.
Once your files are ready, you can deploy the data operation:
Access your working folder by running the following command:
To deploy the data operation, run the following command:
You may be asked to select a context (see this page for more information). If you haven't deployed any context, then choose "no context". You can also use the flag --context to specify the context of your choice, or NO_CONTEXT if that's what you want:
Your data operation is now deployed, which means the files will shortly be loaded into tables, and your data operation status is now visible in Tailer Studio.
Access Tailer Studio again.โ
In the left navigation menu, select Storage-to-tables.
In the Configurations tab, search for your data operation, 000099-tailer-demo-load-files. You can see its status is Activated.
Click the data operation ID to display its parameters and full JSON file, or to leave comments about it. in the Tables section, you can access the table schema, parameters, and documentation provided in the Markdown files.
Now that our configuration is deployed, we can test it. Let's mimic production behavior. Access the folders you created when preparing the demonstration environment:
In your source bucket, copy a file. The file name must match one of the filename_template specified in the configuration.
On Tailer Studio, in the Storage-to-Tables section, Runs tab, you should see a run for your data operation. It should appear as "running" and quickly get the status "success".
In your source bucket, input-folder should be empty.
In your source bucket, archive-tailer-demo-folder should contain a folder for each input file, named as the filename date.
Your destination dataset should contain a table corresponding to the input files.
You can check the full Storage to Tables documentation and try other features:
Load different input format as JSON, PARQUET or AVRO files, or gzip compressed files
Allow unknown supplementary fields using the "bq_load_job_ignore_unknown_values" parameter to allow partners to provide new columns without any risk of service interruption
Try the other ddl_mode, like "header" to infer a table format based on the header row, or "file_template" which allow to provide a ddl_file directly in the source bucket