List of Expectations


Fashion Data Expectations

  • Key constraints

  • Temporal continuity

  • Row count

  • Column properties

Key constraints

primarykey_named(dataset, tablename, column, threshold)

CALL `tailer-ai.expect.primarykey_named`('my-project.my_dataset', 'sales_details', 'PK_sales_details', 0);
CALL `tailer-ai.expect.primarykey_named`('my-project.my_dataset', 'sales_details', 'CONCAT(ticket_id, "-", line_number)', 0.0001);          

Expect a column in a table to respect a pseudo Primary Key constraint.

This procedure checks that every value of the specified column is non-null and unique within the table. It does so by counting the total number of rows in the table and comparing it to the number of distinct elements in the column. A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.
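
For illustration, the check can be pictured as the following standalone query, reusing the table and column from the first example above (a sketch, not the procedure's actual body):

-- Sketch: NULLs and duplicates both count as rejected rows, since COUNT(DISTINCT) ignores NULLs;
-- the test passes when the rejected ratio stays within the threshold (0 in the first example).
SELECT
  COUNT(*) AS total_rows,
  COUNT(*) - COUNT(DISTINCT PK_sales_details) AS rejected_rows,
  SAFE_DIVIDE(COUNT(*) - COUNT(DISTINCT PK_sales_details), COUNT(*)) <= 0 AS test_passed
FROM `my-project.my_dataset.sales_details`;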

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name, or an SQL expression that builds a pseudo column and can be used inside a COUNT(DISTINCT ...)

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

primarykey(dataset, tablename, threshold)

CALL `tailer-ai.expect.primarykey`('my-project.my_dataset', 'sales_details', 0);

Expect a table to have a column that respects a pseudo Primary Key constraint.

This procedure looks for a column whose name starts with 'PK' or whose description contains '#PK'. It then checks that every value of this column is non-null and unique within the table, by counting the total number of rows in the table and comparing it to the number of distinct elements in the column. A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.
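
As an illustration only (the procedure's internal lookup is not documented on this page), such a column could be located through BigQuery's INFORMATION_SCHEMA, which exposes column names and descriptions:

-- Sketch: find a candidate primary-key column by name prefix or by a '#PK' tag in its description.
SELECT column_name
FROM `my-project.my_dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'sales_details'
  AND (STARTS_WITH(column_name, 'PK') OR IFNULL(description, '') LIKE '%#PK%');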

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

primarykey_where(dataset, tablename, threshold, where_clause)

CALL `tailer-ai.expect.primarykey_where`('my-project.my_dataset', 'stocks', 0, 'WHERE date >= current_date - 7');

Expect a table to have a column that respects a pseudo Primary Key constraint. Apply a WHERE condition before testing the primary key constraint to limit the number of rows requested.

The test performed is the same as with the primarykey expectation; see the details above.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

    • where_clause (STRING) – A proper WHERE clause that will filter the table before applying the test.

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

foreignkey(dataset, tablename, column, target_dataset, target_tablename, target_column, threshold)

CALL `tailer-ai.expect.foreignkey`('my-project.my_dataset', 'sales_details', 'customer_id', 'my-project.my_dataset', 'customers', 'customer_id', 0.001);          

Expect a column in a table to respect a pseudo Foreign Key constraint.

This procedure checks that every non-null value in the column can be found among the values of the target column in the referenced table. A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.
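
For illustration, the rejected rows can be pictured with an outer join, reusing the tables and columns from the example above (a sketch, not the procedure's actual body):

-- Sketch: non-null customer_id values with no match in the customers table are rejected;
-- the test passes when rejected_ratio stays within the threshold (0.001 in the example above).
SELECT
  COUNTIF(s.customer_id IS NOT NULL AND c.customer_id IS NULL) AS rejected_rows,
  SAFE_DIVIDE(COUNTIF(s.customer_id IS NOT NULL AND c.customer_id IS NULL), COUNT(*)) AS rejected_ratio
FROM `my-project.my_dataset.sales_details` AS s
LEFT JOIN (SELECT DISTINCT customer_id FROM `my-project.my_dataset.customers`) AS c
  ON s.customer_id = c.customer_id;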

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • target_dataset (STRING) – The target dataset of the foreign table (and its GCP project)

    • target_tablename (STRING) – The foreign table name

    • target_column (STRING) – The foreign key column of the foreign table

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

foreignkey_where(dataset, tablename, column, filter_condition, target_dataset, target_tablename, target_column, threshold)

CALL `tailer-ai.expect.foreignkey_where`('my-project.my_dataset', 'sales_details', 'customer_id', 'and SALE_DATE >= current_date - 90', 'my-project.my_dataset', 'customers', 'customer_id', 0.001);

Expect a column in a table to respect a pseudo Foreign Key constraint.

Add a filter condition to the WHERE clause before testing the foreign key constraint to limit the number of rows requested.

This procedure checks that every non-null value in the column can be found among the values of the target column in the referenced table. A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • filter_condition (STRING) – The filter condition that will be applied on the table to check

    • target_dataset (STRING) – The target dataset of the foreign table (and its GCP project)

    • target_tablename (STRING) – The foreign table name

    • target_column (STRING) – The foreign key column of the foreign table

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

Temporal continuity

everyday_since(dataset, tablename, column, start_date, exception, minimum)

CALL `tailer-ai.expect.everyday_since`('my-project.my_dataset', 'sales_details', 'sale_date', DATE_SUB(current_date, interval 31 day), ['2022-01-01', '2021-12-25', cast(current_date as string)], 1000);          

Expect a table to have a minimum number of rows per day since a start date. An exception list can be provided to avoid an error when a date legitimately has no data.

This procedure counts the rows of the specified table grouped by the date column. If a day between start_date and today is missing, or if a daily count is below minimum, the test fails, unless the date appears in the exception list.
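
For illustration, the failing days could be listed with a query like the one below, reusing the values from the example above (a sketch, not the procedure's actual body; sale_date is assumed to be castable to DATE):

-- Sketch: expected days that are missing or below the minimum,
-- excluding the dates listed in the exception array.
WITH daily AS (
  SELECT CAST(sale_date AS DATE) AS sale_day, COUNT(*) AS row_count
  FROM `my-project.my_dataset.sales_details`
  GROUP BY sale_day
)
SELECT expected_day, IFNULL(daily.row_count, 0) AS row_count
FROM UNNEST(GENERATE_DATE_ARRAY(DATE_SUB(CURRENT_DATE(), INTERVAL 31 DAY), CURRENT_DATE())) AS expected_day
LEFT JOIN daily ON daily.sale_day = expected_day
WHERE IFNULL(daily.row_count, 0) < 1000
  AND expected_day NOT IN UNNEST([DATE '2022-01-01', DATE '2021-12-25', CURRENT_DATE()]);

The expectation fails as soon as a query like this sketch would return at least one row.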

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • start_date (STRING) – The date from which the check starts

    • exception (ARRAY<DATE>) – An array of dates that will not be checked

    • minimum (INT64) – The minimum number of rows expected per date

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

everyday_increasing_since(dataset, tablename, value)

CALL `tailer-ai.expect.everyday_increasing_since`('my-gcp-project.my_dataset', 'products', cast('2021-11-01' as date));          

Expect a table to have a daily number of rows continuously increasing since a predefined date.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • value (DATE) – The start date from which the daily row count must be increasing

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

everyweek_since(dataset, tablename, column, start_date, exception, minimum)

CALL `tailer-ai.expect.everyweek_since`('my-project.my_dataset', 'sales_details', 'sale_date', DATE_TRUNC(DATE_SUB(current_date, interval 2 month), week), ['2022-01-01', cast(current_date as string)], 1000);          

Expect a table to have a date column containing one date per week since start_date, each with a minimum number of rows. An exception list can be provided to avoid an error when a date legitimately has no data.

This procedure generates a date array containing the start_date and the same weekday for every week until the current date. It then counts the rows of the table grouped by date. If one of the expected weekly dates is missing, or if its row count is below minimum, or if the table contains an extra date that does not fit the weekly pattern, the test fails, unless the date appears in the exception list.
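
The expected weekly dates can be pictured with GENERATE_DATE_ARRAY and a one-week interval (a sketch of the generated series only; the comparison with the table then works as sketched for everyday_since above):

-- Sketch: the expected weekly dates, from the truncated start date up to today.
SELECT expected_week
FROM UNNEST(GENERATE_DATE_ARRAY(
  DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 2 MONTH), WEEK),
  CURRENT_DATE(),
  INTERVAL 1 WEEK)) AS expected_week;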

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • start_date (STRING) – The date from which the check starts

    • exception (ARRAY) – An array of dates that will not be checked

    • minimum (INT64) – The minimum number of rows expected per date

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

everymonth_since(dataset, tablename, column, start_date, exception, minimum)

CALL `tailer-ai.expect.everymonth_since`('my-project.my_dataset', 'sales_details', 'sale_date', DATE_TRUNC(DATE_SUB(current_date, interval 13 month), month), ['2022-01-01', cast(current_date as string)], 1000);          

Expect a table to have a date column containing one date per month since start_date, each with a minimum number of rows. An exception list can be provided to avoid an error when a date legitimately has no data.

This procedure generates a date array containing the start_date and the same day for every month until the current date. It then counts the rows of the table grouped by date. If one of the expected monthly dates is missing, or if its row count is below minimum, or if the table contains an extra date that does not fit the monthly pattern, the test fails, unless the date appears in the exception list.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • start_date (STRING) – The date from which the check starts

    • exception (ARRAY<DATE>) – An array of dates that will not be checked

    • minimum (INT64) – The minimum number of rows expected per date

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

Row count

table_count_greater(dataset, tablename, value, threshold)

CALL `tailer-ai.expect.table_count_greater`('my-project.my_dataset', 'stores', 1600, 0.01);          

Expect a table to have a count greater than or equal to a predefined value.

A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • value (INT64) – The value the table count must be greater than or equal to

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

table_count_between(dataset, tablename, value)

CALL `tailer-ai.expect.table_count_between`('my-project.my_dataset', 'customers', ['300000', '2000000']);

Expect a table to have a number of rows to be between two values.

The values for the comparison must be provided as strings and are cast to integers during the assertion. The order in the array matters, as the BETWEEN predicate is used to enforce this expectation: the lower value must come before the upper value.
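
A sketch of the comparison, with the example values cast back to integers (lower bound first):

-- Sketch: the row count must fall inside the inclusive BETWEEN range.
SELECT COUNT(*) BETWEEN CAST('300000' AS INT64) AND CAST('2000000' AS INT64) AS test_passed
FROM `my-project.my_dataset.customers`;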

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • value (ARRAY<STRING>) – The array of values that will be used to check the table

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

table_count_equal(dataset, tablename, value, threshold)

CALL `tailer-ai.expect.table_count_equal`('my-project.my_dataset', 'stores', 500, 0.1);           

Expect a table to have a count equal to a predefined value.

A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • value (INT64) – the value the table count must be equal to

    • threshold (FLOAT64) – the threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

table_count_equal_other_table(dataset, tablename, target_dataset, target_tablename, threshold)

CALL `tailer-ai.expect.table_count_equal_other_table`('my-project.my_dataset', 'stores', 'my-project.my_other_dataset', 'stores', 0.01);

Expect a table to have the same number of rows as another table.

A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • target_dataset (STRING) – The target dataset

    • target_tablename (STRING) – The target table name

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

Column properties

unique(dataset, tablename, column)

CALL `tailer-ai.expect.unique`('my-project.my_dataset', 'stores', 'store_id');          

Expect every value in the column to be unique.

This procedure checks that the number of distinct values in the specified column equals the total number of rows in the table. Null values are included in the check (one row may hold a null value, but it must be the only one).
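
One way to express the check, assuming NULL is treated as a single value as described above (a sketch, not the procedure's actual body):

-- Sketch: distinct values, counting NULL as (at most) one value, must equal the total row count.
SELECT COUNT(*) = COUNT(DISTINCT store_id) + IF(COUNTIF(store_id IS NULL) > 0, 1, 0) AS test_passed
FROM `my-project.my_dataset.stores`;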

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

not_null(dataset, tablename, column, threshold)

CALL `tailer-ai.expect.not_null`('my-project.my_dataset', 'sales_details', 'product_sku', 0.001);          

Expect a column in a table to never be null.

This procedure counts the number of null values in the specified column. A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.
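
A rough standalone equivalent of the check, reusing the column and threshold from the example above (a sketch only):

-- Sketch: the share of NULL values must stay within the threshold (0.001 in the example above).
SELECT SAFE_DIVIDE(COUNTIF(product_sku IS NULL), COUNT(*)) <= 0.001 AS test_passed
FROM `my-project.my_dataset.sales_details`;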

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name to check for value

    • threshold (FLOAT64) – the threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

null(dataset, tablename, column, threshold)

CALL `tailer-ai.expect.null`('my-project.my_dataset', 'logs', 'error_code', 0.05);          

Expect a column in a table to be entirely null.

This procedure counts the number of non-null values in the specified column. A threshold percentage can be provided: the test passes if the number of rejected rows divided by the table's total row count is less than the threshold. Use 0 if no rejected row is allowed.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name to check for value

    • threshold (FLOAT64) – The threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

type(dataset, tablename, column, type)

CALL `tailer-ai.expect.type`('my-project.my_dataset', 'stores', 'store_id', 'INT64');           

Expect a table to have a column whose values can be cast to the predefined type without error.

This procedure checks that a non-null value, safe-cast to the wanted type, does not become null. All BigQuery data types are allowed (see the BigQuery documentation).
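
For illustration, the check can be pictured as the following count, reusing the table, column, and INT64 type from the example above (a sketch, not the procedure's actual body):

-- Sketch: non-null values that fail a SAFE_CAST to the expected type count as rejected.
SELECT COUNTIF(store_id IS NOT NULL AND SAFE_CAST(store_id AS INT64) IS NULL) AS rejected_rows
FROM `my-project.my_dataset.stores`;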

  • Parameters

    • dataset (STRING) – The dataset of the table (and its related GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • type (STRING) – The type of the column to check

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

values_to_contain(dataset, tablename, column, value, minimum, threshold)

 CALL `tailer-ai.expect.values_to_contain`('my-project.my_dataset', 'sales', 'date', '2022-01-22', 1000, 0.01);           

Expect a column in a table to contain a given value at least a minimum number of times, within a given threshold.

The authorized value must be provided as a string (even for numeric values), as the verification predicate applies a SAFE_CAST to STRING. A threshold may be specified so that a marginal share of rejected rows does not trigger an assertion failure.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • value (STRING) – The value of the predefined set

    • minimum (INT64) – The minimum value to have for the column

    • threshold (FLOAT64) – the threshold to use to trigger an assertion failure (as a percentage)

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

values_to_be_between(dataset, tablename, column, value, threshold)

CALL `tailer-ai.expect.values_to_be_between`('my-project.my_dataset', 'sales', 'quantity' ,['-20','20'], 0.01);       
CALL `tailer-ai.expect.values_to_be_between`('my-project.my_dataset', 'sales', 'date', ['2015-01-01','2025-01-01'], 0);          

Expect the values of a column to fall between two bounds.

The values may be integers, floats, or dates. The BETWEEN predicate is inclusive and requires the bounds in the proper order (for example, for dates, the first date must be earlier than the second). A threshold may be specified so that a marginal share of out-of-range values does not trigger an assertion failure.
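
A rough equivalent of the check for the first example above (a sketch; NULL values are simply ignored here, which is an assumption):

-- Sketch: the share of values outside the inclusive [-20, 20] range must stay within the threshold.
SELECT SAFE_DIVIDE(COUNTIF(quantity NOT BETWEEN -20 AND 20), COUNT(*)) <= 0.01 AS test_passed
FROM `my-project.my_dataset.sales`;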

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • value (ARRAY) – The array that contains the two values range

    • threshold (FLOAT64) – the threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

values_to_be_in_set(dataset, tablename, column, value, threshold)

CALL `tailer-ai.expect.values_to_be_in_set`('my-project.my_dataset', 'sales', 'type', ['1','2', '3', '5', '7', '9'], 0);

Expect the values of a column to belong to a predefined set.

The authorized values must be provided as an array of strings, as the verification predicate casts the column values to STRING. A threshold may be specified so that a marginal share of rejected values does not trigger an assertion failure.
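
A rough equivalent of the check for the example above (a sketch, not the procedure's actual body):

-- Sketch: values cast to STRING must belong to the authorized set;
-- the rejected share must stay within the threshold (0 in the example above).
SELECT SAFE_DIVIDE(COUNTIF(CAST(type AS STRING) NOT IN UNNEST(['1', '2', '3', '5', '7', '9'])), COUNT(*)) <= 0 AS test_passed
FROM `my-project.my_dataset.sales`;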

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • value (ARRAY) – The values array of the predefined set

    • threshold (FLOAT64) – the threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.

values_to_not_be_in_set(dataset, tablename, column, value, threshold)

CALL `tailer-ai.expect.values_to_not_be_in_set`('my-project.my_dataset', 'sales', 'type', ['0', '6', '8'], 0);

Expect the values of a column to NOT belong to a predefined set of values.

The forbidden values must be provided as an array of strings, as the verification predicate casts the column values to STRING. A threshold may be specified so that a marginal share of rejected values does not trigger an assertion failure.

  • Parameters

    • dataset (STRING) – The dataset of the table (and its GCP project)

    • tablename (STRING) – The table name

    • column (STRING) – The column name

    • value (ARRAY) – The array of values that must not appear in the column

    • threshold (FLOAT64) – the threshold to use to trigger an assertion failure

  • Returns – The last SQL job of the procedure returns a result line. When this expectation is embedded in a Tables to Tables data operation, this line is inserted into the tailer_common.expectation_results table in your GCP project.
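
All of the expectations above write their result lines to the same table when they run inside a Tables to Tables data operation. A minimal way to inspect it (a sketch; the table lives in your own GCP project, shown here as my-project, and its exact schema is not detailed on this page):

-- Sketch: browse the expectation result lines collected by Tailer.
SELECT *
FROM `my-project.tailer_common.expectation_results`
LIMIT 100;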

