Asserting Data Quality with Expectations
This is the guide page for Fashion Data Expectation stored procedures.
What are expectations? A set of SQL stored procedures that ease the writing of tests. Too often, writing tests in SQL requires assertions that are complex to write and not factorized. In a complex project, those assertions are spread across many projects and scripts, and maintaining them can be a tedious task. Fashion Data Expectations are a way to solve these problems:
- a set of stored procedures that covers a large number of common assertions (primary key checks, integrity constraints, regular expressions, etc.)
- one-liners that are still part of the SQL ecosystem, so they live with your SQL code and your Tailer configurations and benefit from classic SQL syntax.
- fast execution with parallel processing.
- full assertion metrics, including the number of rejected lines, assertion processing time, timestamping, metrics history, etc.
Let’s say that you imported data into a BigQuery project and you want to check a primary key constraint with a certain threshold. In plain SQL, you would write something like this:
select (
  (
    (select count(distinct PK_products) from `dlk-demo.dlk_demo_pda.products`)
    - (select count(*) from `dlk-demo.dlk_demo_pda.products`)
  ) = 0
) as pk_check_dlk_demo_pda_products;
With Fashion Data Expectations, you just write:
CALL `tailer-ai.expect.primarykey_named`('dlk-demo.dlk_demo_pda', 'products', 'PK_products', 0);
You can launch an expectation directly from your BigQuery console to check your test. If the call is properly formed, BigQuery launches the jobs described in the procedure, and the result of the last job shows the test status and the related metrics. This eases the development of a set of expectations and is also useful for ad hoc quality checks.
-- Expectations usually have the following format
-- CALL `tailer-ai.expect.EXPECTATION`('PROJECT_ID.DATASET_ID', 'TABLE_ID', SOME_PARAMETERS);
CALL `tailer-ai.expect.table_count_greater`('dlk-demo.dlk_demo_pda', 'products', 10000, 0);
To create an expectation, you need two elements:
- a dedicated task in a table-to-table configuration
- a dedicated SQL file
The dedicated task must be of type “expectation”. For example:
"short_description": "Check for data integrity (pk, count, dates,...).",
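To show where this fragment fits, here is a minimal sketch of such a task. Apart from the "expectation" task type stated above and the short_description shown, the field names (id, sql_file) are assumptions for illustration; adapt them to your actual Tailer table-to-table configuration:

```json
{
  "id": "check_products_quality",
  "task_type": "expectation",
  "short_description": "Check for data integrity (pk, count, dates,...).",
  "sql_file": "expectations_products.sql"
}
```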
In your SQL file, you can add as many expectations as you want:
-- assert that the row count is greater than 100,000
CALL `tailer-ai.expect.table_count_greater`('dlk-demo.dlk_demo_pda', 'products', 100000, 0);
-- assert that the primary key holds
CALL `tailer-ai.expect.primarykey`('dlk-demo.dlk_demo_pda', 'products', 0);
-- assert freshness on the final table (we want at least 10k products for today's iteration)
CALL `tailer-ai.expect.values_to_contain`('dlk-demo.dlk_demo_pda', 'products', 'max_importdate', cast(current_date() as string), 10000, 0);
-- assert freshness on the psa table (we want at least 100k products for today's psa)
CALL `tailer-ai.expect.table_count_greater`('dlk-demo.dlk_demo_psa', concat('products_', replace(cast(current_date() as string), '-', '')), 100000, 0);
-- assert freshness on the psa table for yesterday (we want at least 100k products for yesterday's psa)
CALL `tailer-ai.expect.table_count_greater`('dlk-demo.dlk_demo_psa', concat('products_', replace(cast(date_sub(current_date(), interval 1 day) as string), '-', '')), 100000, 0);
Your call to a stored procedure is treated as a SQL instruction, so you can use regular SQL functions to build its arguments. For example, combining CONCAT, DATE_SUB and CURRENT_DATE enables counting with a sliding window over date-sharded tables:
-- Check the line count of the table products_YYYYMMDD, where YYYYMMDD is yesterday's date
CALL `tailer-ai.expect.table_count_greater`('dlk-demo.dlk_demo_psa', concat('products_', replace(cast(date_sub(current_date(), interval 1 day) as string), '-', '')), 100000, 0);
Every time an expectation embedded in a table-to-table data operation is executed, it generates metrics that are appended to the tailer_common.expectation_results table in your project. For example:
SELECT * FROM `dlk-demo.tailer_common.expectation_results` LIMIT 1000
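Since results accumulate over time (the metrics history mentioned above), you will typically filter this table rather than browse it whole. A sketch, assuming a timestamp column named execution_date; that column name is a hypothetical example and must be replaced with the actual field from the schema below:

```sql
-- Keep only the last 7 days of expectation results (execution_date is a hypothetical column name)
SELECT *
FROM `dlk-demo.tailer_common.expectation_results`
WHERE DATE(execution_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
ORDER BY execution_date DESC;
```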
Here are the fields of this table:
The field called “expectation_result” contains additional information about the expectation in JSON format. The generic output is defined here, and some specific results may be added for certain expectations (see details in the list of expectations):
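Because expectation_result is a JSON string, you can drill into it with BigQuery's JSON functions. A sketch, assuming only the expectation_result field described above; the key name failed_rows used here is a hypothetical example of a metric in the JSON payload, not a documented key:

```sql
-- Extract one metric from the JSON payload of each result
-- ('$.failed_rows' is a hypothetical key; use a key from the generic output)
SELECT
  expectation_result,
  JSON_EXTRACT_SCALAR(expectation_result, '$.failed_rows') AS failed_rows
FROM `dlk-demo.tailer_common.expectation_results`
LIMIT 100;
```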
You can find an Expectations Overview in Tailer Studio.
In the left panel, click "Expectations Overview" in the "Data Quality" section to see all the expectations that have recently been tested.