Accuracy Checks¶
Accuracy checks validate statistical properties of numeric columns to ensure values fall within expected ranges and meet business expectations. These checks perform mathematical calculations on your data to verify it meets quality standards.
Follow these best practices for accuracy checks:
Set meaningful thresholds based on historical patterns in your data.
Configure both
Minimum
andMaximum
checks where applicable to catch anomalies in both directions.Use standard deviation checks to identify statistical outliers.
Consider seasonality when setting thresholds for business metrics.
Update thresholds periodically as business conditions evolve.
Available Metrics¶
Minimum / Maximum: Validates value boundaries within acceptable ranges.
Average: Ensures mean values meet business expectations.
Sum: Validates aggregate totals for financial or quantity data.
Standard Deviation: Monitors data distribution consistency.
Variance: Tracks statistical variance within acceptable bounds.
Percentile: Validates specific percentile values.
Row Count: Ensures dataset size meets expectations.
Metric |
Description |
Supported Data Types |
Database Support |
---|---|---|---|
avg |
The average value in a numeric column |
number |
all |
max |
The greatest value in a numeric column |
number |
all |
min |
The smallest value in a numeric column |
number |
all |
percentile |
A percentile check evaluates whether a value in a dataset falls below or above a certain percentile threshold within a group of observations. Example: percentile(distance, 0.7) returns the value below which 70% of the distance values fall, and can be used to detect outliers or assess distribution skew. |
number |
PostgreSQL, Snowflake |
stddev |
The calculated standard deviation of values in a numeric column |
number |
BigQuery, PostgreSQL, Redshift, Snowflake |
stddev_pop |
The calculated population standard deviation of values in a numeric column |
number |
BigQuery, PostgreSQL, Redshift, Snowflake |
stddev_samp |
The calculated sample standard deviation of values in a numeric column |
number |
BigQuery, PostgreSQL, Redshift, Snowflake |
sum |
The calculated sum of the values in a numeric column |
number |
all |
variance |
The calculated variance of the values in a numeric column |
number, time |
BigQuery, PostgreSQL, Snowflake |
var_pop |
The calculated population variance of the values in a numeric column |
number, time |
BigQuery, PostgreSQL, Redshift, Snowflake |
var_samp |
The calculated sample variance of the values in a numeric column |
number, time |
BigQuery, PostgreSQL, Redshift, Snowflake |
Row_count (Table only) |
The number of rows in a dataset or column, if specified |
number, text, time |
all |
Configuration Examples¶
Check Description |
Configuration Steps |
Example Check |
---|---|---|
Ensure product prices are positive |
|
|
Validate reasonable customer ages |
|
|
Ensure average order value is within normal range |
|
|
Common Accuracy Check Scenarios¶
Business Requirement |
Metric Type |
Example Configuration |
Use Case |
---|---|---|---|
Financial Data Validation |
min, max, sum |
|
Ensure financial values are within realistic bounds |
Inventory Management |
avg, row_count |
|
Monitor stock levels and data availability |
Performance Metrics |
percentile, stddev |
|
Track system performance thresholds |
Data Distribution Analysis |
variance, stddev_pop |
|
Monitor data consistency and outliers |
Text Content Validation |
avg_length, max_length |
|
Ensure content meets quality standards |
Duplicate Detection |
duplicate_count, duplicate_percent |
|
Maintain data uniqueness |
Advanced Statistical Checks¶
Standard Deviation Checks¶
Use
stddev
for general population standard deviation.Use
stddev_pop
when working with complete populations.Use
stddev_samp
when working with sample data.
Variance Checks¶
Use
variance
for general variance calculations.Use
var_pop
for population variance.Use
var_samp
for sample variance.
Percentile Checks¶
Useful for identifying outliers and understanding data distribution.
percentile(column, 0.5)
calculates the median value.
percentile(column, 0.95)
identifies the 95th percentile.
Change-Over-Time Thresholds¶
Accuracy checks support dynamic thresholds that compare current values to historical measurements.
Basic Change Detection¶
-- Alert if row count changes by more than 50 rows from previous measurement
change for row_count between -20 and +50
-- Alert if average changes by more than 10% from previous week
change same day last week for avg(order_amount) > 10%
Advanced Historical Comparisons¶
-- Compare to average of last 7 measurements
change avg last 7 for row_count < 50
-- Compare to minimum of last 7 measurements
change min last 7 for duplicate_count(email) < 5
-- Compare percentage change to maximum of last 7 measurements
change max last 7 percent for sum(revenue) < 20%
Change-Over-Time Components¶
Calculation Type (optional):
avg
,min
,max
.Historical Period (optional):
last 7
.Percentage Mode (optional):
percent
.Metric: The numeric metric to calculate.
Comparison: The threshold for acceptable change.