Validity Checks

Validity checks in Alation Data Quality are used to detect invalid or unexpected values in datasets. These metrics help ensure data conforms to expected formats, ranges, patterns, and business rules by validating individual column values against predefined criteria.

These checks are applied at the column level to identify non-conforming data and is essential for data type validation, format verification, and business rule compliance. The checks returns Pass or Fail status based on whether values meet validation criteria.

Follow these best practices for validity checks:

  • For zero tolerance, use = 0 for critical fields requiring 100% validity.

  • For tolerance-based, use <= 5% for fields where some invalid data is acceptable.

  • For large datasets, prefer percentage-based thresholds over counts.

  • Optimize complex patterns for large datasets.

  • Test regex patterns on sample data before production use to avoid regex errors.

  • Use built-in formats when possible instead of custom regex.

  • Consider data volume when setting thresholds.

  • Start with common validation patterns before creating custom rules.

  • Allow flexibility for international formats (e.g. phone numbers or addresses).

  • Update validation rules as business requirements evolve.

  • Ensure column data type supports chosen validation. This avoids format mismatches.

  • Review and adjust validation rules based on real data patterns to avoid false positives.

Available Metrics

Metric

Purpose

Returns

Best Used For

invalid_count

Count absolute number of invalid values

Integer

Small datasets, zero-tolerance scenarios

invalid_percent

Calculate percentage of invalid values

Percentage

Large datasets, tolerance-based validation

Metric List

Metric

Description

Supported Data Types

Column Configuration Keys

invalid_count

The number of rows in a column that contain values that are not valid

number, text, time

invalid format, invalid values, valid format, valid length, valid max, valid max length, valid min, valid min length, valid values

invalid_count (text only)

Text-specific invalid count with regex support

text

invalid regex, valid regex

invalid_percent

The percentage of rows in a column, relative to the total row count, that contain values that are not valid

number, text, time

invalid format, invalid values, valid format, valid length, valid max, valid max length, valid min, valid min length, valid values

invalid_percent (text only)

Text-specific invalid percentage with regex support

text

invalid regex, valid regex

Configuration Keys

Configuration Key

Description

Expected Value Type

invalid format

Defines the format of a value that Soda ought to register as invalid. Only works with columns that contain data type TEXT.

Format name from valid formats list

invalid regex

Specifies a regular expression to define your own custom invalid values.

regex, no forward slash delimiters

invalid values

Specifies the values that Soda ought to consider invalid.

values in a list

valid format

Defines the format of a value that Soda ought to register as valid. Only works with columns that contain data type TEXT.

Format name from valid formats list

valid length

Specifies a valid length for a string. Works with columns that contain data type TEXT, and also with INTEGER on most databases, where implicit casting from string to integer is supported.

integer

valid max

Specifies a maximum numerical value for valid values.

integer or float

valid max length

Specifies a valid maximum length for a string. Only works with columns that contain data type TEXT.

integer

valid min

Specifies a minimum numerical value for valid values.

integer or float

valid min length

Specifies a valid minimum length for a string. Only works with columns that contain data type TEXT.

integer

valid regex

Specifies a regular expression to define your own custom valid values.

regex, no forward slash delimiters

valid values

Specifies the values that Soda ought to consider valid.

values in a list

Valid formats

Valid formats apply only to columns using data type TEXT, not DATE or NUMBER.

Format Name

Description

Example

credit card number

Four four-digit numbers separated by spaces, dashes, or sixteen-digit number

1234 5678 9012 3456, 1234-5678-9012-3456

date eu

European date format (validates date only, not time)

dd/mm/yyyy

date inverse

Inverse date format (validates date only, not time)

yyyy/mm/dd

date iso 8601

ISO 8601 date and/or time format

2021-04-28T09:00:00+02:00

date us

US date format (validates date only, not time)

mm/dd/yyyy

decimal

Number uses a , or . as a decimal indicator

123.45, 123,45

decimal comma

Number uses , as decimal indicator

123,45

decimal point

Number uses . as decimal indicator

123.45

email

Standard email address format

user@example.com

integer

Whole number

123

ip address

IPv4 address format

192.168.1.1

ipv4 address

IPv4 address format

192.168.1.1

ipv6 address

IPv6 address format

2001:0db8:85a3:0000:0000:8a2e:0370:7334

money

Money pattern with currency symbol + decimal + currency abbreviation

$123.45 USD

money comma

Money pattern with currency symbol + decimal comma + currency abbreviation

€123,45 EUR

money point

Money pattern with currency symbol + decimal point + currency abbreviation

$123.45 USD

negative decimal

Negative number with decimal indicator

-123.45, -123,45

negative decimal comma

Negative number with comma as decimal indicator

-123,45

negative decimal point

Negative number with point as decimal indicator

-123.45

negative integer

Negative whole number

-123

negative percentage

Negative percentage

-50%

negative percentage comma

Negative percentage with comma decimal indicator

-50,5%

negative percentage point

Negative percentage with point decimal indicator

-50.5%

percentage

Percentage value

50%

percentage comma

Percentage with comma decimal indicator

50,5%

percentage point

Percentage with point decimal indicator

50.5%

phone number

Various phone number formats

+12 123 123 1234, 123 123 1234, +1 123-123-1234, 555-2368

positive decimal

Positive number with decimal indicator

123.45, 123,45

positive decimal comma

Positive number with comma as decimal indicator

123,45

positive decimal point

Positive number with point as decimal indicator

123.45

positive integer

Positive whole number

123

positive percentage

Positive percentage

50%

positive percentage comma

Positive percentage with comma decimal indicator

50,5%

positive percentage point

Positive percentage with point decimal indicator

50.5%

time 12h

12-hour clock format with seconds

hh:mm:ss AM/PM

time 12h nosec

12-hour clock format without seconds

hh:mm AM/PM

time 24h

24-hour clock format with seconds

hh:mm:ss

time 24h nosec

24-hour clock format without seconds

hh:mm

timestamp 12h

12-hour clock timestamp

hh:mm:ss AM/PM

timestamp 24h

24-hour clock timestamp

hh:mm:ss

uuid

Universally unique identifier

550e8400-e29b-41d4-a716-446655440000

Common Use Cases

Scenario

Configuration Approach

Example

Email Validation

Use built-in email format

valid format: email

Phone Numbers

Use phone format or custom regex

valid format: phone number

Status Fields

Define allowed values list

valid values: PENDING, APPROVED, or REJECTED

ID Formats

Use regex patterns

valid regex: ^[A-Z]{2}-[0-9]{6}$

Numerical Ranges

Set min/max boundaries

valid min: 1, valid max: 100

Text Length

Control string length

valid min length: 5, valid max length: 50

Common Validity Check Types

Format-Based Validation

Use predefined formats for common data patterns:

invalid_count(email_address) = 0:
valid format: email

Value-Based Validation

Specify allowed/disallowed values.

invalid_count(status) = 0:
valid values: [ACTIVE, INACTIVE, PENDING]

Pattern-Based Validation

Use regex for custom patterns.

invalid_count(product_code) = 0:
valid regex: '^PROD-[0-9]{4}$'

Range-Based Validation

Set numerical or length boundaries.

invalid_count(age) = 0:
valid min: 0
valid max: 120

Integration with Other Checks

  • Combine with Completeness Checks: This ensures fields are both present and valid.

    Ensure emails exist and are properly formatted, valid format: email: missing_count(email) = 0 and invalid_count(email) = 0:

  • Layer with Uniqueness Checks: This validates format and uniqueness.

    Customer IDs must be unique and follow pattern, valid regex: ^CUST-[0-9]{8}$: duplicate_count(customer_id) = 0 and invalid_count(customer_id) = 0: