Configure Sampling and Profiling

Alation Cloud Service Applies to Alation Cloud Service instances of Alation

Customer Managed Applies to customer-managed instances of Alation

Enhanced Connector Enhanced connectors add extended capabilities and require a separate entitlement in addition to your Alation platform license.

The MaxCompute connector supports data sampling and profiling to provide insights into your data directly in the Alation catalog.

Overview

  • Data Sampling: Retrieves sample rows from tables to display in the catalog.

  • Data Profiling: Calculates column statistics such as min, max, value distributions, and null counts.

Both features use JDBC connections to execute queries against MaxCompute tables.

Prerequisites

Before configuring sampling and profiling:

  • The MaxCompute data source must be configured with valid Access Key credentials in General Settings.

  • The RAM user must have permissions to read data from the target tables.

Note

Sampling and profiling use the service account credentials (Access Key ID / Access Key Secret) configured in General Settings. No per-user credential setup is needed. This is different from Compose, which requires each user to provide their own Access Key credentials.

The following permissions are required:

  • odps:Read - Read access to projects and tables

  • odps:List - List access to project resources

  • odps:CreateInstance - Permission to submit SQL jobs for sampling and profiling queries

At the project level, grant the RAM user one of the following roles: Admin, Super_Administrator, or role_project_reader. Alternatively, grant specific permissions:

GRANT Read, CreateInstance ON PROJECT <project_name> TO USER <ram_user>;

For more information, see the Prerequisites page.

Additionally, metadata extraction must be completed for the tables you want to profile.

Configure Sampling and Profiling

  1. On the MaxCompute data source Settings page, go to the Sampling and Profiling tab.

    ../../../_images/MaxComputeSamplingTab.png
  2. Configure the following settings:

    ../../../_images/MaxComputeSamplingConfig.png

    Setting

    Description

    Enable Sampling

    Turn on to enable data sampling for tables.

    Sample Size

    Number of rows to retrieve for sampling (default: 100).

    Enable Profiling

    Turn on to enable column profiling.

  3. Click Save.

Running Profiling

To run profiling on a table:

  1. Navigate to the table’s catalog page in Alation.

  2. Click the Samples tab to view sample data.

  3. Click Profile to run column profiling.

    ../../../_images/MaxComputeProfilingResults.png

Alternatively, you can schedule profiling to run automatically:

  1. On the Sampling and Profiling tab, configure the profiling schedule.

  2. Select the frequency (daily, weekly) and time.

  3. Click Save.

Profiling Metrics

Note

To view enhanced profiling metrics (Min, Max, Mean, Null Count, Null Percentage), the Profiling V2 feature flag must be enabled in Alation.

  • Alation Cloud Service (ACS): This flag is enabled by default. No action required.

  • Customer-Managed Alation: Enable the flag by running the following command from the Alation shell (requires backend access):

    alation_conf alation.feature_flags.enable_profiling_v2 -s True
    

    Restart Alation for the change to take effect.

    If you do not have backend access, raise an SRE ticket to enable this flag.

The connector collects the following profiling metrics for each column:

For All Column Types:

Metric

Description

Row Count

Total number of rows sampled for profiling.

Value Distribution

Top values and their counts (histogram). Shows the most frequent values in the column. For example, a STRING column might show “Active”: 500, “Inactive”: 200.

Null Count

Number of NULL values in the column.

Null Percentage

Percentage of NULL values in the column.

Additional Metrics for Numerical Columns (INT, BIGINT, FLOAT, DOUBLE, DECIMAL):

Metric

Description

Min

Minimum value in the column.

Max

Maximum value in the column.

Mean

Average (mean) value in the column.

Example: Numerical Column Profiling

For numerical columns, the profile shows Min, Max, Mean, Null Count, Null Percentage, and Value Distribution:

../../../_images/MaxComputeProfilingNumerical.png

Example: Non-Numerical Column Profiling

For non-numerical columns (STRING, DATE, etc.), the profile shows Null Count, Null Percentage, and Value Distribution:

../../../_images/MaxComputeProfilingValueDistribution.png

Per-Project Connections

The MaxCompute connector uses per-project JDBC connections for profiling. When a table is profiled, a connection is created to that table’s specific project. This eliminates the need for cross-project permissions.

Limitations

  • Profiling large tables may take significant time and consume MaxCompute resources.

  • Profiling queries use full scan mode and may scan all partitions.

  • Complex data types (ARRAY, MAP, STRUCT) have limited profiling support:

    Data Type

    Supported Metrics

    Not Supported

    ARRAY

    Null Count, Null Percentage, Row Count

    Min, Max, Average, Histogram (value distribution)

    MAP

    Null Count, Null Percentage, Row Count

    Min, Max, Average, Histogram (value distribution)

    STRUCT

    Null Count, Null Percentage, Row Count

    Min, Max, Average, Histogram (value distribution)

    Note

    To profile data within complex types, consider flattening the data into separate columns using views with EXPLODE or LATERAL VIEW statements, then profile the view instead.

Troubleshooting

Profiling Failed

  • Ensure the RAM user has read permissions on the table.

  • Check MaxCompute quota and resource availability:

    How to check:

    1. Log in to the Alibaba Cloud MaxCompute Console.

    2. Navigate to Management > Quota Management to view your project’s quota usage.

    What to check:

    Resource

    Description

    Compute Unit (CU) quota

    Verify sufficient CUs are available. Profiling queries consume compute resources. If CU quota is exhausted, queries will be queued or fail.

    Concurrent query limit

    Check if the project has reached its maximum concurrent query limit (default: 100 per project).

    Storage quota

    Ensure the project has not exceeded its storage quota, which may affect query execution.

    SQL job status

    In the MaxCompute console, go to Job Management to check if profiling jobs are queued, running, or failed.

    For more information on quotas and limits, see MaxCompute Quotas and Limits in the Alibaba Cloud documentation.

Slow Profiling

  • Reduce the sample size for large tables.

  • Profile tables during off-peak hours.

  • Consider profiling only critical tables.