Configure Sampling and Profiling
Alation Cloud Service: Applies to Alation Cloud Service instances of Alation.
Customer Managed: Applies to customer-managed instances of Alation.
Enhanced Connector: Enhanced connectors add extended capabilities and require a separate entitlement in addition to your Alation platform license.
The MaxCompute connector supports data sampling and profiling to provide insights into your data directly in the Alation catalog.
Overview
Data Sampling: Retrieves sample rows from tables to display in the catalog.
Data Profiling: Calculates column statistics such as min, max, value distributions, and null counts.
Both features use JDBC connections to execute queries against MaxCompute tables.
Prerequisites
Before configuring sampling and profiling:
The MaxCompute data source must be configured with valid Access Key credentials in General Settings.
The RAM user must have permissions to read data from the target tables.
Note
Sampling and profiling use the service account credentials (Access Key ID / Access Key Secret) configured in General Settings. No per-user credential setup is needed. This is different from Compose, which requires each user to provide their own Access Key credentials.
The following permissions are required:
odps:Read - Read access to projects and tables
odps:List - List access to project resources
odps:CreateInstance - Permission to submit SQL jobs for sampling and profiling queries
At the project level, grant the RAM user one of the following roles: Admin, Super_Administrator, or role_project_reader. Alternatively, grant specific permissions:
GRANT Read, CreateInstance ON PROJECT <project_name> TO USER <ram_user>;
For more information, see the Prerequisites page.
Additionally, metadata extraction must be completed for the tables you want to profile.
Configure Sampling and Profiling
On the MaxCompute data source Settings page, go to the Sampling and Profiling tab.
Configure the following settings:
| Setting | Description |
|---|---|
| Enable Sampling | Turn on to enable data sampling for tables. |
| Sample Size | Number of rows to retrieve for sampling (default: 100). |
| Enable Profiling | Turn on to enable column profiling. |
Click Save.
Running Profiling
To run profiling on a table:
Navigate to the table’s catalog page in Alation.
Click the Samples tab to view sample data.
Click Profile to run column profiling.
Alternatively, you can schedule profiling to run automatically:
On the Sampling and Profiling tab, configure the profiling schedule.
Select the frequency (daily, weekly) and time.
Click Save.
Profiling Metrics
Note
To view enhanced profiling metrics (Min, Max, Mean, Null Count, Null Percentage), the Profiling V2 feature flag must be enabled in Alation.
Alation Cloud Service (ACS): This flag is enabled by default. No action required.
Customer-Managed Alation: Enable the flag by running the following command from the Alation shell (requires backend access):
alation_conf alation.feature_flags.enable_profiling_v2 -s True
Restart Alation for the change to take effect.
If you do not have backend access, raise an SRE ticket to enable this flag.
The connector collects the following profiling metrics for each column:
For All Column Types:
| Metric | Description |
|---|---|
| Row Count | Total number of rows sampled for profiling. |
| Value Distribution | Top values and their counts (histogram). Shows the most frequent values in the column. For example, a STRING column might show “Active”: 500, “Inactive”: 200. |
| Null Count | Number of NULL values in the column. |
| Null Percentage | Percentage of NULL values in the column. |
Additional Metrics for Numerical Columns (INT, BIGINT, FLOAT, DOUBLE, DECIMAL):
| Metric | Description |
|---|---|
| Min | Minimum value in the column. |
| Max | Maximum value in the column. |
| Mean | Average (mean) value in the column. |
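The metric definitions above can be illustrated with a small Python sketch. This is not the connector's implementation, which runs queries against MaxCompute over JDBC. The function name `profile_column`, the `top_n` parameter, and the choice to leave NULL values out of the value distribution are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def profile_column(values, top_n=5):
    """Compute illustrative profiling metrics for one sampled column.

    `values` holds the sampled cell values; None stands in for SQL NULL.
    """
    row_count = len(values)
    non_null = [v for v in values if v is not None]
    null_count = row_count - len(non_null)
    profile = {
        "row_count": row_count,
        "null_count": null_count,
        "null_percentage": 100.0 * null_count / row_count if row_count else 0.0,
        # Assumption: NULLs are excluded from the histogram of top values.
        "value_distribution": Counter(non_null).most_common(top_n),
    }
    # Min, Max, and Mean are only reported for numerical columns.
    is_numeric = non_null and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if is_numeric:
        profile.update(min=min(non_null), max=max(non_null), mean=mean(non_null))
    return profile


status = profile_column(["Active", "Inactive", "Active", None])
amounts = profile_column([10, 20, None, 30])
```

In this sketch, `status` carries only the metrics listed for all column types, while `amounts` also carries `min`, `max`, and `mean`, mirroring the split between the two tables.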
Example: Numerical Column Profiling
For numerical columns, the profile shows Min, Max, Mean, Null Count, Null Percentage, and Value Distribution.
Example: Non-Numerical Column Profiling
For non-numerical columns (STRING, DATE, etc.), the profile shows Null Count, Null Percentage, and Value Distribution.
Per-Project Connections
The MaxCompute connector uses per-project JDBC connections for profiling. When a table is profiled, a connection is created to that table’s specific project. This eliminates the need for cross-project permissions.
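The per-project idea can be sketched in Python as follows. This is a conceptual illustration only, not the connector's code: the class `ProjectConnectionCache`, the `fake_connect` stand-in, the endpoint URL, and the project names are hypothetical, and reusing a connection after first use is an assumption. The URL shape `jdbc:odps:<endpoint>?project=<project_name>` follows the MaxCompute JDBC driver convention as far as I know; verify it against the driver documentation before relying on it.

```python
class ProjectConnectionCache:
    """Hands out one connection per MaxCompute project, opened on first use."""

    def __init__(self, endpoint, connect):
        self._endpoint = endpoint
        self._connect = connect  # callable that opens a connection for a URL
        self._connections = {}

    def url_for(self, project):
        # Assumed JDBC URL shape: the project is part of the connection itself.
        return f"jdbc:odps:{self._endpoint}?project={project}"

    def get(self, project):
        # Open a connection scoped to this project only if none exists yet.
        if project not in self._connections:
            self._connections[project] = self._connect(self.url_for(project))
        return self._connections[project]


opened = []


def fake_connect(url):
    """Stand-in for a real JDBC connect call; records which URLs were opened."""
    opened.append(url)
    return {"url": url}


cache = ProjectConnectionCache("https://maxcompute.example.com/api", fake_connect)
sales = cache.get("sales_project")
sales_again = cache.get("sales_project")
finance = cache.get("finance_project")
```

Because each connection is bound to the project that owns the table, a profiling query never has to reach into another project, which is why the RAM user needs permissions only within each profiled project.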
Limitations
Profiling large tables may take significant time and consume MaxCompute resources.
Profiling queries use full scan mode and may scan all partitions.
Complex data types (ARRAY, MAP, STRUCT) have limited profiling support:
| Data Type | Supported Metrics | Not Supported |
|---|---|---|
| ARRAY | Null Count, Null Percentage, Row Count | Min, Max, Mean, Value Distribution (histogram) |
| MAP | Null Count, Null Percentage, Row Count | Min, Max, Mean, Value Distribution (histogram) |
| STRUCT | Null Count, Null Percentage, Row Count | Min, Max, Mean, Value Distribution (histogram) |
Note
To profile data within complex types, consider flattening the data into separate columns using views with EXPLODE or LATERAL VIEW statements, then profile the view instead.
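As a rough illustration of the flattening approach, the Python helper below assembles a view definition that uses LATERAL VIEW with EXPLODE on an ARRAY column. The helper name `build_flatten_view_sql` and the table, view, and column names are hypothetical. The generated statement is a sketch that should be checked against your MaxCompute SQL version before use. Identifiers are interpolated without escaping, so pass only trusted names.

```python
def build_flatten_view_sql(view, table, key_columns, array_column, item_alias):
    """Build a CREATE VIEW statement that yields one row per array element."""
    select_list = ", ".join(list(key_columns) + [item_alias])
    return (
        f"CREATE VIEW IF NOT EXISTS {view} AS\n"
        f"SELECT {select_list}\n"
        f"FROM {table}\n"
        f"LATERAL VIEW EXPLODE({array_column}) exploded AS {item_alias};"
    )


# Hypothetical example: an `orders` table with an ARRAY column named `tags`.
sql = build_flatten_view_sql(
    view="orders_tags_flat",
    table="orders",
    key_columns=["order_id", "customer_id"],
    array_column="tags",
    item_alias="tag",
)
print(sql)
```

After the view is created and metadata extraction has picked it up, the scalar `tag` column can be profiled like any other non-complex column.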
Troubleshooting
Profiling Failed
Ensure the RAM user has read permissions on the table.
Check MaxCompute quota and resource availability:
How to check:
Log in to the Alibaba Cloud MaxCompute Console.
Navigate to Management > Quota Management to view your project’s quota usage.
What to check:
| Resource | Description |
|---|---|
| Compute Unit (CU) quota | Verify sufficient CUs are available. Profiling queries consume compute resources. If CU quota is exhausted, queries will be queued or fail. |
| Concurrent query limit | Check if the project has reached its maximum concurrent query limit (default: 100 per project). |
| Storage quota | Ensure the project has not exceeded its storage quota, which may affect query execution. |
| SQL job status | In the MaxCompute console, go to Job Management to check if profiling jobs are queued, running, or failed. |
For more information on quotas and limits, see MaxCompute Quotas and Limits in the Alibaba Cloud documentation.
Slow Profiling
Reduce the sample size for large tables.
Profile tables during off-peak hours.
Consider profiling only critical tables.