Lineage Extraction (Beta)

Applies to: Alation Cloud Service instances and customer-managed instances of Alation

For Databricks Unity Catalog data sources, Alation calculates lineage from two inputs: lineage metadata extracted from the Databricks system lineage tables, and data definition language (DDL) query statements run in Compose. Extraction from the system lineage tables, called direct lineage extraction, is available as a beta feature.

Direct Lineage Extraction

With direct lineage extraction, the connector extracts lineage directly from the Databricks system tables that store it. The extraction is triggered automatically as a downstream job that depends on the metadata extraction job. Query history, joins, filters, and popularity information are not available from direct lineage extraction because they require query log ingestion (QLI).

For lineage data to be generated, the service account needs access to the system tables that store lineage. See Grant Permissions for Lineage Extraction.

After you have granted the service account these permissions, lineage will be extracted automatically. No additional configuration is required on the Alation side.

Direct Lineage Feature Flags

Direct lineage extraction from OCF data sources is enabled by default. It is controlled by two alation_conf flags:

  • alation.ocf.mde.direct_lineage.enable_extraction—Enables or disables the direct lineage feature for all data sources in the catalog that support it.

  • alation.ocf.mde.direct_lineage.incremental_extraction—Enables or disables incremental lineage extraction. This flag only applies if the main feature flag alation.ocf.mde.direct_lineage.enable_extraction is set to True.

For more on alation_conf, see Using alation_conf.

For more details on incremental lineage, see Incremental Lineage Extraction below.
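The interaction between the two flags can be summarized in a small Python sketch. This is an illustration only, not Alation code: the DirectLineageFlags class, the extraction_mode function, and the mode names are hypothetical, and the sketch assumes only what is stated above, namely that the incremental flag has no effect unless the main flag is set to True.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectLineageFlags:
    """Hypothetical container for the two alation_conf flag values."""
    # Mirrors alation.ocf.mde.direct_lineage.enable_extraction
    enable_extraction: bool
    # Mirrors alation.ocf.mde.direct_lineage.incremental_extraction
    incremental_extraction: bool


def extraction_mode(flags: DirectLineageFlags) -> str:
    """Return 'disabled', 'incremental', or 'full' for a flag combination."""
    if not flags.enable_extraction:
        # The incremental flag is ignored when the main flag is off.
        return "disabled"
    return "incremental" if flags.incremental_extraction else "full"


if __name__ == "__main__":
    for main in (True, False):
        for incremental in (True, False):
            flags = DirectLineageFlags(main, incremental)
            print(flags, "->", extraction_mode(flags))
```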

Capturing Lineage

On the Databricks side, DDL query runs generate lineage records in the lineage tables in the Unity Catalog metastore: system.access.table_lineage and system.access.column_lineage.
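To check what lineage Databricks has recorded before troubleshooting on the Alation side, you can query the system lineage tables yourself. The Python sketch below only builds the SQL text; run it with whatever Databricks SQL client you already use. The function name is hypothetical, and the selected columns other than event_time are assumptions about the system table schema, so verify them against the Databricks documentation for your workspace.

```python
# The two lineage tables named in this section.
LINEAGE_TABLES = {
    "table": "system.access.table_lineage",
    "column": "system.access.column_lineage",
}


def recent_lineage_query(kind: str, limit: int = 20) -> str:
    """Build a SQL statement that lists the most recent lineage records."""
    if kind not in LINEAGE_TABLES:
        raise ValueError(f"kind must be one of {sorted(LINEAGE_TABLES)}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Column names other than event_time are assumed, not taken from this page.
    return (
        "SELECT source_table_full_name, target_table_full_name, "
        "entity_type, event_time "
        f"FROM {LINEAGE_TABLES[kind]} "
        "ORDER BY event_time DESC "
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(recent_lineage_query("table", limit=10))
```

The service account used by Alation must be able to run the same kind of query, which is why the permissions on the system tables are required.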

In Alation, direct lineage extraction is triggered as a downstream job after metadata extraction (MDE). The direct lineage extraction job reads the system lineage tables and extracts and ingests lineage information into Alation. Ingested lineage will become available on the Lineage tab of the catalog pages of data objects (tables and views) under the Databricks Unity Catalog OCF data source.

Direct lineage extraction depends on the lineage capture feature in Databricks and inherits its requirements and limitations. If specific records are not available in the system lineage tables because of limitations on the Databricks side, they will not be available in Alation either. Review the requirements and limitations in the Databricks documentation.

Note

  • Lineage records are stored in the system lineage tables in Databricks for 30 days. Dropping a view or table that has lineage does not immediately remove its associated lineage. Alation will still show the object on the Lineage diagram, while the corresponding catalog page will display the message This object appears to have been removed from the data source.

  • Altering the columns of a table or view after lineage has been created does not alter the existing lineage records.

Dataflow Content From Direct Lineage Extraction

Dataflow objects generated by lineage extraction will not show the SQL queries. The Dataflow Content field will contain the URL of the Databricks entity that generated the lineage (a notebook, a dashboard, a workflow, or a Databricks SQL query).

For more on Dataflow objects, see Dataflow Objects.

Note

  • If the JDBC URI was created with VPC enabled, you must have access to that VPC to open the URL of the Databricks entity from outside the VPC.

Lineage from Compose Queries

In addition to lineage extracted from Databricks, Alation will capture lineage from DDL SQL queries executed in Compose. The Compose queries will be available in the Dataflow Content field on the Lineage diagram.

Important

For lineage to be captured, use multipart names when referencing schemas, tables, and views in Compose, for example catalog.schema.table.
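As a quick way to review queries before running them in Compose, a check like the following Python sketch can flag table or view references that are not in three-part form. The function name is hypothetical, and the check is deliberately simple: it assumes plain, unquoted identifiers and does not handle backtick-quoted names that contain dots.

```python
def is_three_part_name(name: str) -> bool:
    """Return True if name looks like catalog.schema.table (unquoted identifiers only)."""
    parts = name.split(".")
    # Exactly three parts, none of them empty or whitespace-only.
    return len(parts) == 3 and all(part.strip() for part in parts)


if __name__ == "__main__":
    for candidate in ("main.sales.orders", "sales.orders", "orders"):
        print(candidate, is_three_part_name(candidate))
```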

Incremental Lineage Extraction

Incremental lineage extraction is available from connector version 1.1.0.4393 and Alation release 2023.1.4.

Incremental lineage extraction supplements direct lineage extraction. Each lineage extraction job creates a timestamp "bookmark" and stores it in Postgres. The bookmark is the latest event_time value found in the system lineage tables. During the next MDE, Alation extracts only those lineage records whose event_time value is later than the bookmark. For example, if the initial MDE job extracts 50 lineage records into the Alation catalog on day one and creates a bookmark, the next MDE extracts only the records with a later event_time, adding lineage records incrementally. The same extraction job then creates a new bookmark to be used during the subsequent MDE.
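The bookmark logic can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the connector's implementation: LineageRecord and extract_incrementally are hypothetical names, the records are held in memory rather than read from Databricks, and the bookmark is passed around as a value instead of being stored in Postgres.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional, Tuple


class LineageRecord(NamedTuple):
    """Hypothetical, minimal stand-in for a row in a system lineage table."""
    source: str
    target: str
    event_time: datetime


def extract_incrementally(
    records: Iterable[LineageRecord],
    bookmark: Optional[datetime],
) -> Tuple[List[LineageRecord], Optional[datetime]]:
    """Return records newer than the bookmark, plus the bookmark for the next run."""
    # With no bookmark yet (first run), every record qualifies.
    new_records = [
        r for r in records if bookmark is None or r.event_time > bookmark
    ]
    # The new bookmark is the latest event_time seen; keep the old one if nothing is new.
    new_bookmark = max((r.event_time for r in new_records), default=bookmark)
    return new_records, new_bookmark


if __name__ == "__main__":
    rows = [
        LineageRecord("main.raw.orders", "main.core.orders", datetime(2024, 1, 1)),
        LineageRecord("main.core.orders", "main.mart.sales", datetime(2024, 1, 2)),
    ]
    first_batch, bookmark = extract_incrementally(rows, None)
    rows.append(LineageRecord("main.mart.sales", "main.mart.kpi", datetime(2024, 1, 3)))
    second_batch, bookmark = extract_incrementally(rows, bookmark)
    print(len(first_batch), len(second_batch), bookmark)
```

In this model the second run returns only the record added after the first bookmark, which is the behavior described for consecutive MDE jobs.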

If the incremental lineage feature flag is disabled, the MDE job will extract all available lineage records but only ingest the records that were not previously extracted and are not present in Alation. This may lengthen the MDE job, depending on how much metadata you are extracting.
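The non-incremental behavior can be modeled the same way. In the Python sketch below, every record is read on each run and only the unseen ones are ingested. The function name is hypothetical, and treating the whole (source, target, event_time) tuple as the record identity is an assumption made for illustration; how Alation actually recognizes previously ingested records is not described here.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record shape: (source, target, event_time as an ISO string).
Record = Tuple[str, str, str]


def ingest_full_extraction(
    extracted: Iterable[Record],
    already_ingested: Set[Record],
) -> List[Record]:
    """Read every extracted record, ingest only those not seen before."""
    newly_ingested: List[Record] = []
    for record in extracted:
        # Every record is examined, which is why this mode can take longer.
        if record not in already_ingested:
            already_ingested.add(record)
            newly_ingested.append(record)
    return newly_ingested


if __name__ == "__main__":
    catalog: Set[Record] = set()
    run_one = [("main.raw.a", "main.core.a", "2024-01-01T00:00:00")]
    run_two = run_one + [("main.core.a", "main.mart.a", "2024-01-02T00:00:00")]
    print(len(ingest_full_extraction(run_one, catalog)))
    print(len(ingest_full_extraction(run_two, catalog)))
```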

See Direct Lineage Feature Flags for information on the feature flag that controls incremental lineage extraction.