Version 1.3.0 or Newer¶
Alation Cloud Service Applies to Alation Cloud Service instances of Alation
Customer Managed Applies to customer-managed instances of Alation
Important
This section is applicable for Alation version 2023.3.4 or higher and Google BigQuery OCF connector version 1.3.0 or higher.
Overview¶
Metadata extraction (MDE) is the process of fetching data source information, such as tables, columns, column data types, views, primary keys, foreign keys, functions, policies, tags, and procedures. Alation initiates Google BigQuery API requests to retrieve this metadata, which becomes catalog objects.
You can initiate MDE on demand or schedule it for regular catalog updates.
Configure MDE in Alation¶
Metadata extraction fetches data source information, such as datasets, tables, columns, keys, functions, and more.
Alation queries your database to retrieve this metadata, which becomes catalog objects. You can initiate MDE on demand or schedule it for regular catalog updates.
Steps involved in metadata extraction are:
Test Access and Fetch Datasets¶
Before fetching the datasets for extraction, Alation tests if the service account has the required permissions to run metadata extractions. For information on the permissions required, see Grant Permissions for Access Check in Prerequisites.
Perform these steps to test the access and fetch datasets.
Note
Ensure that the service account has the necessary permissions to access required system views and retrieve accessible datasets (see Prerequisites).
Ensure that the service account has the resourcemanager.projects.get permission to retrieve accessible projects. This test executes an API request to fetch the list of datasets.
On the Settings page of your Google BigQuery OCF data source, go to the Metadata Extraction tab.
Click Run.
The retrieved list of datasets appears in the Datasets table under the Select datasets for extraction section of the Metadata Extraction page.
Select Datasets for Extraction¶
Select datasets for extraction, to which you have access, instead of extracting all the datasets. When selecting datasets for extraction, you retrieve the metadata only for the selected ones. This makes the extraction quicker and consumes fewer resources than extracting all the datasets.
By default, all the datasets Alation fetches from the data source will be selected for extraction. You can adjust the selection of by:
Selecting Datasets using Filters
Selecting Datasets Manually
If you do not select any dataset manually or using rules, Alation extracts all the datasets upon running the metadata extraction.
Select Datasets using Filters¶
If you want to apply extraction filters, perform these steps:
On the Settings page of your Google BigQuery OCF connector, go to the Metadata Extraction tab.
Under the Select datasets for extraction section, enable the Enable advanced settings toggle.
Select the required extraction filter option from the Extract drop down:
Only selected datasets — extracts metadata only from the selected datasets. This is the default value.
All datasets except selected — extracts metadata from all Datasets except the selected Datasets.
Select the Keep the catalog synchronized with the current selection of datasets checkbox, to soft-delete the Datasets from previous extraction that are not part of the current dataset selection.
Create a filter.
From the first drop down, select Dataset or Catalog.
Select the filter criteria (Contains, Starts with, Ends with, Regex).
Specify the keyword to look for from the dataset or catalog.
Use this option if you frequently change datasets or if you use extensive metadata.
You can add multiple filters by clicking the Add another filter link.
Note
You must use rules if you plan to schedule MDE.
Click Apply filters.
The Datasets table displays the selected Datasets that match the rules that you had set.
Note
After applying rules, you cannot manually adjust the selection of datasets.
Select Datasets Manually¶
If you opt to manually select the datasets for extraction, perform these steps:
On the Settings page of your Google BigQuery OCF data source, go to the Metadata Extraction tab.
Under the Select datasets for extraction section, turn off the Enable advanced settings toggle if not disabled already.
Select the required datasets from the list of datasets in the Datasets table.
Alternatively, you can select datasets by searching for the required dataset from the table using either the dataset name or any keyword or string in the dataset name.
After you have selected the datasets, your dataset selection count is displayed above the Datasets table.
Run Extraction¶
Under the Run extraction section (General Settings > Metadata Extraction), click Run Extraction to extract metadata on demand.
The status of the extraction action is logged in the Extraction Job Status table under the MDE Job History tab.
Schedule Extraction¶
You can also schedule the extraction. To schedule the extraction, perform these steps:
On the Settings page of your Google BigQuery OCF data source, go to the Metadata Extraction.
Under the Run extraction section, enable the Enable extraction schedule toggle.
Using the date and time widgets, select the recurrence period and day and time for the desired MDE schedule. The next metadata extraction job for your data source will run on the schedule you have specified.
Note
Here are some of the recommended schedules:
Schedule extraction to run for every 12 hours at the 30th minute of the hour.
Schedule extraction to run for every 2 days at 11:30 PM.
Schedule extraction to run every week on the Sunday and Wednesday of the week.
Schedule extraction to run for every 3 months on the 15th day of the month.
View the MDE Job History¶
You can view the status of the extraction actions after you run the extraction or after Alation triggers the MDE as per the schedule. Also, you can view the status of the datasets retreived from the Test Access and Fetch Datasets step.
To view the status of extraction, go to Metadata Extraction > MDE Job History on the Settings page of your Google BigQuery OCF data source. The Extraction job status table is displayed.
The Extraction job status table logs the following status:
Did Not Start - Indicates that the metadata extraction did not start due to configuration or other issues.
Succeeded - Indicates that the extraction was successful.
Partial Success - Indicates that the extraction was successful with warnings. If Alation fails to extract some of the objects during the metadata extraction process, it skips them and proceeds with the extraction process, resulting in partial success.
Failed - Indicates that the extraction failed with errors.
Click the View Details link to view a detailed report of metadata extraction. If there are errors, the Job errors table displays the error category, error message, and a hint (ways to resolve the issue). Follow the instructions under the Hints column to resolve the error.
In some cases, Generate Error Report link is displayed above the Job errors table. Click the Generate Error Report link above the Job errors table to generate an archive (.zip) containing CSV files for different error categories, such as Data and Connection errors. Click Download Error Report to download the files.
Enable Raw Dump or Replay¶
You can enable or disable the Raw Metadata Dump or Replay feature for debugging MDE. By default, this feature is disabled. We recommend enabling it for extraction debugging only. The full use of this feature requires access to the Alation server.
If Raw Metadata Dump or Replay is enabled, Alation performs MDE into these stages:
“Dump” the extracted metadata into files. You can access and review the files on the Alation server to debug extraction issues before attempting to ingest the metadata into the catalog.
Ingest the metadata from the files into the catalog (Replay).
Both the stages are manually controlled from the user interface.
To enable the Raw Metadata Dump or Replay perform these steps:
On the Settings page of your Google BigQuery OCF data source, go to the Metadata Extraction > Troubleshooting: Enable raw dump or replay section.
From the Enable Raw Metadata Dump or Replay dropdown list, select the Enable Raw Metadata Dump option.
Click Save.
This enables the first stage of MDE where the extracted metadata is dumped into the following files in a subdirectory within the opt/alation/site/tmp/ directory on the Alation server (inside the Alation shell):
attribute.dump, function.dump, schema.dump, table.dump —in a subdirectory of the directory opt/alation/site/tmp/ on the Alation server (inside the Alation shell).
Click Run extraction.
Alation performs a raw metadata dump into files. In the Extraction job status table on the MDE Job History tab, click the View Details link to display the details of the MDE job. The log lists the location of the .dump files for the MDE job. For example: /opt/alation/site/tmp/rosemeta/170/extraction_dump/5028.
Access and review the metadata dump files to intercept any potential extraction issues.
From the Enable Raw Metadata Dump or Replay dropdown list, select the option Enable Ingestion Replay.
Click Save.
This enables the second stage where the metadata from the files is ingested into the Alation catalog.
Click Run extraction.
The metadata from the files are ingested into the catalog.