Parquet (Amazon S3)

Applies from 2021.3

Alation has certified the Parquet files on Amazon S3 data source with the following CData driver:

  • cdata.jdbc.parquet.ParquetDriver.cdata.jdbc.Parquet

The driver reads the metadata from Parquet files stored in an Amazon S3 bucket and extracts them as tables. The different files in the S3 bucket are extracted as individual tables in Alation.

Alation can provide the required CData driver license for Parquet. Refer to How to get a CData Driver and use the scenario appropriate to your case.

Scope of Support

  • Supported as Custom DB with the CData Driver for Parquet (Amazon S3)

  • Metadata Extraction (MDE)

    • Automated MDE

  • Compose

  • Data Profiling

  • Data Sampling

  • QLI is not supported

  • Lineage is not supported

Ports

ProxyPort is the property exposed by the driver in order to support the port of your Proxy Server.

Service Account

Create an AWS account to use the S3 bucket. Refer to AWS Account.

Permissions

Make sure that the service account has the following permissions:

  • ListBucket

  • ListJobs

  • ListStorageLensConfigurations

  • ListBucketVersions

  • ListBucketMultipartUploads

  • ListMultipartUploadParts

  • ListBucketVersions

  • GetObject

Required Information

  • JDBC URI for the Parquet (Amazon S3) data source

JDBC URI

When building the URI, include the following minimal list of required parameters:

  • AWSAccessKey - AWS account access key that is available on AWS security credentials page.

  • AWSSecretKey - AWS secret key that is available on AWS security credentials page.

  • URI - URI of the Amazon S3 bucket where the Parquet resources are located.

  • IncludeSubdirectories - Whether to read files from nested folders. In the case of a name collision, table names are prefixed by the underscore-separated folder names. Set the IncludeSubdirectories to True.

  • RTK Key - when you purchase a CData driver from Alation, you are provided an RTK that needs to be included into the URI. If you purchased the driver from CData, you should have a license file that needs to be placed on the Alation server together with the driver .jar file.

Below are the list of optional parameters that can be either included in the URI or in the Properties field on the General Settings page if required:

  • ProxyServer - The hostname or IP address of a proxy to route HTTP traffic through.

  • ProxyPort (If proxy connection is used) - The TCP port the ProxyServer proxy is running on.

Use the following format for the JDBC URI:

parquet://AWSAccessKey=<awsaccesskey>;AWSSecretKey=<awssecretkey>;URI=<S3Bucket_URI>;IncludeSubdirectories>=True;RTK=<RTK_Key>

Example:

parquet://AWSAccessKey=AKIATEIZM7LWKA2ZLJEZ;AWSSecretKey=LG8en7gyGQfJqrMA4izWJ0G7vkvT0Yqv/xGwdh;URI=<S3Bucket_URI>;IncludeSubdirectories=True;RTK=444752465641535552425641454E545042424D33323632390000000000000000000000000000414C5800005559475655474E4E464242370000

Use the following format for the JDBC URI if proxy connection is used:

parquet://AWSAccessKey=<awsaccesskey>;AWSSecretKey=<awssecretkey>;URI=<S3Bucket_URI>;ProxyServer="<ProxyServer_HostName>";ProxyPort="8866";ProxySSLType="NEVER”;IncludeSubdirectories=True;RTK=<RTK_Key>

Example:

parquet://AWSAccessKey=AKIATEIZM7LWKA2ZLJEZ;AWSSecretKey=LG8en7gyGQfJqrMA4izWJ0G7vkvT0Yqv/xGwdh;URI=<S3Bucket_URI>;ProxyServer="Localhost";ProxyPort="8866";ProxySSLType="NEVER";IncludeSubdirectories=True;RTK=444752465641535552425641454E545042424D33323632390000000000000000000000000000414C5800005559475655474E4E464242370000

Set Up in Alation

Step 1: Add the CData Driver for Parquet to Alation

Depending on how you purchased the CData driver, from Alation or from CData, the driver installation process will be different. Refer to Add the CData Driver into Alation Instance and use the scenario appropriate to your case.

STEP 2: Add a New Data Source

Add a new Data Source on the Sources page.

Step 3: Set up the Connection

  1. On the Add a Data Source screen of the wizard, specify:

    • Database Type: Custom DB

    • JDBC URI: URI in the required format. See JDBC URI.

    • Select Driver: select the JDBC driver for Parquet from the Select Driver drop-down list:

      cdata.jdbc.parquet.ParquetDriver.cdata.jdbc.Parquet

    • Do not select the Use Kerberos checkbox

  2. Click Save and Continue. The next wizard screen - Set Up a Service Account - will open.

    ../../_images/Parquet_01.png

Step 4: Enter Service Account Credentials

  1. Select Yes.

  2. Provide the username and password of the service account created for Alation.

  3. Click Save and Continue. The next wizard screen, Configure Your Data Source, will open.

Step 5: Configure Your Data Source

Click Skip this Step. After this step, you are navigated to the Settings page of your data source.

General Settings

Once you successfully create the connection, you will land on the General Settings page.

Note

If you click the Test button under Network Connection, an error message Data Source URI has no host specified will appear instead of the connection test status. Please disregard this message. This is currently expected behavior.

Metadata Extraction

Configure and perform metadata extraction and verify the results:

  • In Settings > Custom Settings, set the Catalog Object Definition to Catalog.Schema.Table:

    ../../_images/Parquet_02.png
  • In Settings > Metadata Extraction, set up and perform MDE. Refer to Metadata Extraction.

Profiling

Configure and perform Sampling and Profiling:

  • Users can run a sample for an individual table on the Samples tab of the Table Catalog page or profile an individual column on the Overview tab of the Column page.

  • Automatic full and selective Profiling is supported.

  • Use the Per-Object Parameters tab of the Settings to specify which objects to profile.

    Note

    Make sure that the Skip Views checkbox of the respective schemas is unchecked to perform the Profiling.

  • Custom query-based Sampling is supported. Custom Query-Based Sampling allows you to provide a custom query for profiling each specific table.

  • Deep Column Profiling (Profiling V2 ) is supported.

Query Log Ingestion

Not supported.

Compose

Log into Compose:

  • Authenticate in Compose with your Amazon S3 credentials.

  • Use the Catalog.Schema.Table format for writing queries.

Troubleshooting

Logs to collect/review:

  • For logs related to MDE: taskserver.log, taskserver_err.log.

  • For logs related to Compose: connector.log, connector_err.log.

  • For any other errors: alation-error.log, alation-debug.log

To show driver activity from query execution to network traffic, use Logfile and Verbosity. At the end of the JDBC URI, add Logfile=/tmp/log.file;Verbosity=3; which will generate the log file in the specified directory. Set the Verbosity level as required, refer to Logging.

Contact Alation support for help tracing the source of an error or to avoid a performance issue. Following are the examples of common connection errors and show how to use these properties to get more context.

  • Authentication Errors: Recording a Logfile at Verbosity 4 is necessary to get full details on an authentication error.

  • Queries Time Out: A server that takes too long to respond will exceed the driver’s client-side timeout. Setting the Timeout property to a higher value will avoid the connection error. Also you can disable the timeout by setting the property to 0 and setting the Verbosity to 2 will show where the time is spent.

  • The certificate presented by the server cannot be validated: This error indicates that the driver cannot validate the server’s certificate through the chain of trust. If you are using a self signed certificate, there is only one certificate in the chain.

To resolve this error, you must verify yourself that the certificate can be trusted and specify to the driver that you trust the certificate. One way you can specify that you trust a certificate is to add the certificate to the trusted system store; another is to set SSLServerCert.