Move the Alation Database to AWS RDS¶
Applies to customer-managed instances of Alation
Available from version 2023.3.5
Overview¶
Alation uses a PostgreSQL database to store customer metadata and other information, such as some event data. Due to the amount of data stored in Alation, this internal database can grow quite large, which can result in performance issues for backups or migrations. To free up space on an Alation customer-managed server, Alation versions 2023.3.5 and later offer server administrators the ability to extract the database from the server and store it on a separate Amazon Web Services (AWS) Relational Database Service (RDS) instance.
We recommend that the RDS instance be located in the same VPC and region as your Alation instance.
Alation has two PostgreSQL databases that can be moved in this way. Many of the steps below must be done for both databases:
rosemeta, the main database
the lineage database
To move the internal Postgres databases to RDS, the databases are first copied to the RDS instance. Alation and its internal Postgres databases continue to function as normal during this process. We use a logical replication process that continues to stream database changes to the new copy until all data has been moved. Once a database copy on RDS is ready, you point Alation to that database, and the internal database becomes unused.
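Under the hood this is standard PostgreSQL logical replication: the internal database publishes changes and the RDS copy subscribes to them. During the migration (Step 5), if you want an extra check beyond the helper scripts, the state of that stream can be observed through PostgreSQL's built-in statistics views. The commands below are an optional illustration only, run from the Alation shell and against your RDS endpoint respectively:
# On the internal (source) Postgres: active replication connections and WAL positions
psql -d rosemeta -U alation -c "SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# On the RDS (target) side: subscription workers and how far they have applied changes
PGPASSWORD=<ALATION_PASSWORD> psql -h <RDS_ENDPOINT> -U alation -d rosemeta -c "SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;"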
We recommend taking a backup before starting this procedure. As a precaution, keep the backup after completing this procedure.
This procedure was designed to cause as little disruption as possible. Most of the procedure can be done with no downtime, and users can continue using Alation without trouble. When a step does require downtime, it will be noted.
The overall process goes like this:
Step 1: Delete Old Data
Step 2: Find and Remove Corrupted Data
Step 3: Gather Information for the RDS Instance
Step 4: Create the RDS Instance
Step 5: Migrate the Data to the RDS Instance
Step 6: Switch Alation to Use the RDS Instance
You can also Troubleshoot the Migration and Maintain Alation with the Database on an AWS RDS Instance.
Each step is described in detail below.
Step 1: Delete Old Data¶
Note
This step can be done without any downtime. Alation users can continue using Alation as usual.
This step is optional but highly recommended. To make the migration to RDS faster and more reliable, we recommend deleting old data from certain large tables. We have provided a script to delete specific data from specific tables. The following table shows which database tables are affected and how far back data is deleted.
Table | Delete Data Older Than
---|---
 | 6 months
 | 5 months
 | 30 days
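For context, what the script does is time-based deletion: rows older than the retention window in each table are removed. Conceptually that is a delete of the shape sketched below, shown against a hypothetical table and timestamp column purely for illustration; always use the provided scan_prune.pyc script rather than hand-written deletes.
# Hypothetical example only -- the real tables, columns, and retention windows are handled by scan_prune.pyc
psql -d rosemeta -U alation -c "DELETE FROM some_event_table WHERE created_at < now() - interval '6 months';"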
To delete old data:
Use SSH to connect to the Alation server.
Enter the Alation shell using the following command:
sudo /etc/init.d/alation shell
Switch to the alation user:
sudo su alation
Run the pre-validation SQL queries and save the results in a separate file:
psql -d rosemeta -U alation -f /opt/alation/ops/postgres_rds_migration/pruning_scripts/pre-validation.sql > /tmp/pre-validation.out
Run the scan_prune.pyc script to delete the old data:
python /opt/alation/django/rosemeta/one_off_scripts/scan_prune.pyc -p purge -i metadata metrics qli > /tmp/scan_prune.log 2>&1
Run the post-validation SQL queries and save the results in a separate file:
psql -d rosemeta -U alation -f /opt/alation/ops/postgres_rds_migration/pruning_scripts/post-dbvalidation.sql > /tmp/post-dbvalidation.out
Search the log file at /tmp/scan_prune.log for the following message:
--- Finished execution of scan problematic tables ---
If you find this message, it indicates that everything is good, and you can skip to Step 2: Find and Remove Corrupted Data.
If you don’t find the finished message, do the following:
Search scan_prune.log for the following message:
The difference between one of pointer table is greater than tolerated difference of
If you find this message in the log, something has gone wrong. Stop here, and contact Alation Support. Send the pre-validation.out, scan_prune.log, and post-dbvalidation.out files.
If you don’t find this message, do the next step.
Search scan_prune.log for the following message:
At this point, The QLI archival did not clear any data. Please review
If you find that message in the log, check for the following messages:
INFO - There's close to 0 records to process for rosemeta_executionevent.
INFO - There's close to 0 records to process for rosemeta_executioneventmention.
INFO - There's close to 0 records to process for rosemeta_executioneventexpressionmention.
If there are close to 0 records to process, you can proceed to Step 2: Find and Remove Corrupted Data.
If not, stop here, and contact Alation Support. Send the pre-validation.out, scan_prune.log, and post-dbvalidation.out files.
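If you prefer to check for these messages from the command line, you can grep the log directly; the strings are the ones quoted above:
grep -e '--- Finished execution of scan problematic tables ---' /tmp/scan_prune.log
grep -e 'greater than tolerated difference' /tmp/scan_prune.log
grep -e 'QLI archival did not clear any data' /tmp/scan_prune.log
grep -e 'close to 0 records to process' /tmp/scan_prune.log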
Step 2: Find and Remove Corrupted Data¶
Note
This step can be done without any downtime. Alation users can continue using Alation as usual.
To ensure the migration works successfully, you have to make sure there’s no corrupted data in Alation’s PostgreSQL databases. If there is corrupted data, it has to be removed.
To find and remove corrupted data:
If you’re not on the Alation server already:
Use SSH to connect to the Alation server.
Enter the Alation shell using the following command:
sudo /etc/init.d/alation shell
Switch to the alation user:
sudo su alation
Run the script db_scan_parallel.pyc against the rosemeta database:
python /opt/alation/ops/postgres_rds_migration/scanning_script/db_scan_parallel.pyc -b /bin/ -h /tmp -p 5432 -U alation -d rosemeta
Note
This may take some time. For example, on a t3.2xlarge instance with a 364 GB PostgreSQL database, it may take about 30 minutes.
Examine the output from db_scan_parallel.pyc.
If the scan detected no corruption, you’ll see this at the end of the output:
Total tables successfully dumped: <#>, failed_tables: []
If you see a message indicating that some tables failed, it means some rows in those tables are corrupted and need to be removed. Make a list of each failed table.
Example failure message:
(ERROR) PID=31625: b'pg_dump: error: Dumping the contents of table "data_storage_blobaccesspostgres" failed: PQgetResult() failed.\npg_dump: error: Error message from server: ERROR: missing chunk number 0 for toast value 3523697 in pg_toast_3131713\npg_dump: error: The command was: COPY public.data_storage_blobaccesspostgres (id, model_code, blob_key, blob_value) TO stdout;'
From the message above, you would take note that the table data_storage_blobaccesspostgres failed.
Run the script db_scan_parallel.pyc against the lineage database:
python /opt/alation/ops/postgres_rds_migration/scanning_script/db_scan_parallel.pyc -b /bin/ -h /tmp -p 5432 -U alation -d lineage
Examine the output from db_scan_parallel.pyc.
If the scan detected no corruption, you’ll see this at the end of the output:
Total tables successfully dumped: <#>, failed_tables: []
If you see a message indicating that some tables failed, it means some rows in those tables are corrupted and need to be removed. Make a list of each failed table.
Example failure message:
(ERROR) PID=31625: b'pg_dump: error: Dumping the contents of table "data_storage_blobaccesspostgres" failed: PQgetResult() failed.\npg_dump: error: Error message from server: ERROR: missing chunk number 0 for toast value 3523697 in pg_toast_3131713\npg_dump: error: The command was: COPY public.data_storage_blobaccesspostgres (id, model_code, blob_key, blob_value) TO stdout;'
Exit from the alation user:
exit
Did the scans find corruption?
If db_scan_parallel.pyc found no corruption in either rosemeta or lineage, you can skip to Step 3: Gather Information for the RDS Instance.
If either scan did find corruption, proceed with the remaining steps below.
If the scan found corruption in one or more tables, you need to identify exactly which rows were corrupted. To do this, we have provided a find_bad_rows function with your Alation installation. To use the function, first switch to the postgres user:
sudo su postgres
Add the find_bad_rows function to Postgres:
psql -h /tmp -d rosemeta < /opt/alation/ops/postgres_rds_migration/corruption_check/find_bad_rows.sql
Execute find_bad_rows as shown below, replacing <FAILED_TABLE_NAME> with the name of a failed table:
psql -h /tmp -d rosemeta -c "select find_bad_rows('public.<FAILED_TABLE_NAME>');"
The function will output the ctid for each corrupted row. Example output:
NOTICE: Corrupted ctid: (56,70)
NOTICE: XX001: missing chunk number 0 for toast value 3523697 in pg_toast_3131713
NOTICE: Corrupted ctid: (57,109)
NOTICE: XX001: missing chunk number 0 for toast value 3523702 in pg_toast_3131713
Note
This step may take a long time, depending on the size of the tables.
Double check each row to make sure it’s corrupted. Run the following command for each corrupted row:
psql -h /tmp -d rosemeta -c "SELECT \* FROM public.<FAILED_TABLE_NAME> where ctid = '(<ROW_CTID>)'"
Example output:
missing chunk number 0 for toast value 3523697 in pg_toast_3131713
Delete the corrupted rows. Run the following command for each corrupted row:
psql -h /tmp -d rosemeta -c "DELETE FROM public.<FAILED_TABLE_NAME> where ctid = '(<ROW_CTID>)'"
Example output:
DELETE 1
Repeat the prior two steps for each corrupted row in the table.
Repeat the prior three steps for each failed table.
Exit from the postgres user:
exit
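Once the corrupted rows are gone, it is worth confirming the cleanup by re-running the same scans from earlier (as the alation user) and checking that the failed_tables list at the end of the output is empty:
python /opt/alation/ops/postgres_rds_migration/scanning_script/db_scan_parallel.pyc -b /bin/ -h /tmp -p 5432 -U alation -d rosemeta
python /opt/alation/ops/postgres_rds_migration/scanning_script/db_scan_parallel.pyc -b /bin/ -h /tmp -p 5432 -U alation -d lineage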
Step 3: Gather Information for the RDS Instance¶
Note
This step can be done without any downtime. Alation users can continue using Alation as usual.
In this step, you’ll make sure you have all the information needed to create the RDS instance.
Ensure that you have AWS keys to create the RDS instance using AWS APIs.
Ensure that the machine you’re using has access to the AWS VPC where you are creating the RDS instance. We recommend that the RDS instance be located in the same VPC and region as your Alation instance.
Determine how much storage your RDS instance will require.
If you’re not on the Alation server already:
Use SSH to connect to the Alation server.
Enter the Alation shell using the following command:
sudo /etc/init.d/alation shell
Enter the PostgreSQL shell:
alation_psql
Display the sizes of the lineage and rosemeta databases:
\l+
The output will look something like this:
                                               List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   |   Size    | Tablespace |                Description
-----------+----------+----------+-------------+-------------+-----------------------+-----------+------------+--------------------------------------------
 lineage   | alation  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 5774 MB   | pg_default |
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 8221 kB   | pg_default | default administrative connection database
 rdsadmin  | rdsadmin | UTF8     | en_US.UTF-8 | en_US.UTF-8 | rdsadmin=CTc/rdsadmin+| No Access | pg_default |
           |          |          |             |             | rdstopmgr=Tc/rdsadmin |           |            |
 rosemeta  | alation  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 177 GB    | pg_default |
 template0 | rdsadmin | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/rdsadmin          +| 8041 kB   | pg_default | unmodifiable empty database
           |          |          |             |             | rdsadmin=CTc/rdsadmin |           |            |
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +| 8213 kB   | pg_default | default template for new databases
           |          |          |             |             | postgres=CTc/postgres |           |            |
(6 rows)
Look in the Size column, add the sizes of the lineage and rosemeta databases together, and then double that number.
Example: In the example output above, lineage is 5774 MB and rosemeta is 177 GB, so the total is about 183 GB. Doubled, that is 366 GB.
If your final number is less than 20 GB, plan to use the minimum of 20 GB. The maximum is 16384 GB. (A one-line alternative that computes the combined size with pg_database_size is sketched at the end of this step.)
Exit the PostgreSQL shell:
\q
Determine the RDS instance class required to hold your PostgreSQL database. The table below shows some recommendations based on the size of your database, concurrent user load, and Alation instance size.

PostgreSQL Object Count | Concurrent User Load | Alation Host Size on AWS | RDS Instance Type Required
---|---|---|---
Up to 5 million | 150 users | m5a.4xl | db.m6g.large
Up to 20 million | 150 users | m5a.8xl | db.m5.2xlarge
Over 20 million | 150 to 800 users | m5a.12xl | db.m5.4xlarge
Figure out how many CPUs you have on your Alation server. This will determine how many migration workers you can use. In the Alation shell, run this command:
lscpu | grep -E '^CPU\('
Example output:
CPU(s): 8
Save this number for use in a later step.
Generate two passwords:
One for RDS administration, referred to as ALATION_RDS_ADMIN_PASSWORD in these instructions.
One for the PostgreSQL user that Alation will use to connect to the migrated PostgreSQL database, referred to as ALATION_PASSWORD in these instructions.
You can generate these passwords however you like, or you can run the following command on Linux twice to generate two random 16-character passwords:
cat /dev/urandom | tr -dc A-Za-z | head -c 16 ; echo ''
Save these two passwords in a secure location. You will need them in later steps.
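If you would rather compute the combined database size in one line instead of reading the \l+ output, PostgreSQL's pg_database_size function can be used from the Alation shell. This is an optional convenience, not part of the official procedure; remember to double the reported total when sizing the RDS storage, as described above.
psql -d rosemeta -U alation -c "SELECT pg_size_pretty(pg_database_size('rosemeta') + pg_database_size('lineage'));"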
Step 4: Create the RDS Instance¶
Note
This step can be done without any downtime. Alation users can continue using Alation as usual.
You’ll now create the AWS RDS instance that you’ll move the Alation database to. We have provided some Terraform scripts in the Alation installation that will automatically create the RDS instance.
To create the RDS instance from the Alation server:
If you’re not on the Alation server already:
Use SSH to connect to the Alation server.
Enter the Alation shell using the following command:
sudo /etc/init.d/alation shell
Install Terraform if it’s not already installed. You can find installers on Terraform’s installation page. As an example, on an AMD64 Linux machine, you could install Terraform like this:
sudo curl -O https://releases.hashicorp.com/terraform/1.0.6/terraform_1.0.6_linux_amd64.zip
sudo unzip terraform_1.0.6_linux_amd64.zip
sudo mv terraform /usr/bin/
Switch to the alation user:
sudo su alation
In the terminal, set the environment variables that the Terraform scripts rely on. Replace the angle brackets and their contents with your own values:
export AWS_ACCESS_KEY_ID=<ADMIN_API_KEY>
export AWS_SECRET_ACCESS_KEY=<ADMIN_SECRET_ACCESS_KEY>
export TF_VAR_alation_region=<ALATION_AWS_REGION>
export TF_VAR_on_prem_alation_ec2_id=<ALATION_EC2_INSTANCE_ID such as i-0f76425cf32057e81>
export TF_VAR_rds_admin_password=<ALATION_RDS_ADMIN_PASSWORD>
export TF_VAR_allocated_storage=<STORAGE_SIZE_IN_GB>
export TF_VAR_multi_az=true
export TF_VAR_rds_instance_class=<RDS_INSTANCE_CLASS>
export TF_VAR_storage_type=gp3
Go into the folder containing the Terraform scripts:
cd /opt/alation/ops/postgres_rds_migration/rds_setup_terraform/
Initialize Terraform:
terraform init
Validate that the Terraform configuration is valid:
terraform validate
If you notice any errors in the output, contact Alation Support.
Verify that the Terraform plan is accurate:
terraform plan
If you notice any errors in the output, contact Alation Support.
Run the Terraform script:
terraform apply --auto-approve
This will automatically create an RDS instance and display its RDS endpoint, which you will need in later steps. This step may take several minutes. Near the end of the output you will see the RDS endpoint. Example output:
data.dns_a_record_set.rds_endpoint: Read complete after 0s [id=onprem-alation-i-02ad22d793b9d6dfa-rds.calwlq0balzb.us-west-2.rds.amazonaws.com]
The value after id= is the RDS endpoint. In this example, it is onprem-alation-i-02ad22d793b9d6dfa-rds.calwlq0balzb.us-west-2.rds.amazonaws.com.
Now that the RDS instance exists, you can migrate the Alation database to it. Proceed to Step 5: Migrate the Data to the RDS Instance below.
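Optionally, before starting the migration, confirm that the Alation server can reach the new RDS instance. A minimal connectivity check, assuming the Terraform scripts created the admin user as postgres (the same user used in the reset steps later in this document), looks like this:
PGPASSWORD=<ALATION_RDS_ADMIN_PASSWORD> psql -h <RDS_ENDPOINT> -U postgres -c "SELECT version();"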
Step 5: Migrate the Data to the RDS Instance¶
Important
This step requires about 10 minutes of downtime.
If this process fails for any reason, you can reset the RDS instance using the steps under Reset the RDS Instance. Then try this section again.
In the following steps, you will copy the rosemeta and lineage databases to your RDS instance.
If you’re not in the Alation shell already:
Use SSH to connect to the Alation server.
Enter the Alation shell using the following command:
sudo /etc/init.d/alation shell
Switch to the alation user:
sudo su alation
Use a text editor to replace the contents of setup_internal_postgres.pyc with the following:
# /opt/alation/env/bin/python
import socket
import sys
import pty
import subprocess

EXPECTED_POSTGRES_VERSIONS = {"13.1", "13.6", "13.8", "13.11"}
PGLOGICAL_PACKAGE = "pglogical"


def get_ip(host):
    return socket.gethostbyname(host)


def execute_command(bash_command, shell_or_die=False):
    master_fd, slave_fd = pty.openpty()
    sp = subprocess.Popen(
        bash_command,
        stdin=slave_fd,
        stderr=subprocess.PIPE,
        stdout=subprocess.PIPE,
        shell=True
    )
    out, err = sp.communicate()
    if shell_or_die and sp.returncode != 0:
        raise Exception("Failed: %s", out.decode(), err.decode())
    return sp.returncode, out.decode(), err.decode()


def version_check():
    cmd = 'sudo -u postgres psql -qtAX -h /tmp -d rosemeta -c "SHOW server_version;"'
    code, stdout, stderr = execute_command(cmd)
    if code != 0:
        raise Exception("Cannot get postgres version")
    elif stdout.strip() not in EXPECTED_POSTGRES_VERSIONS:
        raise Exception(
            "Its version %s is not equal to %s" % (stdout.strip(), str(EXPECTED_POSTGRES_VERSIONS))
        )
    print("## Its version is %s" % str(EXPECTED_POSTGRES_VERSIONS))


def setup_postgresql_auto_conf(num_of_wall_senders):
    cmd = "sudo -u postgres cat /data1/pgsql/13/data/postgresql.auto.conf"
    code, stdout, stderr = execute_command(cmd)
    if code != 0:
        raise Exception("Cannot cat /data1/pgsql/13/data/postgresql.auto.conf")
    if "listen_addresses = '*'" not in stdout:
        cmd = "sudo -u postgres -- sh -c $'echo \"listen_addresses = \\'*\\'\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
        execute_command(cmd, shell_or_die=True)
    if "wal_level = logical" not in stdout:
        cmd = "sudo -u postgres -- sh -c $'echo \"wal_level = logical\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
        execute_command(cmd, shell_or_die=True)
    if "max_replication_slots = %s" % num_of_wall_senders not in stdout:
        cmd = (
            "sudo -u postgres -- sh -c $'echo \"%s\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
            % ("max_replication_slots = %s" % num_of_wall_senders)
        )
        execute_command(cmd, shell_or_die=True)
    if "max_wal_senders = %s" % num_of_wall_senders not in stdout:
        cmd = (
            "sudo -u postgres -- sh -c $'echo \"%s\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
            % ("max_wal_senders = %s" % num_of_wall_senders)
        )
        execute_command(cmd, shell_or_die=True)
    if "shared_preload_libraries = 'pglogical'" not in stdout:
        cmd = "sudo -u postgres -- sh -c $'echo \"shared_preload_libraries = \\'pglogical\\'\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
        execute_command(cmd, shell_or_die=True)
    print("## Completed postgresql.auto.conf update")


def setup_pg_hba_conf(rds_private_ip):
    cmd = "sudo -u postgres cat /data1/pgsql/13/data/pg_hba.conf"
    code, stdout, stderr = execute_command(cmd)
    if code != 0:
        raise Exception("Cannot cat /data1/pgsql/13/data/pg_hba.conf")
    if rds_private_ip not in stdout:
        cmd = (
            "sudo -u postgres -- sh -c 'echo \"host all all %s/32 trust\" >> /data1/pgsql/13/data/pg_hba.conf'"
            % rds_private_ip
        )
        execute_command(cmd, shell_or_die=True)
    if "host all all 127.0.0.1/0 trust" not in stdout:
        cmd = "sudo -u postgres -- sh -c $'echo \"host all all 127.0.0.1/0 trust\" >> /data1/pgsql/13/data/pg_hba.conf'"
        execute_command(cmd, shell_or_die=True)
    if "host all alation ::1/128 trust" not in stdout:
        cmd = "sudo -u postgres -- sh -c $'echo \"host all alation ::1/128 trust\" >> /data1/pgsql/13/data/pg_hba.conf'"
        execute_command(cmd, shell_or_die=True)
    print("## Completed pg_hba.conf update")


def restart_postgres():
    cmd = "alation_action stop_postgres"
    execute_command(cmd, shell_or_die=True)
    cmd = "alation_action start_postgres"
    execute_command(cmd, shell_or_die=True)
    print("## Restart completed")


def main(argv):
    if len(argv) < 2:
        raise Exception("python setup_internal_postgres.py <rds_endpoint> <num_of_wall_senders>")
    rds_endpoint, num_of_wall_senders = argv[0], argv[1]
    rds_private_ip = get_ip(rds_endpoint)
    print(rds_private_ip)
    version_check()
    setup_postgresql_auto_conf(num_of_wall_senders)
    setup_pg_hba_conf(rds_private_ip)
    restart_postgres()


if __name__ == "__main__":
    main(sys.argv[1:])
Important
The next step requires Alation to be down for about ten minutes.
Still as the alation user, run the setup_internal_postgres.pyc script and provide the RDS endpoint and number of CPUs you identified earlier:
python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/setup_internal_postgres.pyc <RDS_ENDPOINT> <#_CPUs>
Example output:
## Its version is {'13.11', '13.6', '13.1', '13.8'}
## Completed postgresql.auto.conf update
## Completed pg_hba.conf update
## Restart completed
Note
Items 5–7 apply to the rosemeta database. Items 8–10 apply to lineage. These steps can take a long time. You can follow the items for rosemeta in parallel to the items for lineage.
To migrate the rosemeta database to your new RDS instance, run the migrate_postgres_to_rds.pyc script as shown below. This will sync the data to your RDS instance. We use a logical replication process that continues to stream database changes to the new copy until all data has been moved. This ensures no data will be lost.
Important
This process runs asynchronously. It may take a long time. As an example, if you have:
An Alation instance with 8 cores and 32 GB of memory
An RDS instance with 4 cores and 16 GB of memory
It may take about four to five hours to migrate a 175 GB rosemeta database.
To run the script, use the command below. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier, and <ALATION_PASSWORD> with the Alation password you created earlier.
python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/migrate_postgres_to_rds.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --remove_indexes true --target_database rosemeta
Successful output will look like this:
initialized RateSampler, sample 100% of traces
initialized RateSampler, sample 100% of traces
Connecting to DogStatsd(udp://localhost:8125)
{"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "Reloading conf. current time: 1702581142.2829413"}, "header": {"timestamp": "2023-12-14T19:12:22.282994", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
{"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "alation_conf is initialized. current time: 1702581142.2829413"}, "header": {"timestamp": "2023-12-14T19:12:22.467949", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
Check pg_hba entry for replication...
Check pg_hba entry for replication...Done
Check listen_addresses...
Check listen_addresses...Done
Check if migration_rosemeta already created
Drop file_fdw extension
Start export postgres schema only...
Create indexes backup file ...Done
Removed indexes from schema file ...Done
Start export postgres schema only...Done
Check if alation exists
Create new alation user in RDS...
Prepare alation user in RDS...Done
Check if alation_db_user exists
Create new alation_db_user user in RDS...
Prepare alation user in RDS...Done
Check if alation_db exists
Create new alation_db user in RDS...
Prepare alation user in RDS...Done
Check if datadog exists
Create new datadog user in RDS...
Prepare alation user in RDS...Done
Check if rosemeta exists
No rosemeta exists
Create new alation user...
Build existing extension set
Create extensions intarray,hstore,amcheck,pglogical,pg_visibility if not exists ...
Create extensions intarray,hstore,amcheck,pglogical,pg_visibility ...Done
Import /tmp/rosemeta_schema_only.dump ...
Import /tmp/rosemeta_schema_only.dump ...Done
List target tables own by public and alation
Check if publication migration_rosemeta is created
migration_rosemeta is created
Validate publication table by table
All tables are included in public
Check if subscription migration_rosemeta already created
## create subscription migration_rosemeta connection 'host=<RDS IP> port=5432 dbname=rosemeta user=alation' publication migration_rosemeta
Stopping AgentWriter thread
Check on the progress of migrating the rosemeta database by running get_subscription_status.pyc. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance and <ALATION_PASSWORD> with the Alation password you created earlier.
python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/get_subscription_status.pyc --rds_endpoint <RDS_ENDPOINT> --rds_port 5432 --new_alation_password <ALATION_PASSWORD> --target_database rosemeta
While the process is still running, you’ll see the following in the output:
## res: NEED_TO_WAIT
When the process is finished, you’ll see the following in the output:
## res: STOP_WAITING
When get_subscription_status.pyc shows ## res: STOP_WAITING, it automatically starts another script, create_indexes.py. This script creates indexes for all the tables in the rosemeta database. It runs in the background and must finish before you can move on to Step 6: Switch Alation to Use the RDS Instance below. You can check the status of this script by tailing the log file (an optional index-count check is also sketched at the end of this step):
tail -f /tmp/create_indexes.log
When the process is done, you’ll see this in the log:
Created 2090/2091 indexes.
Created 2091/2091 indexes.
Creation of indexes /tmp/rosemeta_postdata_only.dump ...Done
Removed indexes dump file /tmp/rosemeta_postdata_only.dump since indexes have been applied.
Note
Items 8–10 below apply to the lineage database. You can follow these steps in parallel to items 5–7.
To migrate the lineage database to your new RDS instance, run the migrate_postgres_to_rds.pyc script as shown below. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier, and <ALATION_PASSWORD> with the Alation password you created earlier.
python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/migrate_postgres_to_rds.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --remove_indexes true --target_database lineage
This process runs asynchronously. Successful output will look like this:
initialized RateSampler, sample 100% of traces
initialized RateSampler, sample 100% of traces
Connecting to DogStatsd(udp://localhost:8125)
{"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "Reloading conf. current time: 1702581232.9394724"}, "header": {"timestamp": "2023-12-14T19:13:52.939548", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
{"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "alation_conf is initialized. current time: 1702581232.9394724"}, "header": {"timestamp": "2023-12-14T19:13:53.135925", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
Check pg_hba entry for replication...
Check pg_hba entry for replication...Done
Check listen_addresses...
Check listen_addresses...Done
Check if migration_lineage already created
Drop file_fdw extension
Start export postgres schema only...
Create indexes backup file ...Done
Removed indexes from schema file ...Done
Start export postgres schema only...Done
Check if alation exists
Reset alation user password in RDS...
Prepare alation user in RDS...Done
Check if alation_db_user exists
Reset alation_db_user user password in RDS...
Prepare alation user in RDS...Done
Check if alation_db exists
Reset alation_db user password in RDS...
Prepare alation user in RDS...Done
Check if datadog exists
Reset datadog user password in RDS...
Prepare alation user in RDS...Done
Check if lineage exists
No lineage exists
Create new alation user...
Build existing extension set
Create extensions pg_visibility,intarray,hstore,pglogical,amcheck if not exists ...
Create extensions pg_visibility,intarray,hstore,pglogical,amcheck ...Done
Import /tmp/lineage_schema_only.dump ...
Import /tmp/lineage_schema_only.dump ...Done
List target tables own by public and alation
Check if publication migration_lineage is created
migration_lineage is created
Validate publication table by table
All tables are included in public
Check if subscription migration_lineage already created
## create subscription migration_lineage connection 'host=<RDS IP> port=5432 dbname=lineage user=alation' publication migration_lineage
Stopping AgentWriter thread
Check on the progress of migrating the lineage database by running get_subscription_status.pyc. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance and <ALATION_PASSWORD> with the Alation password you created earlier.
python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/get_subscription_status.pyc --rds_endpoint <RDS_ENDPOINT> --rds_port 5432 --new_alation_password <ALATION_PASSWORD> --target_database lineage
While the script is still running, you’ll see the following in the output:
## res: NEED_TO_WAIT
When the script is finished, you’ll see the following in the output:
## res: STOP_WAITING
When get_subscription_status.pyc shows ## res: STOP_WAITING, it automatically starts another script, create_indexes.py. This script creates indexes for all the tables in the lineage database. It runs in the background and must finish before you can move on to Step 6: Switch Alation to Use the RDS Instance below. You can check the status of this script by tailing the log file:
tail -f /tmp/create_indexes.log
When the process is done, you’ll see this in the log:
Created 2090/2091 indexes.
Created 2091/2091 indexes.
Creation of indexes /tmp/rosemeta_postdata_only.dump ...Done
Removed indexes dump file /tmp/rosemeta_postdata_only.dump since indexes have been applied.
Wait until the create_indexes.py script has finished indexing the tables for both rosemeta (item 8 above) and lineage (item 10 above). Then move on to the next step.
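If you want an additional sanity check on the index creation beyond /tmp/create_indexes.log, you can compare index counts between the internal databases and their RDS copies. The queries below are an optional illustration, not part of the official procedure:
# Indexes in the internal rosemeta database
psql -d rosemeta -U alation -c "SELECT count(*) FROM pg_indexes WHERE schemaname = 'public';"
# Indexes in the rosemeta copy on RDS (repeat both queries with -d lineage for the lineage database)
PGPASSWORD=<ALATION_PASSWORD> psql -h <RDS_ENDPOINT> -U alation -d rosemeta -c "SELECT count(*) FROM pg_indexes WHERE schemaname = 'public';"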
Step 6: Switch Alation to Use the RDS Instance¶
Important
This step requires about an hour of downtime.
If this process fails for any reason, you can reset the RDS instance using the steps under Reset the RDS Instance. Then try the migration again.
Now that all the data has been migrated to the RDS instance, you need to tell Alation to use the RDS instance.
Important
The first three steps require about one hour of downtime.
When the create_indexes.py script has finished indexing the tables for both rosemeta and lineage, you are ready to direct Alation to start using the copies that are now on the RDS instance.
Still as the alation user on the Alation server, stop Alation from running:
alation_supervisor stop celery:* ingestion customer-portal-agent java:* web:* lineage logical-metadata:*
Important
The next step requires downtime.
Run the post_postgres_migration.pyc script to tell Alation to start using the rosemeta database on the RDS instance. Replace the following placeholders:
<RDS_ENDPOINT> with the endpoint of your RDS instance
<ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier
<ALATION_PASSWORD> with the Alation password you created earlier
<NUM_CORES> with the number of CPUs you identified earlier
python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/post_postgres_migration.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --num_of_threads <NUM_CORES> --target_database rosemeta
The output will be very lengthy. If it’s successful, you’ll see this at the end of the output:
Switching to RDS <YOUR RDS ENDPOINT>...Done
Important
The next step requires downtime.
Run the post_postgres_migration.pyc script to tell Alation to start using the lineage database on the RDS instance. Replace the following placeholders:
<RDS_ENDPOINT> with the endpoint of your RDS instance
<ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier
<ALATION_PASSWORD> with the Alation password you created earlier
<NUM_CORES> with the number of CPUs you identified earlier
python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/post_postgres_migration.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --num_of_threads <NUM_CORES> --target_database lineage
If it’s successful, you’ll see this at the end of the output:
Switching to RDS <YOUR RDS ENDPOINT>...Done
Start Alation back up:
alation_supervisor start celery:* ingestion customer-portal-agent java:* web:* lineage logical-metadata:*
Alation should now be using the external RDS instance.
Stop the internal Postgres:
sudo -u postgres /usr/pgsql-13/bin/pg_ctl -D /data1/pgsql/13/data stop -w
Confirm that Alation is using the RDS instance:
alation_conf pgsql.config.host
alation_conf lineage-service.pgsql.config.host
Both pgsql.config.host and lineage-service.pgsql.config.host should show your RDS endpoint.
The process is now complete. You should be able to log into Alation and use it as usual.
After you’ve verified that the migration was successful and Alation is using the RDS instance without problems, you can now drop the data from the internal rosemeta and lineage databases. As a precaution, we recommend keeping the latest backups from before the migration.
Troubleshoot the Migration¶
If you encounter problems with migrating the Alation databases to an AWS RDS instance, please contact Alation Support.
Reset the RDS Instance¶
If you run into trouble during Step 5: Migrate the Data to the RDS Instance, you can reset the RDS instance and try again. To reset the RDS instance:
If you’re not on the Alation server already:
Use SSH to connect to the Alation server.
Enter the Alation shell using the following command:
sudo /etc/init.d/alation shell
Connect to the rosemeta database on the RDS instance:
export PGPASSWORD=<ALATION_PASSWORD>; psql -h <RDS endpoint> -U alation -d rosemeta
Then drop the migration subscription (named migration_rosemeta, as shown in the migration output earlier):
rosemeta=> drop subscription migration_rosemeta;
Connect to the lineage database on the RDS instance:
export PGPASSWORD=<ALATION_PASSWORD>; psql -h <RDS endpoint> -U alation -d lineage
Then drop the migration subscription (named migration_lineage):
lineage=> drop subscription migration_lineage;
Connect to the database on the RDS instance as the postgres user:
export PGPASSWORD=<ALATION_RDS_ADMIN_PASSWORD>; psql -h <RDS_ENDPOINT> -U postgres
Change the owner of the rosemeta database to the postgres user, then drop the database:
postgres=> alter database rosemeta owner to postgres;
postgres=> drop database rosemeta;
Change the owner of the lineage database to the postgres user, then drop the database:
postgres=> alter database lineage owner to postgres;
postgres=> drop database lineage;
If rosemeta or lineage is still being used, stop Alation and terminate the sessions:
alation_action stop_alation
postgres=> select pg_terminate_backend(pid) from pg_stat_activity where datname='rosemeta';
postgres=> select pg_terminate_backend(pid) from pg_stat_activity where datname='lineage';
Roll back the changes to alation_conf:
alation_conf pgsql.config.remote -s False
alation_conf pgsql.config.host -s /tmp
alation_conf pgsql.config.port -s 5432
alation_conf pgsql.config.password -c
alation_conf lineage-service.pgsql.config.host -s /tmp
alation_conf lineage-service.pgsql.password -c
You should now be ready to retry Step 5: Migrate the Data to the RDS Instance.
Maintain Alation with the Database on an AWS RDS Instance¶
Moving the database to an AWS RDS instance does not change how you upgrade, back up, or restore Alation: the commands remain the same. However, backup and restore behave somewhat differently behind the scenes, as described below.
Upgrade¶
There are no changes to how you upgrade Alation with the database(s) on an AWS RDS instance.
Back Up¶
After moving the database to AWS RDS, it will no longer be included directly in the Alation backup tar file. Instead, the Alation backup process creates an RDS snapshot in AWS using the existing RDS functionality. The snapshot is mapped to the Alation backup so it can be restored later along with Alation, if needed.
You can also use the Amazon RDS backup functionality separately from Alation backups.
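If you want to see which snapshots exist for the instance, the AWS CLI can list them. The identifier placeholder below is ours; substitute whatever your RDS instance is actually named (for instances created by the provided Terraform scripts, the name shown in the Terraform output earlier):
aws rds describe-db-snapshots --db-instance-identifier <RDS_INSTANCE_IDENTIFIER>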
Restore¶
When restoring Alation from a backup, the destructive_restore_all
action will automatically retrieve the AWS RDS snapshot that is mapped to the backup. The old AWS RDS instance will be stopped, and the new one will become available. In the AWS console, you will see both the old and newly restored RDS instances. The old one can be deleted if you want to save costs.
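If you choose to delete the old instance, you can do it from the AWS console or with the AWS CLI. For example (the identifier and snapshot name placeholders are ours, and taking a final snapshot first is a sensible precaution):
aws rds delete-db-instance --db-instance-identifier <OLD_RDS_INSTANCE_IDENTIFIER> --final-db-snapshot-identifier <FINAL_SNAPSHOT_NAME>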