How to Scan Postgres For Corrupted Indexes¶
Customer Managed Applies to customer-managed instances of Alation
Applies from release 2020.4
Alation provides a tool to scan the internal Alation server database (Postgres) for corrupted indexes. We recommend scanning Postgres once a week and reviewing the results.
The Postgres Scan does not impact the performance of the Alation server, CPU, or memory utilization.
In this topic:
Run the Postgres Scan¶
You run the Postgres Scan manually on demand from the Alation shell. You can also schedule it to run automatically.
To scan Postgres:
Enter the shell on the Alation host.
sudo /etc/init.d/alation shell
You can either run a default or a full scan. The default scan is usually enough to validate the Postgres health status. A default scan performs an index scan and a scan for frozen rows.
Note
The default scan typically takes approximately 54 minutes to process 576 GB of data, whereas the full scan requires about 1 hour and 17 minutes for the same amount of data.
To run a default Postgres Scan:
# change user to alation: sudo su alation # scan postgres alation_action scan_postgres
If Postgres is healthy and there are no corrupted indexed, the scan will complete without any “ERROR” messages. If the scan finds any corrupted indexes, it will issue errors to the console:
You can enable a full Postgres Scan using the alation_conf command. In addition to the index and frozen rows scans, the full scan checks invariants that span parent/child index relationships and the number of rows with the index scan disabled. The full scan is more time-consuming when compared to the default scan. It is recommended to run a full scan if the default scan results in errors in order to get more detailed logs.
To enable the full Postgres Scan, run the command given below from the Alation shell. No restart is required.
alation_conf alation.postgresql.full_scan -s True
When the full scan is enabled, use the same command
alation_action scan_postgres
to perform a full scan.# change user to alation sudo su alation # scan postgres alation_action scan_postgres
Details of the scan can be found in the scan-postgres.log file, with the file’s location varying by version:
Versions 2021.2 and newer: /opt/alation/site/logs (path inside the Alation shell)
Versions before 2021.2: /var/log (path inside the Alation shell)
Important
If the Postgres Scan results in errors, contact Alation Support to resolve the Postgres issues.
Schedule the Postgres Scan¶
You can schedule the Postgres Scan using the alation_conf command and the parameters described below. Depending on the current value of the parameter alation.postgresql.full_scan
(True
or False
), you schedule a default or full scan. For help with alation_conf, see Using alation_conf.
Changing any of the scheduling parameters require a restart of the celery-beat
component.
alation.postgresql.auto_scan
—Use to schedule the Postgres Scan.Default value:
False
Allowed values:
True
orFalse
. Set toTrue
to enable the schedule.
alation.postgresql.auto_scan.scheduled_hour
—The hour value to start the scheduled scan.Default value:
4
Allowed values:
0
to23
alation.postgresql.auto_scan.scheduled_minute
—The minutes value to start the automated scan.Default value:
0
Allowed values:
0
to59
alation.postgresql.auto_scan.scheduled_day
—The day value to start the automated scan.Default value:
sun,mon,tue,wed,thu,fri,sat
Allowed values: Single or multiple weekdays names
To restart celery-beat
, from the Alation shell, run:
alation_supervisor restart celery:celery-beat
For example, to schedule the Postgres Scan to run at 12:30 am every Saturday, do:
alation_conf alation.postgresql.scheduled_hour -s 0 alation_conf alation.postgresql.scheduled_minute -s 30 alation_conf alation.postgresql.scheduled_day -s sat alation_conf alation.postgresql.auto_scan -s True alation_supervisor restart celery:celery-beat
Starting in 2022.1, you can get email alerts when the Postgres Scan finds problems. To activate the alerts, see Enabling Admin Alerts. To add or remove the Postgres Scan from the list of alerts, see Configuring Which Alerts to Get.
Note
In versions older than 2022.1, the Postgres Scan doesn’t issue any email alerts for Alation admins. The admin has to either monitor the console output or review the logs to evaluate the state of Postgres.
Advanced Configuration for the Postgres Scan¶
Applies from version 2024.3.1
You can modify the Postgres Scan configuration to scan more databases, adjust performance of the tool, or get more information in the output.
This advanced configuration is available with the Postgres Scan via pg_amcheck
that is enabled by default from version 2024.3.1. The Postgres Scan via pg_amcheck
is also known as Postgres Scan V2. It’s not recommended to disable V2, although such a capability exists. V2 can be enabled or disabled using the alation_conf parameter alation.postgresql.scan_v2.enabled
(True
by default).
The advanced configuration options include:
Peformance parameters
Verbose logging
Table scanning
Scanning of multiple internal databases
The configuration can be done on the backend of the Alation server using alation_conf. No restart is required after changing any of the parameters. For help with alation_conf, see Using alation_conf.
The following configuration options are available:
alation.postgresql.scan_v2.databases
—Defines the databases to be scanned.Default value:
rosemeta
Allowed values: Comma-separated list of internal databases (
rosemeta
,alation_analytics_v2
,lineage
)
alation.postgresql.scan_v2.scan_type
—Determines whether to only scan indexes or both tables and indexes.Default value:
index
Allowed values:
index
,relation
. Setting the parameter torelation
will scan both tables and indexes. Keeping the default valueindex
is enough to identify corruption. If corruption is detected at theindex
level, you can change the value torelation
to check for corruption in both tables and indexes.
alation.postgresql.scan_v2.num_of_threads
—Specifies the number of parallel connections to use for scanning tables.Default value:
2
Allowed values: Any positive integer. The maximum recommended value is the number of CPUs on the instance. Increasing this value will also increase the I/O requirements.
alation.postgresql.scan_v2.io_nice
—Sets the I/O priority of the Scan.Default value:
2
Allowed values:
1
-3
. Setting the value to1
may increase I/O spikes, while setting it to3
may reduce them.
alation.postgresql.scan_v2.verbose
—Enables verbose logging providing more detailed information beyond scanning progress, such as which table or index is currently being scanned.Default value:
False
Allowed values:
True
,False
Example of standard logging:
(env) PROD [alation@ip-10-13-40-230 /]$ alation_action scan_postgres <class 'ddtrace.internal.module.ModuleWatchdog'> installed initialized RateSampler, sample 100.0% of traces initialized DatadogSampler(agent_rates={}, limiter=RateLimiter(rate_limit=100, tokens=100, last_update_ns=2672701784317, effective_rate=1.0), rules=[]) initialized RateSampler, sample 100.0% of traces initialized trace processor TraceTagsProcessor() initialized trace processor TraceSamplingProcessor(_compute_stats_enabled=False) initialized processor TopLevelSpanProcessor() initialized processor SpanAggregator(_partial_flush_enabled=True, _partial_flush_min_spans=500, _trace_processors=[TraceTagsProcessor(), TraceSamplingProcessor(_compute_stats_enabled=False)], _writer=AgentWriter(status=<ServiceStatus.STOPPED: 'stopped'>, _interval=1.0)) {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "Reloading conf. current time: 1725958356.1246192"}, "header": {"timestamp": "2024-09-10T08:52:36.124678", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""} {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "alation_conf is initialized. current time: 1725958356.1246192"}, "header": {"timestamp": "2024-09-10T08:52:36.439600", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""} 2024-09-10 08:52:36,463 - INFO - Starting Checking rosemeta Database. 2024-09-10 08:52:36,501 - INFO - NOTICE: extension "amcheck" already exists, skipping 2024-09-10 08:52:36,540 - INFO - 0/3640 relations (0%), 0/4697 pages (0%) 2024-09-10 08:52:37,000 - INFO - 3060/3640 relations (84%), 4117/4697 pages (87%) 2024-09-10 08:52:37,076 - INFO - 3640/3640 relations (100%), 4697/4697 pages (100%) 2024-09-10 08:52:37,081 - INFO - Postgres bt_index_check scan ran successfully. No errors found. 2024-09-10 08:52:37,108 - INFO - Postgres check visibility map is not corrupted 2024-09-10 08:52:37,113 - INFO - scan_postgres is successfully completed. Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.
Example of verbose logging:
(env) PROD [alation@ip-10-13-40-230 /]$ alation_conf alation.postgresql.scan_v2.verbose -s True alation.postgresql.scan_v2.verbose = True (env) PROD [alation@ip-10-13-40-230 /]$ alation_action scan_postgres <class 'ddtrace.internal.module.ModuleWatchdog'> installed initialized RateSampler, sample 100.0% of traces initialized DatadogSampler(agent_rates={}, limiter=RateLimiter(rate_limit=100, tokens=100, last_update_ns=3307500256024, effective_rate=1.0), rules=[]) initialized RateSampler, sample 100.0% of traces initialized trace processor TraceTagsProcessor() initialized trace processor TraceSamplingProcessor(_compute_stats_enabled=False) initialized processor TopLevelSpanProcessor() initialized processor SpanAggregator(_partial_flush_enabled=True, _partial_flush_min_spans=500, _trace_processors=[TraceTagsProcessor(), TraceSamplingProcessor(_compute_stats_enabled=False)], _writer=AgentWriter(status=<ServiceStatus.STOPPED: 'stopped'>, _interval=1.0)) {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "Reloading conf. current time: 1725958990.9148085"}, "header": {"timestamp": "2024-09-10T09:03:10.914866", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""} {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "alation_conf is initialized. current time: 1725958990.9148085"}, "header": {"timestamp": "2024-09-10T09:03:11.235590", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""} 2024-09-10 09:03:11,259 - INFO - Starting Checking rosemeta Database. 2024-09-10 09:03:11,293 - INFO - pg_amcheck: including database "rosemeta" 2024-09-10 09:03:11,295 - INFO - NOTICE: extension "amcheck" already exists, skipping 2024-09-10 09:03:11,296 - INFO - pg_amcheck: in database "rosemeta": using amcheck version "1.3" in schema "public" 2024-09-10 09:03:11,329 - INFO - 0/3640 relations (0%), 0/4697 pages (0%) 2024-09-10 09:03:11,329 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_depend_depender_index" 2024-09-10 09:03:11,332 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_attribute_relid_attnam_index" 2024-09-10 09:03:11,337 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_attribute_relid_attnum_index" 2024-09-10 09:03:11,346 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_depend_reference_index" 2024-09-10 09:03:11,349 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_class_relname_nsp_index" .... 2024-09-10 09:03:11,991 - INFO - 3640/3640 relations (100%), 4697/4697 pages (100%) 2024-09-10 09:03:11,997 - INFO - Postgres bt_index_check scan ran successfully. No errors found. 2024-09-10 09:03:12,027 - INFO - Postgres check visibility map is not corrupted 2024-09-10 09:03:12,031 - INFO - scan_postgres is successfully completed. Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.