How to Scan Postgres For Corrupted Indexes

Customer Managed Applies to customer-managed instances of Alation

Applies from release 2020.4

Alation provides a tool to scan the internal Alation server database (Postgres) for corrupted indexes. We recommend scanning Postgres once a week and reviewing the results.

The Postgres Scan does not impact the performance of the Alation server, CPU, or memory utilization.

In this topic:

Run the Postgres Scan

You run the Postgres Scan manually on demand from the Alation shell. You can also schedule it to run automatically.

To scan Postgres:

  1. Enter the shell on the Alation host.

    sudo /etc/init.d/alation shell
    
  2. You can either run a default or a full scan. The default scan is usually enough to validate the Postgres health status. A default scan performs an index scan and a scan for frozen rows.

    Note

    The default scan typically takes approximately 54 minutes to process 576 GB of data, whereas the full scan requires about 1 hour and 17 minutes for the same amount of data.

    To run a default Postgres Scan:

    # change user to alation:
    
    sudo su alation
    
    # scan postgres
    
    alation_action scan_postgres
    

    If Postgres is healthy and there are no corrupted indexed, the scan will complete without any “ERROR” messages. If the scan finds any corrupted indexes, it will issue errors to the console:

    ../../_images/PostgresScan_01.png
  3. You can enable a full Postgres Scan using the alation_conf command. In addition to the index and frozen rows scans, the full scan checks invariants that span parent/child index relationships and the number of rows with the index scan disabled. The full scan is more time-consuming when compared to the default scan. It is recommended to run a full scan if the default scan results in errors in order to get more detailed logs.

    To enable the full Postgres Scan, run the command given below from the Alation shell. No restart is required.

    alation_conf alation.postgresql.full_scan -s True
    

    When the full scan is enabled, use the same command alation_action scan_postgres to perform a full scan.

    # change user to alation
    
    sudo su alation
    
    # scan postgres
    
    alation_action scan_postgres
    

Details of the scan can be found in the scan-postgres.log file, with the file’s location varying by version:

  • Versions 2021.2 and newer: /opt/alation/site/logs (path inside the Alation shell)

  • Versions before 2021.2: /var/log (path inside the Alation shell)

Important

If the Postgres Scan results in errors, contact Alation Support to resolve the Postgres issues.

Schedule the Postgres Scan

You can schedule the Postgres Scan using the alation_conf command and the parameters described below. Depending on the current value of the parameter alation.postgresql.full_scan (True or False), you schedule a default or full scan. For help with alation_conf, see Using alation_conf.

Changing any of the scheduling parameters require a restart of the celery-beat component.

  • alation.postgresql.auto_scan—Use to schedule the Postgres Scan.

    • Default value: False

    • Allowed values: True or False. Set to True to enable the schedule.

  • alation.postgresql.auto_scan.scheduled_hour—The hour value to start the scheduled scan.

    • Default value: 4

    • Allowed values: 0 to 23

  • alation.postgresql.auto_scan.scheduled_minute—The minutes value to start the automated scan.

    • Default value: 0

    • Allowed values: 0 to 59

  • alation.postgresql.auto_scan.scheduled_day—The day value to start the automated scan.

    • Default value: sun,mon,tue,wed,thu,fri,sat

    • Allowed values: Single or multiple weekdays names

To restart celery-beat, from the Alation shell, run:

alation_supervisor restart celery:celery-beat

For example, to schedule the Postgres Scan to run at 12:30 am every Saturday, do:

alation_conf alation.postgresql.scheduled_hour -s 0
alation_conf alation.postgresql.scheduled_minute -s 30
alation_conf alation.postgresql.scheduled_day -s sat
alation_conf alation.postgresql.auto_scan -s True
alation_supervisor restart celery:celery-beat

Starting in 2022.1, you can get email alerts when the Postgres Scan finds problems. To activate the alerts, see Enabling Admin Alerts. To add or remove the Postgres Scan from the list of alerts, see Configuring Which Alerts to Get.

Note

In versions older than 2022.1, the Postgres Scan doesn’t issue any email alerts for Alation admins. The admin has to either monitor the console output or review the logs to evaluate the state of Postgres.

Advanced Configuration for the Postgres Scan

Applies from version 2024.3.1

You can modify the Postgres Scan configuration to scan more databases, adjust performance of the tool, or get more information in the output.

This advanced configuration is available with the Postgres Scan via pg_amcheck that is enabled by default from version 2024.3.1. The Postgres Scan via pg_amcheck is also known as Postgres Scan V2. It’s not recommended to disable V2, although such a capability exists. V2 can be enabled or disabled using the alation_conf parameter alation.postgresql.scan_v2.enabled (True by default).

The advanced configuration options include:

  • Peformance parameters

  • Verbose logging

  • Table scanning

  • Scanning of multiple internal databases

The configuration can be done on the backend of the Alation server using alation_conf. No restart is required after changing any of the parameters. For help with alation_conf, see Using alation_conf.

The following configuration options are available:

  • alation.postgresql.scan_v2.databases—Defines the databases to be scanned.

    • Default value: rosemeta

    • Allowed values: Comma-separated list of internal databases (rosemeta, alation_analytics_v2, lineage)

  • alation.postgresql.scan_v2.scan_type—Determines whether to only scan indexes or both tables and indexes.

    • Default value: index

    • Allowed values: index, relation. Setting the parameter to relation will scan both tables and indexes. Keeping the default value index is enough to identify corruption. If corruption is detected at the index level, you can change the value to relation to check for corruption in both tables and indexes.

  • alation.postgresql.scan_v2.num_of_threads—Specifies the number of parallel connections to use for scanning tables.

    • Default value: 2

    • Allowed values: Any positive integer. The maximum recommended value is the number of CPUs on the instance. Increasing this value will also increase the I/O requirements.

  • alation.postgresql.scan_v2.io_nice—Sets the I/O priority of the Scan.

    • Default value: 2

    • Allowed values: 1 - 3. Setting the value to 1 may increase I/O spikes, while setting it to 3 may reduce them.

  • alation.postgresql.scan_v2.verbose—Enables verbose logging providing more detailed information beyond scanning progress, such as which table or index is currently being scanned.

    • Default value: False

    • Allowed values: True, False

    Example of standard logging:

    (env) PROD [alation@ip-10-13-40-230 /]$ alation_action scan_postgres
    <class 'ddtrace.internal.module.ModuleWatchdog'> installed
    initialized RateSampler, sample 100.0% of traces
    initialized DatadogSampler(agent_rates={}, limiter=RateLimiter(rate_limit=100, tokens=100,
        last_update_ns=2672701784317, effective_rate=1.0), rules=[])
    initialized RateSampler, sample 100.0% of traces
    initialized trace processor TraceTagsProcessor()
    initialized trace processor TraceSamplingProcessor(_compute_stats_enabled=False)
    initialized processor TopLevelSpanProcessor()
    initialized processor SpanAggregator(_partial_flush_enabled=True, _partial_flush_min_spans=500,
        _trace_processors=[TraceTagsProcessor(), TraceSamplingProcessor(_compute_stats_enabled=False)],
        _writer=AgentWriter(status=<ServiceStatus.STOPPED: 'stopped'>, _interval=1.0))
    {"data": {"module": "alation_conf", "version": null, "requestid": null,
        "msg": "Reloading conf. current time: 1725958356.1246192"},
        "header": {"timestamp": "2024-09-10T08:52:36.124678", "appname": null, "instanceurl": null,
        "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
    {"data": {"module": "alation_conf", "version": null, "requestid": null,
        "msg": "alation_conf is initialized. current time: 1725958356.1246192"},
        "header": {"timestamp": "2024-09-10T08:52:36.439600", "appname": null, "instanceurl": null, "tenantid": null,
        "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
    2024-09-10 08:52:36,463 - INFO - Starting Checking rosemeta Database.
    2024-09-10 08:52:36,501 - INFO - NOTICE:  extension "amcheck" already exists, skipping
    2024-09-10 08:52:36,540 - INFO - 0/3640 relations (0%),    0/4697 pages (0%)
    2024-09-10 08:52:37,000 - INFO - 3060/3640 relations (84%), 4117/4697 pages (87%)
    2024-09-10 08:52:37,076 - INFO - 3640/3640 relations (100%), 4697/4697 pages (100%)
    2024-09-10 08:52:37,081 - INFO - Postgres bt_index_check scan ran successfully. No errors found.
    2024-09-10 08:52:37,108 - INFO - Postgres check visibility map is not corrupted
    2024-09-10 08:52:37,113 - INFO - scan_postgres is successfully completed.
    Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.
    

    Example of verbose logging:

    (env) PROD [alation@ip-10-13-40-230 /]$ alation_conf alation.postgresql.scan_v2.verbose -s True
    alation.postgresql.scan_v2.verbose = True
    (env) PROD [alation@ip-10-13-40-230 /]$ alation_action scan_postgres
    <class 'ddtrace.internal.module.ModuleWatchdog'> installed
    initialized RateSampler, sample 100.0% of traces
    initialized DatadogSampler(agent_rates={}, limiter=RateLimiter(rate_limit=100, tokens=100,
        last_update_ns=3307500256024, effective_rate=1.0), rules=[])
    initialized RateSampler, sample 100.0% of traces
    initialized trace processor TraceTagsProcessor()
    initialized trace processor TraceSamplingProcessor(_compute_stats_enabled=False)
    initialized processor TopLevelSpanProcessor()
    initialized processor SpanAggregator(_partial_flush_enabled=True, _partial_flush_min_spans=500, _trace_processors=[TraceTagsProcessor(),
        TraceSamplingProcessor(_compute_stats_enabled=False)], _writer=AgentWriter(status=<ServiceStatus.STOPPED: 'stopped'>, _interval=1.0))
    {"data": {"module": "alation_conf", "version": null, "requestid": null,
        "msg": "Reloading conf. current time: 1725958990.9148085"}, "header": {"timestamp": "2024-09-10T09:03:10.914866", "appname": null,
        "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
    {"data": {"module": "alation_conf", "version": null, "requestid": null,
        "msg": "alation_conf is initialized. current time: 1725958990.9148085"}, "header": {"timestamp": "2024-09-10T09:03:11.235590",
        "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0",
        "dd.span_id": "0", "dd.env": ""}
    2024-09-10 09:03:11,259 - INFO - Starting Checking rosemeta Database.
    2024-09-10 09:03:11,293 - INFO - pg_amcheck: including database "rosemeta"
    2024-09-10 09:03:11,295 - INFO - NOTICE:  extension "amcheck" already exists, skipping
    2024-09-10 09:03:11,296 - INFO - pg_amcheck: in database "rosemeta": using amcheck version "1.3" in schema "public"
    2024-09-10 09:03:11,329 - INFO - 0/3640 relations (0%),    0/4697 pages (0%)
    2024-09-10 09:03:11,329 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_depend_depender_index"
    2024-09-10 09:03:11,332 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_attribute_relid_attnam_index"
    2024-09-10 09:03:11,337 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_attribute_relid_attnum_index"
    2024-09-10 09:03:11,346 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_depend_reference_index"
    2024-09-10 09:03:11,349 - INFO - pg_amcheck: checking btree index "rosemeta.pg_catalog.pg_class_relname_nsp_index"
    ....
    2024-09-10 09:03:11,991 - INFO - 3640/3640 relations (100%), 4697/4697 pages (100%)
    2024-09-10 09:03:11,997 - INFO - Postgres bt_index_check scan ran successfully. No errors found.
    2024-09-10 09:03:12,027 - INFO - Postgres check visibility map is not corrupted
    2024-09-10 09:03:12,031 - INFO - scan_postgres is successfully completed.
    Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.