#postgresql #backups #datacorruption

Monitoring for data corruption in PostgreSQL

The latest episode of the Postgres FM podcast gives an interesting overview of the top things to be aware of when operating PostgreSQL deployments.

All of them are important, obviously, but if I had to pick one, corruption is the worst case.

Let's say we have page checksums enabled, take regular backups, and run a test restoration as a mandatory step for each of them (see the sketch after the list):

  1. run a full amcheck,
  2. run pg_checksums,
  3. read the whole database via pg_dumpall > /dev/null,
  4. run REINDEX SCHEMA,
  5. run some application-layer checks (e.g. the delta in the number of records is roughly the same as on prod),
  6. run app test environments on such restored databases (not always possible due to client data protection).
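A minimal sketch of what steps 1 to 4 could look like against a restored cluster. Connection settings, the data directory path, and the final sanity-check query are assumptions for illustration, not a prescribed setup:

```bash
#!/usr/bin/env bash
# Run integrity checks against a freshly restored cluster (hypothetical host/port/paths).
set -euo pipefail

PGDATA=/var/lib/postgresql/16/restored   # assumed restore location
export PGHOST=localhost PGPORT=5433      # assumed restored instance

# 1. Full amcheck across all databases (requires the amcheck extension, PG 14+ CLI).
pg_amcheck --all --heapallindexed --progress

# 2. Verify page checksums; the cluster must be stopped for pg_checksums.
pg_ctl -D "$PGDATA" stop
pg_checksums --check -D "$PGDATA"
pg_ctl -D "$PGDATA" start

# 3. Read every row of every database.
pg_dumpall > /dev/null

# 4. Rebuild indexes to surface index corruption (REINDEX DATABASE covers all schemas).
for db in $(psql -At -c "SELECT datname FROM pg_database WHERE datallowconn"); do
    psql -d "$db" -c "REINDEX DATABASE \"$db\";"
done

# 5. Application-layer sanity check; table name and expected delta are placeholders.
psql -At -c "SELECT count(*) FROM important_table;"
```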

On top of that, we alert on log entries with error codes XX001 (data_corrupted) and XX002 (index_corrupted), among others of course.
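A minimal log-scan sketch of that alerting idea, assuming log_line_prefix includes %e so the SQLSTATE appears in each line (csvlog/jsonlog always record it); the log path and the alert command are placeholders:

```bash
#!/usr/bin/env bash
# Alert if corruption-class SQLSTATEs (XX001, XX002) show up in the server log.
LOGFILE=/var/log/postgresql/postgresql.log   # assumed log location

if grep -Eq 'XX00[12]' "$LOGFILE"; then
    grep -E 'XX00[12]' "$LOGFILE" | tail -n 20 \
      | mail -s "PostgreSQL corruption-class errors detected" oncall@example.com
fi
```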

Can we still do better? Maybe run pg_dumpall > /dev/null on production periodically, if load allows?
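For instance, as a hypothetical off-peak cron entry (host, user, schedule, and log path are assumptions, and the read load still lands on the server itself):

```bash
# m h dom mon dow  command
0 3 * * 0  pg_dumpall --host=prod-db --username=monitor > /dev/null 2>> /var/log/pg_full_read.log
```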

Also: Real fun doesn't start until you actually DO detect something is wrong...