Re: Trying to handle db corruption 9.6
On Tue, May 21, 2019 at 04:03:52PM +, Greg Clough wrote:
>> My restore command copies the WALs from the archive dir on the primary to
>> an archive dir on the secondary (different from pg_xlog on the secondary).
>
> I think that your restore command puts them back into the archive, and then
> uncompresses them into pg_xlog, which is what %p represents.
>
>> Should I run it manually and see whether the archives are copied to the
>> archive dir on the secondary, or should I just copy all of them to the
>> pg_xlog dir on the secondary?
>
> That would be my first test, but as Tomas mentioned, you don't have any hint
> of WAL archives being restored in the postgresql.log... so it's not even
> trying. It's not likely that archive_command is your problem at the moment.
>
>> I tried to start the secondary as a primary (I have a backup..) but I still
>> got an error (invalid checkpoint record from primary/secondary). Does it
>> mean that my backup is corrupted?
>
> I think so, but Tomas could probably confirm if all hope is lost. Also, I'm
> not sure if there is a terminology difference, but a "standby" is never
> considered a "backup". I realise it's late in the day, but even if you have
> a correctly configured standby you should also take backups with
> pg_basebackup, Barman, pgBackRest, etc.

Well, I have no idea. We still have no information about how the standby was
created, whether it was ever running fine, and so on. Considering it does not
seem to be getting data from the archive, it may be that it was created in
some strange way and never really worked. And if there really are no log
messages about the restore_command, it probably fails before the standby even
tries to execute it. So I don't know.

> Restoring backups is where I would be heading now, as things seem terribly
> broken.

Right. But my impression is there are no backups ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
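Greg's reading of the restore_command can be verified by hand. A minimal sketch of the two-step restore it performs (temp directories stand in for the ssh transfer and the real paths; the WAL file name is hypothetical, chosen only for the demonstration):

```shell
#!/bin/sh
# Simulate the two-step restore_command: fetch the compressed segment into a
# local archive dir (the %f name), then uncompress it to the path PostgreSQL
# asked for (%p). Temp dirs stand in for /var/lib/pgsql/archive and pg_xlog.
set -e
primary_archive=$(mktemp -d)   # plays the primary's archive
standby_archive=$(mktemp -d)   # plays the standby's archive
pg_xlog=$(mktemp -d)           # plays the standby's pg_xlog

f=0000000100001837000000E3      # hypothetical WAL segment name (%f)
p=$pg_xlog/$f                   # destination path (%p)

printf 'fake WAL contents' | gzip > "$primary_archive/$f"

# Step 1: copy the compressed segment (rsync over ssh in the real command).
cp "$primary_archive/$f" "$standby_archive/$f"
# Step 2: uncompress it into the place the server expects.
gunzip < "$standby_archive/$f" > "$p"

cat "$p"   # -> fake WAL contents
```

If the equivalent manual run on the real standby fails at step 1, the problem is connectivity or the missing local archive directory, not PostgreSQL.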
Re: Trying to handle db corruption 9.6
Tomas - Well, when I run the restore_command manually it works (the archive
dir exists on the secondary..). Thanks for the explanation on the system
catalogs..

Greg - My restore command copies the WALs from the archive dir on the primary
to an archive dir on the secondary (different from pg_xlog on the secondary).
Should I run it manually and see whether the archives are copied to the
archive dir on the secondary, or should I just copy all of them to the
pg_xlog dir on the secondary? I tried to start the secondary as a primary (I
have a backup..) but I still got an error (invalid checkpoint record from
primary/secondary). Does it mean that my backup is corrupted?
Re: Trying to handle db corruption 9.6
On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:
>Tomas :
>
>Well, when you say it does not work, why do you think so? Does it print
>some error, or what? Does it even get executed? It does not seem to be
>the case, judging by the log (there's no archive_command message).
>
>How was the "secondary machine" created? You said you have all the WAL
>since then - how do you know that?
>
>Well, when I start the secondary in recovery mode (the primary is down,
>auto failover is disabled..) it doesnt start recovering the archive wals
>from the primary. The logs of the secondary :
>
>receiving incremental file list
>rsync: link_stat "/var/lib/pgsql/archive/0002.history" failed: No such
>file or directory (2)
>
>sent 8 bytes  received 10 bytes  36.00 bytes/sec
>total size is 0  speedup is 0.00
>rsync error: some files/attrs were not transferred (see previous errors)
>(code 23) at main.c(1505) [receiver=3.0.6]
>sh: /var/lib/pgsql/archive/0002.history: No such file or directory
>2019-05-20 09:41:33 PDT 18558 LOG: entering standby mode
>2019-05-20 09:41:33 PDT 18558 LOG: invalid primary checkpoint record
>2019-05-20 09:41:33 PDT 18558 LOG: invalid secondary checkpoint link in
>control file
>2019-05-20 09:41:33 PDT 18558 PANIC: could not locate a valid checkpoint
>record
>2019-05-20 09:41:33 PDT 18555 LOG: startup process (PID 18558) was
>terminated by signal 6: Aborted
>2019-05-20 09:41:33 PDT 18555 LOG: aborting startup due to startup
>process failure
>2019-05-20 09:41:33 PDT 18555 LOG: database system is shut down
>2019-05-20 09:56:12 PDT 18701 LOG: database system was shut down in
>recovery at 2019-05-01 09:40:02 PDT
>
>As I said, the secondary was down for a month and I have all the archives
>of the wals in my primary. I was hoping that the secondary will use the
>restore_command to restore them :
>
>restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
>
>my archive_command on the primary was :
>
>archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'
>
>Am I missing something ?

First of all, the way you quote messages is damn confusing - there's no
clear difference between your text and the text you quote. I don't know
which mail client you're using, but I suppose it can be configured to
quote sensibly ...

Well, clearly the standby tries to fetch WAL from the archive, but the
rsync command fails for some reason. You're in the position to investigate
further, because you can run it manually - we can't. This has nothing to
do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
the standby, and it's confusing because the archive uses the same path.

>Another question: if I run initdb to initiate a new cluster and copy the
>data files of my old cluster into the new one, is there any chance that
>it will work? I mean, right now my primary is down and can't start up
>because it is missing an offset file in the pg_multixact/offsets dir.

No, because you won't have the contents of the system catalogs, which map
the data files to relations (tables, indexes) and contain information
about the structure (which columns / data types are in the data).

The data files are pretty useless on their own. It might be possible to do
some manual recovery - say, you might create the same tables in a new
schema, and then guess which data files belong to them. But there are
various caveats, e.g. due to dropped columns, etc.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
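One detail worth flagging in the quoted restore_command: its two steps are joined with `;`, so the command's overall exit status is gunzip's alone - if rsync fails, gunzip still runs against a missing or stale local file. A hedged sketch of a stricter variant (host and paths taken from the original; this is an illustration, not a drop-in fix):

```conf
# recovery.conf fragment on the standby (sketch; adjust host and paths).
# With && instead of ;, a failed step makes the whole command fail, so
# PostgreSQL sees a non-zero status instead of a half-done restore.
# Assumes /var/lib/pgsql/archive exists on the standby.
restore_command = 'test -d /var/lib/pgsql/archive && rsync -azhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f && gunzip < /var/lib/pgsql/archive/%f > %p'
```

A non-zero exit for a file that does not exist in the archive (such as the probed .history file) is normal and handled by the server; the point is only that a stale local copy should never be uncompressed into %p after a failed transfer.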
Re: Trying to handle db corruption 9.6
Tomas :

> Well, when you say it does not work, why do you think so? Does it print
> some error, or what? Does it even get executed? It does not seem to be
> the case, judging by the log (there's no archive_command message).
>
> How was the "secondary machine" created? You said you have all the WAL
> since then - how do you know that?

Well, when I start the secondary in recovery mode (the primary is down,
auto failover is disabled..) it doesn't start recovering the archived WALs
from the primary. The logs of the secondary:

receiving incremental file list
rsync: link_stat "/var/lib/pgsql/archive/0002.history" failed: No such
file or directory (2)

sent 8 bytes  received 10 bytes  36.00 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors)
(code 23) at main.c(1505) [receiver=3.0.6]
sh: /var/lib/pgsql/archive/0002.history: No such file or directory
2019-05-20 09:41:33 PDT 18558 LOG: entering standby mode
2019-05-20 09:41:33 PDT 18558 LOG: invalid primary checkpoint record
2019-05-20 09:41:33 PDT 18558 LOG: invalid secondary checkpoint link in control file
2019-05-20 09:41:33 PDT 18558 PANIC: could not locate a valid checkpoint record
2019-05-20 09:41:33 PDT 18555 LOG: startup process (PID 18558) was terminated by signal 6: Aborted
2019-05-20 09:41:33 PDT 18555 LOG: aborting startup due to startup process failure
2019-05-20 09:41:33 PDT 18555 LOG: database system is shut down
2019-05-20 09:56:12 PDT 18701 LOG: database system was shut down in recovery at 2019-05-01 09:40:02 PDT

As I said, the secondary was down for a month and I have all the archives
of the WALs on my primary. I was hoping that the secondary would use the
restore_command to restore them:

restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'

My archive_command on the primary was:

archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'

Am I missing something?
Another question: if I run initdb to initiate a new cluster and copy the
data files of my old cluster into the new one, is there any chance that it
will work? I mean, right now my primary is down and can't start up because
it is missing an offset file in the pg_multixact/offsets dir.
Re: Trying to handle db corruption 9.6
On Mon, May 20, 2019 at 08:20:33PM +0300, Mariel Cherkassky wrote:
>Hey Greg,
>Basically my backup was made after the first pg_resetxlog, so I was wrong.

Bummer.

>However, the customer had a secondary machine that wasn't synced for a
>month. I have all the WALs since the moment the secondary went out of
>sync. Once I started it I hoped that it would start recovering the WALs
>and fill the gap. However, I got an error on the secondary :
>
>2019-05-20 10:11:28 PDT 19021 LOG: entering standby mode
>2019-05-20 10:11:28 PDT 19021 LOG: invalid primary checkpoint record
>2019-05-20 10:11:28 PDT 19021 LOG: invalid secondary checkpoint link in control file
>2019-05-20 10:11:28 PDT 19021 PANIC: could not locate a valid checkpoint record
>2019-05-20 10:11:28 PDT 19018 LOG: startup process (PID 19021) was terminated by signal 6: Aborted
>2019-05-20 10:11:28 PDT 19018 LOG: aborting startup due to startup process failure
>2019-05-20 10:11:28 PDT 19018 LOG: database system is shut down.
>
>I checked my secondary archive dir and pg_xlog dir and it seems that the
>restore command doesn't work. My restore_command:
>
>restore_command = 'rsync -avzhe ssh postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
>archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup /var/lib/pgsql/archive %r'

Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be the
case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Trying to handle db corruption 9.6
On Mon, May 20, 2019 at 04:20:45PM +, Bimal wrote:
>I had run into the same issue about a year back; luckily I had a standby
>to quickly promote. But I wish there was better documentation on how to
>handle WAL log fill-up and resetting the WAL.

pg_resetxlog is not a tool to deal with "WAL fill-up". It's a last-resort
option to deal with corrupted WAL, and can easily make matters worse when
used without due consideration. That seems to be the case here,
unfortunately.

On a properly behaving system, running out of disk space for pg_xlog
results in a database shutdown. If you also get corrupted WAL, you have
bigger problems, I'm afraid.

Also, data corruption issues are one-off events, mostly unique. That makes
it rather difficult (~impossible) to write docs about recovering from
them. And it's why there are no magic tools.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
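The usual early warning for pg_xlog filling up is a growing count of segments that archiving has not kept pace with. A rough monitoring sketch (the PGDATA path and threshold are made up for the example; on 9.6 the directory is pg_xlog, renamed pg_wal in version 10):

```shell
#!/bin/sh
# Count WAL segments sitting in pg_xlog and flag a backlog before the
# disk fills. Segment names are 24 hex characters.
PGDATA=${PGDATA:-/var/lib/pgsql/9.6/data}   # hypothetical path
threshold=100                               # arbitrary for the example

count=$(ls "$PGDATA/pg_xlog" 2>/dev/null | grep -cE '^[0-9A-F]{24}$')
if [ "$count" -gt "$threshold" ]; then
    echo "WAL backlog: $count segments in pg_xlog - check archive_command"
fi
```

Catching the backlog early leaves the safe options open (fix archiving, add disk) instead of the unsafe one (pg_resetxlog).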
Re: Trying to handle db corruption 9.6
Hey Greg,
Basically my backup was made after the first pg_resetxlog, so I was wrong.
However, the customer had a secondary machine that wasn't synced for a
month. I have all the WALs since the moment the secondary went out of
sync. Once I started it I hoped that it would start recovering the WALs
and fill the gap. However, I got an error on the secondary :

2019-05-20 10:11:28 PDT 19021 LOG: entering standby mode
2019-05-20 10:11:28 PDT 19021 LOG: invalid primary checkpoint record
2019-05-20 10:11:28 PDT 19021 LOG: invalid secondary checkpoint link in control file
2019-05-20 10:11:28 PDT 19021 PANIC: could not locate a valid checkpoint record
2019-05-20 10:11:28 PDT 19018 LOG: startup process (PID 19021) was terminated by signal 6: Aborted
2019-05-20 10:11:28 PDT 19018 LOG: aborting startup due to startup process failure
2019-05-20 10:11:28 PDT 19018 LOG: database system is shut down.

I checked my secondary archive dir and pg_xlog dir and it seems that the
restore command doesn't work. My restore_command:

restore_command = 'rsync -avzhe ssh postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup /var/lib/pgsql/archive %r'
Re: Trying to handle db corruption 9.6
I had run into the same issue about a year back; luckily I had a standby
to quickly promote. But I wish there was better documentation on how to
handle WAL log fill-up and resetting the WAL.
Re: Trying to handle db corruption 9.6
A backup was made after the corruption appeared, but before I tried using
the pg_resetxlog command. Basically I just want to start the database with
the data that is available in the files (I'm OK with losing data that was
in the cache and wasn't written to disk).
My question is: how can I continue from here?
I also sent this mail to the pgadmin mailing list..

On Mon, May 20, 2019 at 18:59, Greg Clough <greg.clo...@ihsmarkit.com>
wrote:

> > Yes I understand that.. I'm trying to handle it after the backup that I
> > have taken..
>
> IMHO the best option here is to keep safe a copy, as you have already
> done, and then restore from a backup and replay whatever WAL you have.
> The database you have is terminally corrupted, and should never be
> relied upon going forward.
>
> You can try to get it running, and then extract the data with pg_dump...
> but even then you will need to manually verify it's OK, because you have
> no idea which dirty blocks from memory have been written to disk and
> which have not. Without the WAL you have no way of making it consistent,
> and if the WAL files have been destroyed then you're out of luck.
>
> If you don't have backups and archived WAL, then fixing what you've got
> may be your only option, but you should only go down that route if you
> have to. If you have to "repair", then I'd recommend engaging a
> reputable PostgreSQL consultancy to help you.
>
> Regards,
> Greg.
>
> P.S. This conversation should probably be moved to something like
> pgsql-admin
Re: Trying to handle db corruption 9.6
Yes, I understand that.. I'm trying to handle it after the backup that I
have taken..
Re: Trying to handle db corruption 9.6
Hi,

First of all, as stated in the wiki, you'll need to make a filesystem-level
copy of the database files and put them on another drive before attempting
to do anything else!

https://wiki.postgresql.org/wiki/Corruption

regards,
Flo
Trying to handle db corruption 9.6
Hey, I'm trying to handle a corruption that one of our customers is facing. His disk space was full and as a result of that he decided to run pg_resetxlog a few times(bad idea..) . When I connected to the machine I saw that the db was down. When I started the db (service postgresql start) I saw the next error in the logs : DETAIL: Could not open file "pg_multixact/offsets/": No such file or directory. The pg_multixact/offset dir contained one file (0025). The pg_multixact/members dir contains 2 files : and 0001. I tried to follow the documentation of pg_resetxlog, and run pg_resetxlog with -m 0xF0A604,0xEA50CE which are 0025*65536 and 0026*65536 in hexa. However, it didnt help and the same error appeared. So I tried to rename the file to and then the db searched for a file in members that wasnt exist. I followed the documentation and changed the multitransaction offset (-O) and the transactions id (-c ) based on the doc and then the db was started succesfully. However after it started I saw the next msg in the logs : Multixact member wraparound protections are disabled because oldest checkpointed Multixact 65536 doesnt exist. In addition, no one is able to connect to the db (we keep getting errors database doesnt exist or user doesnt exist , even for postgresql user). 
current relevant rows from the control data:

pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6692952810876880414
Database cluster state:               shut down
pg_control last modified:             Mon 20 May 2019 07:07:30 AM PDT
Latest checkpoint location:           1837/E328
Prior checkpoint location:            1837/E228
Latest checkpoint's REDO location:    1837/E328
Latest checkpoint's REDO WAL file:    0001183700E3
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3
Latest checkpoint's NextOID:          1
Latest checkpoint's NextMultiXactId:  131072
Latest checkpoint's NextMultiOffset:  52352
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   65536
Latest checkpoint's oldestMulti's DB: 0
Latest checkpoint's oldestCommitTsXid: 4604
Latest checkpoint's newestCommitTsXid: 5041

I also checked and saw that the customer has all the WALs (backed up), but without any basebackup.. Any recommendations how to handle the case?
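One way to read the "oldest checkpointed MultiXact 65536 doesn't exist" warning against this control data (a sketch; the segment arithmetic is from the on-disk SLRU layout, and the path is relative to the data directory):

```shell
# Map the checkpoint's oldestMultiXid to the pg_multixact/offsets segment
# file the server expects to find on disk.
oldest_mxid=65536               # "Latest checkpoint's oldestMultiXid" above
seg=$(( oldest_mxid / 65536 ))  # 65536 multixacts per offsets segment
segname=$(printf '%04X' "$seg")

echo "checkpoint expects pg_multixact/offsets/$segname to exist"
```

That is segment 0001, but the report says the offsets directory contained only 0025, which is consistent with the warning being emitted.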
Re: DB corruption
Akshay Ballarpure writes:
> I have a query on DB corruption. Is there any way to recover from it
> without losing data ?

You've already lost data, evidently.

> Starting postgresql service: [ OK ]
> psql: FATAL: index "pg_authid_rolname_index" contains unexpected zero page at block 0
> HINT: Please REINDEX it.

This is not good. It'd be possible to reindex that index, certainly, but the question is what other files have also been clobbered.

> psql: FATAL: "base/11564" is not a valid data directory
> DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
> HINT: You might need to initdb.

Based on the OID I'm going to guess that this is from an attempt to connect to the "postgres" database. (I'm also going to guess that you're running 8.4.x, because any later PG version would have a higher OID for "postgres".) Can you connect to any other databases? If so, do their contents seem intact?

If you're really lucky, meaning (a) the damage is confined to that DB and (b) you didn't keep any important data in it, then dropping and recreating the "postgres" DB might be enough to get you out of trouble. But pg_authid_rolname_index is a cluster-global index, not specific to the "postgres" DB, so the fact that it too seems to be damaged is not promising.

TBH your best bet, if the data in this installation is valuable and you don't have adequate backups, is to hire a professional data recovery service --- there are several companies that specialize in getting as much out of a corrupted PG installation as possible. (See https://www.postgresql.org/support/professional_support/ for some links.) You should then plan on updating to some newer PG release; 8.4.x has been out of support for years, and there are lots of known-and-unfixed bugs in it.

regards, tom lane
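For the reindex part, a minimal sketch of how a damaged shared catalog index can be rebuilt in single-user mode (the data directory path is an example; the server must be stopped first, and -P makes the backend ignore system indexes so it can start despite the broken one):

```shell
# Sketch only: rebuild pg_authid_rolname_index with the cluster offline.
# /var/lib/pgsql/data is a placeholder for the real data directory.
PGDATA=/var/lib/pgsql/data

postgres --single -D "$PGDATA" -P postgres <<'SQL'
REINDEX INDEX pg_authid_rolname_index;
SQL
```

As the reply says, this only helps if the damage is confined to that index; it does nothing for the clobbered base/11564 directory.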
Re: DB corruption
On Fri, Mar 23, 2018 at 01:29:35PM +0530, Akshay Ballarpure wrote:
> I have a query on DB corruption. Is there any way to recover from it
> without losing data ?

Corrupted pages which need to be zeroed in order to recover the rest are data lost forever, except if you have a backup you can roll back to. Please see here for some general instructions about how to deal with such situations:
https://wiki.postgresql.org/wiki/Corruption

First take a deep breath, and take the time to read and understand it.

> Notice: The information contained in this e-mail message and/or
> attachments to it may contain confidential or privileged information.
> If you are not the intended recipient, any dissemination, use, review,
> distribution, printing or copying of the information contained in this
> e-mail message and/or attachments to it are strictly prohibited. If you
> have received this communication in error, please notify us by reply
> e-mail or telephone and immediately and permanently delete the message
> and any attachments. Thank you

This is a public mailing list.
--
Michael
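Before following any of the wiki's recovery steps, its first recommendation is to preserve the evidence. A sketch of that, with placeholder paths:

```shell
# Stop the server (if it is still running) and take a file-level copy of
# the entire data directory before attempting any repair.  Both paths
# below are examples.
pg_ctl -D /var/lib/pgsql/data stop -m fast

# -a preserves ownership, permissions, and symlinks.
cp -a /var/lib/pgsql/data /backup/data.corrupt
```

Every destructive tool mentioned in these threads (pg_resetxlog, REINDEX, initdb) becomes much less scary once an untouched copy of the broken cluster exists.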
DB corruption
Hi,
I have a query on DB corruption. Is there any way to recover from it without losing data ?

Starting postgresql service: [ OK ]
psql: FATAL: index "pg_authid_rolname_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb.

With Best Regards
Akshay

=-=-=
Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you