Re: Trying to handle db corruption 9.6

2019-05-21 Thread Tomas Vondra

On Tue, May 21, 2019 at 04:03:52PM +0000, Greg Clough wrote:

>> My restore command copies the WALs from the archive dir on the primary to
>> an archive dir on the secondary (different from the pg_xlog dir on the
>> secondary).

> I think that your restore command puts them back into the archive, and
> then uncompresses them into pg_xlog, which is what %p represents.



>> Should I run it manually and see if the archives are copied to the
>> archive dir on the secondary, or should I just copy all of them to the
>> pg_xlog dir on the secondary?


> That would be my first test, but as Tomas mentioned, you don't have any
> hint of WAL archives being restored in the postgresql.log... so it's not
> even trying.  It's not likely that archive_command is your problem at the
> moment.
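
For that first test, a minimal manual run of the restore_command on the
standby might look like the following sketch (paths as used elsewhere in
this thread; <walfile> is only a placeholder for a real segment name from
the primary's archive):

  rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/<walfile> /var/lib/pgsql/archive/<walfile>
  gunzip < /var/lib/pgsql/archive/<walfile> > /tmp/<walfile>

If both steps succeed and /tmp/<walfile> comes out as a plausible 16 MB
segment (the default size), the command itself is fine and the problem is
elsewhere.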



>> I tried to start the secondary as a primary (I have a backup..) but I
>> still got an error (invalid primary / secondary checkpoint record).
>> Does it mean that my backup is corrupted?


> I think so, but Tomas could probably confirm whether all hope is lost.  Also,
> I'm not sure if there is a terminology difference, but a "standby" is
> never considered a "backup".  I realise it's late in the day, but even if
> you have a correctly configured standby you should also take backups with
> pg_basebackup, Barman, pgBackRest, etc.
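
For the pg_basebackup route, a sketch of a one-off base backup taken from
the primary could look like this (host, role, and target directory are
illustrative, not taken from this thread):

  pg_basebackup -h <primary-host> -U replication -D /var/lib/pgsql/base_backup -X stream -P

With -X stream, the WAL needed to make the copy consistent is streamed into
the backup itself, so it remains restorable even if the archive is lost.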



Well, I have no idea. We still have no information about how the standby
was created, whether it was ever running fine, and so on. Considering it
does not seem to be getting data from the archive, it may be that it was
created in some strange way and never really worked. And if there really
are no log messages about the restore_command, it probably fails before
the standby even tries to execute it.

So I don't know.


> Restoring backups is where I would be heading now, as things seem
> terribly broken.



Right. But my impression is there are no backups ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Trying to handle db corruption 9.6

2019-05-21 Thread Mariel Cherkassky
Tomas - Well, when I run the restore_command manually it works (the archive
dir exists on the secondary..). Thanks for the explanation about the system
catalogs..

Greg - My restore command copies the WALs from the archive dir on the
primary to an archive dir on the secondary (different from the pg_xlog dir
on the secondary). Should I run it manually and see if the archives are
copied to the archive dir on the secondary, or should I just copy all of
them to the pg_xlog dir on the secondary?
I tried to start the secondary as a primary (I have a backup..) but I still
got an error (invalid primary / secondary checkpoint record). Does it mean
that my backup is corrupted?

On Tue, May 21, 2019 at 16:07 Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

> On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:
> >Tomas :
> >
> >Well, when you say it does not work, why do you think so? Does it print
> >some error, or what? Does it even get executed? It does not seem to be
> >the case, judging by the log (there's no archive_command message).
> >
> >How was the "secondary machine" created? You said you have all the WAL
> >since then - how do you know that?
> >
> >Well, when I start the secondary in recovery mode (the primary is down,
> >auto failover is disabled..) it doesn't start recovering the archived
> >WALs from the primary. The logs of the secondary:
> >receiving incremental file list
> >rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
> >file or directory (2)
> >
> >sent 8 bytes  received 10 bytes  36.00 bytes/sec
> >total size is 0  speedup is 0.00
> >rsync error: some files/attrs were not transferred (see previous errors)
> >(code 23) at main.c(1505) [receiver=3.0.6]
> >sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
> >2019-05-20 09:41:33 PDT  18558  LOG:  entering standby mode
> >2019-05-20 09:41:33 PDT  18558  LOG:  invalid primary checkpoint record
> >2019-05-20 09:41:33 PDT  18558  LOG:  invalid secondary checkpoint link in
> >control file
> >2019-05-20 09:41:33 PDT  18558  PANIC:  could not locate a valid
> checkpoint
> >record
> >2019-05-20 09:41:33 PDT  18555  LOG:  startup process (PID 18558) was
> >terminated by signal 6: Aborted
> >2019-05-20 09:41:33 PDT  18555  LOG:  aborting startup due to startup
> >process failure
> >2019-05-20 09:41:33 PDT  18555  LOG:  database system is shut down
> >2019-05-20 09:56:12 PDT  18701  LOG:  database system was shut down in
> >recovery at 2019-05-01 09:40:02 PDT
> >
> >As I said, the secondary was down for a month and I have all the archived
> >WALs on my primary. I was hoping that the secondary would use the
> >restore_command to restore them:
> >restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f
> >/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
> >
> >My archive_command on the primary was:
> >archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'
> >
> >Am I missing something?
> >
>
> First of all, the way you quote messages is damn confusing - there's no
> clear difference between your message and the message you quote. I don't
> know which mail client you're using, but I suppose it can be configured to
> quote sensibly ...
>
> Well, clearly the standby tries to fetch WAL from the archive, but the
> rsync command fails for some reason. You're in a position to investigate
> further, because you can run it manually - we can't. This has nothing to
> do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
> the standby, and it's confusing because the archive uses the same path.
>
>
> >Another question: if I run initdb to initialize a new cluster and copy
> >the data files of my old cluster into the new one, is there any chance
> >that it will work?
> >I mean, right now my primary is down and can't start up because it is
> >missing an offset file in the pg_multixact/offsets dir.
> >
>
> No, because you won't have the contents of the system catalogs, which map
> the data files to relations (tables, indexes) and contain information
> about the structure (which columns / data types are in the data).
>
> The data files are pretty useless on their own. It might be possible to do
> some manual recovery - say, you might create the same tables in the new
> schema, and then guess which data files belong to them. But there are
> various caveats, e.g. due to dropped columns, etc.
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


Re: Trying to handle db corruption 9.6

2019-05-21 Thread Tomas Vondra

On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:

> Tomas :
>
>> Well, when you say it does not work, why do you think so? Does it print
>> some error, or what? Does it even get executed? It does not seem to be
>> the case, judging by the log (there's no archive_command message).
>>
>> How was the "secondary machine" created? You said you have all the WAL
>> since then - how do you know that?

> Well, when I start the secondary in recovery mode (the primary is down,
> auto failover is disabled..) it doesn't start recovering the archived
> WALs from the primary. The logs of the secondary:
>
> receiving incremental file list
> rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
> file or directory (2)
>
> sent 8 bytes  received 10 bytes  36.00 bytes/sec
> total size is 0  speedup is 0.00
> rsync error: some files/attrs were not transferred (see previous errors)
> (code 23) at main.c(1505) [receiver=3.0.6]
> sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
> 2019-05-20 09:41:33 PDT  18558  LOG:  entering standby mode
> 2019-05-20 09:41:33 PDT  18558  LOG:  invalid primary checkpoint record
> 2019-05-20 09:41:33 PDT  18558  LOG:  invalid secondary checkpoint link in
> control file
> 2019-05-20 09:41:33 PDT  18558  PANIC:  could not locate a valid checkpoint
> record
> 2019-05-20 09:41:33 PDT  18555  LOG:  startup process (PID 18558) was
> terminated by signal 6: Aborted
> 2019-05-20 09:41:33 PDT  18555  LOG:  aborting startup due to startup
> process failure
> 2019-05-20 09:41:33 PDT  18555  LOG:  database system is shut down
> 2019-05-20 09:56:12 PDT  18701  LOG:  database system was shut down in
> recovery at 2019-05-01 09:40:02 PDT
>
> As I said, the secondary was down for a month and I have all the archived
> WALs on my primary. I was hoping that the secondary would use the
> restore_command to restore them:
> restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f
> /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
>
> My archive_command on the primary was:
> archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'
>
> Am I missing something?



First of all, the way you quote messages is damn confusing - there's no
clear difference between your message and the message you quote. I don't
know which mail client you're using, but I suppose it can be configured to
quote sensibly ...

Well, clearly the standby tries to fetch WAL from the archive, but the
rsync command fails for some reason. You're in a position to investigate
further, because you can run it manually - we can't. This has nothing to
do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
the standby, and it's confusing because the archive uses the same path.
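
A quick way to verify that guess is to check for the directory on the
standby and create it if it is missing - a sketch, assuming the postgres OS
user and the path from the restore_command above:

  ls -ld /var/lib/pgsql/archive
  mkdir -p /var/lib/pgsql/archive
  chown postgres:postgres /var/lib/pgsql/archive

After that, re-running the rsync by hand should show whether the transfer
itself succeeds.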



> Another question: if I run initdb to initialize a new cluster and copy
> the data files of my old cluster into the new one, is there any chance
> that it will work?
> I mean, right now my primary is down and can't start up because it is
> missing an offset file in the pg_multixact/offsets dir.



No, because you won't have the contents of the system catalogs, which map
the data files to relations (tables, indexes) and contain information
about the structure (which columns / data types are in the data).

The data files are pretty useless on their own. It might be possible to do
some manual recovery - say, you might create the same tables in the new
schema, and then guess which data files belong to them. But there are
various caveats, e.g. due to dropped columns, etc.
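
To illustrate the mapping that lives in the catalogs: on a working cluster
you can ask where a given table's data file is (the table name here is
hypothetical):

  psql -c "SELECT pg_relation_filepath('my_table');"

which returns something like base/<database-oid>/<relfilenode>. Guessing
that mapping backwards, file by file, is what "manual recovery" would mean
here.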

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Trying to handle db corruption 9.6

2019-05-21 Thread Mariel Cherkassky
Tomas :

> Well, when you say it does not work, why do you think so? Does it print
> some error, or what? Does it even get executed? It does not seem to be
> the case, judging by the log (there's no archive_command message).
>
> How was the "secondary machine" created? You said you have all the WAL
> since then - how do you know that?

Well, when I start the secondary in recovery mode (the primary is down,
auto failover is disabled..) it doesn't start recovering the archived
WALs from the primary. The logs of the secondary:
receiving incremental file list
rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
file or directory (2)

sent 8 bytes  received 10 bytes  36.00 bytes/sec
total size is 0  speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors)
(code 23) at main.c(1505) [receiver=3.0.6]
sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
2019-05-20 09:41:33 PDT  18558  LOG:  entering standby mode
2019-05-20 09:41:33 PDT  18558  LOG:  invalid primary checkpoint record
2019-05-20 09:41:33 PDT  18558  LOG:  invalid secondary checkpoint link in
control file
2019-05-20 09:41:33 PDT  18558  PANIC:  could not locate a valid checkpoint
record
2019-05-20 09:41:33 PDT  18555  LOG:  startup process (PID 18558) was
terminated by signal 6: Aborted
2019-05-20 09:41:33 PDT  18555  LOG:  aborting startup due to startup
process failure
2019-05-20 09:41:33 PDT  18555  LOG:  database system is shut down
2019-05-20 09:56:12 PDT  18701  LOG:  database system was shut down in
recovery at 2019-05-01 09:40:02 PDT

As I said, the secondary was down for a month and I have all the archived
WALs on my primary. I was hoping that the secondary would use the
restore_command to restore them:
restore_command = 'rsync -avzhe ssh postgres@X.X.X.X:/var/lib/pgsql/archive/%f
/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'

My archive_command on the primary was:
archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'

Am I missing something?

Another question: if I run initdb to initialize a new cluster and copy the
data files of my old cluster into the new one, is there any chance that it
will work?
I mean, right now my primary is down and can't start up because it is
missing an offset file in the pg_multixact/offsets dir.

On Tue, May 21, 2019 at 0:04 Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

> On Mon, May 20, 2019 at 08:20:33PM +0300, Mariel Cherkassky wrote:
> >   Hey Greg,
> >   Basically my backup was made after the first pg_resetxlog so I was
> wrong.
>
> Bummer.
>
> >   However, the customer had a secondary machine that wasn't synced for a
> >   month. I have all the WALs since the moment the secondary went out of
> >   sync. Once I started it, I hoped that it would start recovering the
> >   WALs and fill the gap. However, I got an error on the secondary:
> >2019-05-20 10:11:28 PDT  19021  LOG:  entering standby mode
> >   2019-05-20 10:11:28 PDT  19021  LOG:  invalid primary checkpoint record
> >   2019-05-20 10:11:28 PDT  19021  LOG:  invalid secondary checkpoint
> link in
> >   control file
> >   2019-05-20 10:11:28 PDT  19021  PANIC:  could not locate a valid
> >   checkpoint record
> >   2019-05-20 10:11:28 PDT  19018  LOG:  startup process (PID 19021) was
> >   terminated by signal 6: Aborted
> >   2019-05-20 10:11:28 PDT  19018  LOG:  aborting startup due to startup
> >   process failure
> >   2019-05-20 10:11:28 PDT  19018  LOG:  database system is shut down.
>
> >   I checked my secondary archive dir and pg_xlog dir and
> >   it seems that the restore command doesn't work. My restore_command:
>
> >   restore_command = 'rsync -avzhe ssh
> >   postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
> >   gunzip < /var/lib/pgsql/archive/%f > %p'
> >   archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
> >   /var/lib/pgsql/archive %r'
>
> Well, when you say it does not work, why do you think so? Does it print
> some error, or what? Does it even get executed? It does not seem to be
> the case, judging by the log (there's no archive_command message).
>
> How was the "secondary machine" created? You said you have all the WAL
> since then - how do you know that?
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: Trying to handle db corruption 9.6

2019-05-20 Thread Tomas Vondra

On Mon, May 20, 2019 at 08:20:33PM +0300, Mariel Cherkassky wrote:

>   Hey Greg,
>   Basically my backup was made after the first pg_resetxlog so I was wrong.

Bummer.

>   However, the customer had a secondary machine that wasn't synced for a
>   month. I have all the WALs since the moment the secondary went out of
>   sync. Once I started it, I hoped that it would start recovering the WALs
>   and fill the gap. However, I got an error on the secondary:
>
>   2019-05-20 10:11:28 PDT  19021  LOG:  entering standby mode
>   2019-05-20 10:11:28 PDT  19021  LOG:  invalid primary checkpoint record
>   2019-05-20 10:11:28 PDT  19021  LOG:  invalid secondary checkpoint link in
>   control file
>   2019-05-20 10:11:28 PDT  19021  PANIC:  could not locate a valid
>   checkpoint record
>   2019-05-20 10:11:28 PDT  19018  LOG:  startup process (PID 19021) was
>   terminated by signal 6: Aborted
>   2019-05-20 10:11:28 PDT  19018  LOG:  aborting startup due to startup
>   process failure
>   2019-05-20 10:11:28 PDT  19018  LOG:  database system is shut down.
>
>   I checked my secondary archive dir and pg_xlog dir and
>   it seems that the restore command doesn't work. My restore_command:
>
>   restore_command = 'rsync -avzhe ssh
>   postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
>   gunzip < /var/lib/pgsql/archive/%f > %p'
>   archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
>   /var/lib/pgsql/archive %r'


Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be
the case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?
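
One way to answer "does it even get executed" is to watch the standby's log
while it starts, and to make each invocation leave a trace - a sketch (the
log path is illustrative; the wrapped command in recovery.conf keeps the
original rsync/gunzip pipeline, elided here):

  tail -f /var/lib/pgsql/data/pg_log/postgresql-*.log

  # temporarily, in recovery.conf:
  restore_command = 'echo "restore %f" >> /tmp/restore.log; rsync ... ; gunzip ...'

If /tmp/restore.log stays empty, the standby never reached the point of
calling the command at all.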


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Trying to handle db corruption 9.6

2019-05-20 Thread Tomas Vondra

On Mon, May 20, 2019 at 04:20:45PM +0000, Bimal wrote:

>   I had run into the same issue about a year back; luckily I had a standby
>   to quickly promote.  But I wish there were better documentation on how
>   to handle WAL log fill-up and resetting them.


pg_resetxlog is not a tool to deal with "WAL fill-up". It's a last-resort
option to deal with corrupted WAL, and it can easily make matters worse
when used without due consideration. That seems to be the case here,
unfortunately.

On a properly behaving system, running out of disk space for pg_xlog
results in database shutdown. If you also get corrupted WAL, you have
bigger problems, I'm afraid.
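
The corresponding first response when pg_xlog fills up is to measure rather
than reset - for example (data directory path is illustrative):

  df -h /var/lib/pgsql/data/pg_xlog
  du -sh /var/lib/pgsql/data/pg_xlog
  ls /var/lib/pgsql/data/pg_xlog | wc -l

and then fix the cause (commonly a failing archive_command or an abandoned
replication slot) so the server can restart cleanly once space is freed.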

Also, data corruption issues are one-off events, mostly unique. That
makes it rather difficult (~impossible) to write docs about recovering
from them. And it's why there are no magic tools.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Trying to handle db corruption 9.6

2019-05-20 Thread Mariel Cherkassky
Hey Greg,
Basically my backup was made after the first pg_resetxlog so I was wrong.
However, the customer had a secondary machine that wasn't synced for a
month. I have all the WALs since the moment the secondary went out of
sync. Once I started it, I hoped that it would start recovering the WALs
and fill the gap. However, I got an error on the secondary:
 2019-05-20 10:11:28 PDT  19021  LOG:  entering standby mode
2019-05-20 10:11:28 PDT  19021  LOG:  invalid primary checkpoint record
2019-05-20 10:11:28 PDT  19021  LOG:  invalid secondary checkpoint link in
control file
2019-05-20 10:11:28 PDT  19021  PANIC:  could not locate a valid checkpoint
record
2019-05-20 10:11:28 PDT  19018  LOG:  startup process (PID 19021) was
terminated by signal 6: Aborted
2019-05-20 10:11:28 PDT  19018  LOG:  aborting startup due to startup
process failure
2019-05-20 10:11:28 PDT  19018  LOG:  database system is shut down.
I checked my secondary archive dir and pg_xlog dir and
it seems that the restore command doesn't work. My restore_command:
restore_command = 'rsync -avzhe ssh postgres@x.x.x.x:/var/lib/pgsql/archive/%f
/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
/var/lib/pgsql/archive %r'
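
As a side note, that pg_archivecleanup setup can be sanity-checked without
deleting anything via its dry-run flag - a sketch, with a placeholder for
the oldest segment that must be kept:

  /usr/pgsql-9.6/bin/pg_archivecleanup -n /var/lib/pgsql/archive <oldest-kept-walfile>

which only prints the files it would remove.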


On Mon, May 20, 2019, 7:20 PM Bimal  wrote:

> I had run into the same issue about a year back; luckily I had a standby
> to quickly promote.  But I wish there were better documentation on how to
> handle WAL log fill-up and resetting them.
>
> On Monday, May 20, 2019, 9:08:19 AM PDT, Mariel Cherkassky <
> mariel.cherkas...@gmail.com> wrote:
>
>
> A backup was made after the corruption appeared but before I tried using
> the pg_resetxlog command. Basically I just want to start the database with
> the data that is available in the files (I'm OK with losing data that was
> in the cache and wasn't written to disk).
> My question is: how can I continue from here?
> I also sent this mail to the pgadmin mailing list..
>
> On Mon, May 20, 2019 at 18:59 Greg Clough <greg.clo...@ihsmarkit.com> wrote:
>
> > Yes I understand that.. I'm trying to handle it after the backup that I
> > have taken..
>
>
> IMHO the best option here is to keep a safe copy, as you have already done,
> and then restore from a backup and replay whatever WAL you have.  The
> database you have is terminally corrupted and should never be relied upon
> going forward.
>
>
>
> You can try to get it running, and then extract the data with pg_dump...
> but even then you will need to manually verify it’s OK because you have no
> idea which dirty blocks from memory have been written to disk and which
> have not.  Without the WAL you have no way of making it consistent, and if
> they have been destroyed then you’re out of luck.
>
>
>
> If you don’t have backups and archived WAL then fixing what you’ve got may
> be your only option, but you should only go down that route if you have
> to.  If you have to “repair”, then I’d recommend engaging a reputable
> PostgreSQL consultancy to help you.
>
>
>
> Regards,
>
> Greg.
>
> P.S.  This conversation should probably be moved to something like
> pgsql-admin
>
>
>


Re: Trying to handle db corruption 9.6

2019-05-20 Thread Bimal
I had run into the same issue about a year back; luckily I had a standby to
quickly promote.  But I wish there were better documentation on how to
handle WAL log fill-up and resetting them.

On Monday, May 20, 2019, 9:08:19 AM PDT, Mariel Cherkassky
<mariel.cherkas...@gmail.com> wrote:

A backup was made after the corruption appeared but before I tried using
the pg_resetxlog command. Basically I just want to start the database with
the data that is available in the files (I'm OK with losing data that was
in the cache and wasn't written to disk).
My question is: how can I continue from here?
I also sent this mail to the pgadmin mailing list..
On Mon, May 20, 2019 at 18:59 Greg Clough <greg.clo...@ihsmarkit.com> wrote:


> Yes I understand that.. I'm trying to handle it after the backup that I
> have taken..

IMHO the best option here is to keep a safe copy, as you have already done,
and then restore from a backup and replay whatever WAL you have.  The
database you have is terminally corrupted and should never be relied upon
going forward.

You can try to get it running, and then extract the data with pg_dump... but
even then you will need to manually verify it’s OK because you have no idea
which dirty blocks from memory have been written to disk and which have
not.  Without the WAL you have no way of making it consistent, and if they
have been destroyed then you’re out of luck.

If you don’t have backups and archived WAL then fixing what you’ve got may
be your only option, but you should only go down that route if you have
to.  If you have to “repair”, then I’d recommend engaging a reputable
PostgreSQL consultancy to help you.

Regards,

Greg.

P.S.  This conversation should probably be moved to something like
pgsql-admin
           

 



Re: Trying to handle db corruption 9.6

2019-05-20 Thread Mariel Cherkassky
A backup was made after the corruption appeared but before I tried using
the pg_resetxlog command. Basically I just want to start the database with
the data that is available in the files (I'm OK with losing data that was
in the cache and wasn't written to disk).
My question is: how can I continue from here?
I also sent this mail to the pgadmin mailing list..

On Mon, May 20, 2019 at 18:59 Greg Clough <greg.clo...@ihsmarkit.com> wrote:

> > Yes I understand that.. I'm trying to handle it after the backup that I
> > have taken..
>
>
> IMHO the best option here is to keep a safe copy, as you have already done,
> and then restore from a backup and replay whatever WAL you have.  The
> database you have is terminally corrupted and should never be relied upon
> going forward.
>
>
>
> You can try to get it running, and then extract the data with pg_dump...
> but even then you will need to manually verify it’s OK because you have no
> idea which dirty blocks from memory have been written to disk and which
> have not.  Without the WAL you have no way of making it consistent, and if
> they have been destroyed then you’re out of luck.
>
>
>
> If you don’t have backups and archived WAL then fixing what you’ve got may
> be your only option, but you should only go down that route if you have
> to.  If you have to “repair”, then I’d recommend engaging a reputable
> PostgreSQL consultancy to help you.
>
>
>
> Regards,
>
> Greg.
>
> P.S.  This conversation should probably be moved to something like
> pgsql-admin
>
>
>


Re: Trying to handle db corruption 9.6

2019-05-20 Thread Mariel Cherkassky
Yes I understand that.. I'm trying to handle it after the backup that I
have taken..

On Mon, May 20, 2019, 5:49 PM Flo Rance  wrote:

> Hi,
>
> First of all, as stated in the wiki, you'll need to do a filesystem-level
> copy of the database files and put them on another drive before attempting
> to do anything else!
>
> https://wiki.postgresql.org/wiki/Corruption
>
> regards,
> Flo
>
> On Mon, May 20, 2019 at 4:40 PM Mariel Cherkassky <
> mariel.cherkas...@gmail.com> wrote:
>
>> Hey,
>> I'm trying to handle a corruption that one of our customers is facing.
>> His disk space was full, and as a result he decided to run
>> pg_resetxlog a few times (bad idea..).
>> When I connected to the machine I saw that the db was down.
>> When I started the db (service postgresql start) I saw the next error in
>> the logs:
>>
>> DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or
>> directory.
>>
>> The pg_multixact/offsets dir contained one file (0025).
>> The pg_multixact/members dir contains 2 files: 0000 and 0001.
>>
>> I tried to follow the documentation of pg_resetxlog, and ran pg_resetxlog
>> with -m 0xF0A604,0xEA50CE, which are 0025*65536 and 0026*65536 in hex.
>> However, it didn't help and the same error appeared.
>> So I tried to rename the file to 0000, and then the db searched for a file
>> in members that didn't exist.
>> I followed the documentation and changed the multitransaction offset
>> (-O) and the transaction IDs (-c) based on the doc, and then the db
>> started successfully.
>> However, after it started I saw the next msg in the logs:
>> Multixact member wraparound protections are disabled because oldest
>> checkpointed Multixact 65536 doesn't exist. In addition, no one is able to
>> connect to the db (we keep getting errors that the database doesn't exist
>> or the user doesn't exist, even for the postgres user).
>>
>> current relevant rows from the control data :
>>
>> pg_control version number:    960
>>
>> Catalog version number:   201608131
>>
>> Database system identifier:   6692952810876880414
>>
>> Database cluster state:   shut down
>>
>> pg_control last modified: Mon 20 May 2019 07:07:30 AM PDT
>>
>> Latest checkpoint location:   1837/E3000028
>>
>> Prior checkpoint location:    1837/E2000028
>>
>> Latest checkpoint's REDO location:    1837/E3000028
>>
>> Latest checkpoint's REDO WAL file:    0000000100001837000000E3
>>
>> Latest checkpoint's TimeLineID:   1
>>
>> Latest checkpoint's PrevTimeLineID:   1
>>
>> Latest checkpoint's full_page_writes: on
>>
>> Latest checkpoint's NextXID:  0:3
>>
>> Latest checkpoint's NextOID:  1
>>
>> Latest checkpoint's NextMultiXactId:  131072
>>
>> Latest checkpoint's NextMultiOffset:  52352
>>
>> Latest checkpoint's oldestXID:3
>>
>> Latest checkpoint's oldestXID's DB:   0
>>
>> Latest checkpoint's oldestActiveXID:  0
>>
>> Latest checkpoint's oldestMultiXid:   65536
>>
>> Latest checkpoint's oldestMulti's DB: 0
>>
>> Latest checkpoint's oldestCommitTsXid:4604
>>
>> Latest checkpoint's newestCommitTsXid:5041
>>
>>
>>
>> I also checked, and I saw that the customer has all the WALs (backed up)
>> but without any base backup..
>> Any recommendations on how to handle this case?
>>
>


Re: Trying to handle db corruption 9.6

2019-05-20 Thread Flo Rance
Hi,

First of all, as stated in the wiki, you'll need to do a filesystem-level
copy of the database files and put them on another drive before attempting
to do anything else!

https://wiki.postgresql.org/wiki/Corruption

regards,
Flo

On Mon, May 20, 2019 at 4:40 PM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Hey,
> I'm trying to handle a corruption that one of our customers is facing.
> His disk space was full, and as a result he decided to run
> pg_resetxlog a few times (bad idea..).
> When I connected to the machine I saw that the db was down.
> When I started the db (service postgresql start) I saw the next error in
> the logs:
>
> DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or
> directory.
>
> The pg_multixact/offsets dir contained one file (0025).
> The pg_multixact/members dir contains 2 files: 0000 and 0001.
>
> I tried to follow the documentation of pg_resetxlog, and ran pg_resetxlog
> with -m 0xF0A604,0xEA50CE, which are 0025*65536 and 0026*65536 in hex.
> However, it didn't help and the same error appeared.
> So I tried to rename the file to 0000, and then the db searched for a file
> in members that didn't exist.
> I followed the documentation and changed the multitransaction offset (-O)
> and the transaction IDs (-c) based on the doc, and then the db started
> successfully.
> However, after it started I saw the next msg in the logs:
> Multixact member wraparound protections are disabled because oldest
> checkpointed Multixact 65536 doesn't exist. In addition, no one is able to
> connect to the db (we keep getting errors that the database doesn't exist
> or the user doesn't exist, even for the postgres user).
>
> current relevant rows from the control data :
>
> pg_control version number:    960
>
> Catalog version number:   201608131
>
> Database system identifier:   6692952810876880414
>
> Database cluster state:   shut down
>
> pg_control last modified: Mon 20 May 2019 07:07:30 AM PDT
>
> Latest checkpoint location:   1837/E3000028
>
> Prior checkpoint location:    1837/E2000028
>
> Latest checkpoint's REDO location:    1837/E3000028
>
> Latest checkpoint's REDO WAL file:    0000000100001837000000E3
>
> Latest checkpoint's TimeLineID:   1
>
> Latest checkpoint's PrevTimeLineID:   1
>
> Latest checkpoint's full_page_writes: on
>
> Latest checkpoint's NextXID:  0:3
>
> Latest checkpoint's NextOID:  1
>
> Latest checkpoint's NextMultiXactId:  131072
>
> Latest checkpoint's NextMultiOffset:  52352
>
> Latest checkpoint's oldestXID:3
>
> Latest checkpoint's oldestXID's DB:   0
>
> Latest checkpoint's oldestActiveXID:  0
>
> Latest checkpoint's oldestMultiXid:   65536
>
> Latest checkpoint's oldestMulti's DB: 0
>
> Latest checkpoint's oldestCommitTsXid:4604
>
> Latest checkpoint's newestCommitTsXid:5041
>
>
>
> I also checked, and I saw that the customer has all the WALs (backed up)
> but without any base backup..
> Any recommendations on how to handle this case?
>


Trying to handle db corruption 9.6

2019-05-20 Thread Mariel Cherkassky
Hey,
I'm trying to handle a corruption that one of our customers is facing.
His disk space was full, and as a result he decided to run
pg_resetxlog a few times (bad idea..).
When I connected to the machine I saw that the db was down.
When I started the db (service postgresql start) I saw the next error in
the logs:

DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or
directory.

The pg_multixact/offsets dir contained one file (0025).
The pg_multixact/members dir contains 2 files: 0000 and 0001.

I tried to follow the documentation of pg_resetxlog, and ran pg_resetxlog
with -m 0xF0A604,0xEA50CE, which are 0025*65536 and 0026*65536 in hex.
However, it didn't help and the same error appeared.
So I tried to rename the file to 0000, and then the db searched for a file
in members that didn't exist.
I followed the documentation and changed the multitransaction offset (-O)
and the transaction IDs (-c) based on the doc, and then the db started
successfully.
However, after it started I saw the next msg in the logs:
Multixact member wraparound protections are disabled because oldest
checkpointed Multixact 65536 doesn't exist. In addition, no one is able to
connect to the db (we keep getting errors that the database doesn't exist
or the user doesn't exist, even for the postgres user).
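
(For reference, the 9.6 docs derive the -m argument from the file names in
pg_multixact/offsets: next = (largest name + 1) * 65536, oldest = smallest
name * 65536, both in hex. With 0025 as the only file, that formula would
give, assuming the default data directory:

  pg_resetxlog -m 0x260000,0x250000 /var/lib/pgsql/data

The values here are illustrative, derived from that formula rather than
from this system.)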

current relevant rows from the control data :

pg_control version number:    960

Catalog version number:   201608131

Database system identifier:   6692952810876880414

Database cluster state:   shut down

pg_control last modified: Mon 20 May 2019 07:07:30 AM PDT

Latest checkpoint location:   1837/E3000028

Prior checkpoint location:    1837/E2000028

Latest checkpoint's REDO location:    1837/E3000028

Latest checkpoint's REDO WAL file:    0000000100001837000000E3

Latest checkpoint's TimeLineID:   1

Latest checkpoint's PrevTimeLineID:   1

Latest checkpoint's full_page_writes: on

Latest checkpoint's NextXID:  0:3

Latest checkpoint's NextOID:  1

Latest checkpoint's NextMultiXactId:  131072

Latest checkpoint's NextMultiOffset:  52352

Latest checkpoint's oldestXID:3

Latest checkpoint's oldestXID's DB:   0

Latest checkpoint's oldestActiveXID:  0

Latest checkpoint's oldestMultiXid:   65536

Latest checkpoint's oldestMulti's DB: 0

Latest checkpoint's oldestCommitTsXid:4604

Latest checkpoint's newestCommitTsXid:5041



I also checked, and I saw that the customer has all the WALs (backed up) but
without any base backup..
Any recommendations on how to handle this case?


Re: DB corruption

2018-03-23 Thread Tom Lane
Akshay Ballarpure  writes:
> I have a query on DB corruption. Is there any way to recover from it
> without losing data?

You've already lost data, evidently.

> Starting postgresql service: [ OK ]
> psql: FATAL: index "pg_authid_rolname_index" contains unexpected zero page 
> at block 0
> HINT: Please REINDEX it.

This is not good.  It'd be possible to reindex that index, certainly,
but the question is what other files have also been clobbered.
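
(Reindexing a shared catalog index like that one generally has to be done
in single-user mode with system indexes ignored - a sketch, with an
illustrative data directory:

  postgres --single -P -D /var/lib/pgsql/data postgres
  backend> REINDEX INDEX pg_authid_rolname_index;

-P tells the backend not to rely on system indexes while rebuilding them.)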

> psql: FATAL: "base/11564" is not a valid data directory
> DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
> HINT: You might need to initdb.

Based on the OID I'm going to guess that this is from an attempt to
connect to the "postgres" database.  (I'm also going to guess that
you're running 8.4.x, because any later PG version would have a higher
OID for "postgres".)  Can you connect to any other databases?  If so,
do their contents seem intact?  If you're really lucky, meaning (a) the
damage is confined to that DB and (b) you didn't keep any important
data in it, then dropping and recreating the "postgres" DB might be
enough to get you out of trouble.  But pg_authid_rolname_index is
a cluster-global index, not specific to the "postgres" DB, so the
fact that it too seems to be damaged is not promising.

TBH your best bet, if the data in this installation is valuable and
you don't have adequate backups, is to hire a professional data
recovery service --- there are several companies that specialize in
getting as much out of a corrupted PG installation as possible.
(See https://www.postgresql.org/support/professional_support/ for
some links.)  You should then plan on updating to some newer PG
release; 8.4.x has been out of support for years, and there are lots
of known-and-unfixed bugs in it.

regards, tom lane



Re: DB corruption

2018-03-23 Thread Michael Paquier
On Fri, Mar 23, 2018 at 01:29:35PM +0530, Akshay Ballarpure wrote:
> I have a query on DB corruption. Is there any way to recover from it
> without losing data?

Corrupted pages which need to be zeroed in order to recover the rest are
data lost forever, unless you have a backup you can roll back to.
Please see here for some general instructions about how to deal with such
situations:
https://wiki.postgresql.org/wiki/Corruption

First take a deep breath, and take the time to read and understand it.
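
(The usual last-resort knob for that zeroing, once everything recoverable
has been copied elsewhere, is the zero_damaged_pages setting - e.g., in a
session, before dumping what remains:

  SET zero_damaged_pages = on;  -- pages with damaged headers are read as empty, losing their contents

Use it only on a throwaway copy of the data directory.)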

> Notice: The information contained in this e-mail
> message and/or attachments to it may contain 
> confidential or privileged information. If you are 
> not the intended recipient, any dissemination, use, 
> review, distribution, printing or copying of the 
> information contained in this e-mail message 
> and/or attachments to it are strictly prohibited. If 
> you have received this communication in error, 
> please notify us by reply e-mail or telephone and 
> immediately and permanently delete the message 
> and any attachments. Thank you

This is a public mailing list.
--
Michael




DB corruption

2018-03-23 Thread Akshay Ballarpure
Hi,
I have a query on DB corruption. Is there any way to recover from it
without losing data?

Starting postgresql service: [ OK ]
psql: FATAL: index "pg_authid_rolname_index" contains unexpected zero page 
at block 0
HINT: Please REINDEX it.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb.
psql: FATAL: "base/11564" is not a valid data directory
DETAIL: File "base/11564/PG_VERSION" does not contain valid data.
HINT: You might need to initdb. 

With Best Regards
Akshay
