Re: [HACKERS] Funny WAL corruption issue

2017-08-11 Thread Chris Travers
On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark  wrote:

> On 10 August 2017 at 15:26, Chris Travers  wrote:
> >
> >
> > The bitwise comparison is interesting.  Remember the error was:
> >
> > pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> > pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
> > 1146880
> ...
> > Since this didn't throw a checksum error (we have data checksums
> disabled but wal records ISTR have a separate CRC check), would this
> perhaps indicate that the checksum operated over incorrect data?
>
> No checksum error and this "unexpected pageaddr" doesn't necessarily
> mean data corruption. It could mean that when the database stopped logging
> it was reusing a wal file and the old wal stream had a record boundary
> on the same byte position. So the previous record checksum passed and
> the following record checksum passes but the record header is for a
> different wal stream position.
>
> I think you could actually hack xlogdump to ignore this condition and
> keep outputting and you'll see whether the records that follow appear
> to be old wal log data.  I haven't actually tried this though.
>

For better or worse, I get a different error at the same spot if I try this
(doing so involved disabling the pageaddr check in the backend WAL reader):

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: invalid
contrecord length 4509 at 1E39C/E1117FF8
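
For reference, the check in question is the pageaddr validation in the WAL
reader's page-header routine; roughly the following, paraphrasing the 9.6-era
xlogreader.c from memory (so the exact code may differ), with the early return
taken out so the mismatch is no longer fatal:

/* Sketch of the page-header check behind "unexpected pageaddr" (9.6-era
 * xlogreader.c, from memory).  xlp_pageaddr is the WAL address the page
 * claims to belong to; a recycled segment still carries the old address. */
if (hdr->xlp_pageaddr != recaddr)
{
    char        fname[MAXFNAMELEN];

    XLogFileName(fname, state->readPageTLI, segno);
    report_invalid_record(state,
                          "unexpected pageaddr %X/%X in log segment %s, offset %u",
                          (uint32) (hdr->xlp_pageaddr >> 32),
                          (uint32) hdr->xlp_pageaddr,
                          fname, offset);
    /* return false; */      /* <- disabled for this experiment */
}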

If I hack it to ignore all errors on that record, no further records come
out, though it does run over the same records.

This leads me to conclude there are no further valid records.


>
> --
> greg
>



-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Funny WAL corruption issue

2017-08-11 Thread Chris Travers
On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark  wrote:

> On 10 August 2017 at 15:26, Chris Travers  wrote:
> >
> >
> > The bitwise comparison is interesting.  Remember the error was:
> >
> > pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> > pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
> > 1146880
> ...
> > Since this didn't throw a checksum error (we have data checksums
> disabled but wal records ISTR have a separate CRC check), would this
> perhaps indicate that the checksum operated over incorrect data?
>
> No checksum error and this "unexpected pageaddr" doesn't necessarily
> mean data corruption. It could mean that when the database stopped logging
> it was reusing a wal file and the old wal stream had a record boundary
> on the same byte position. So the previous record checksum passed and
> the following record checksum passes but the record header is for a
> different wal stream position.
>

I expect to test this theory shortly.

Assuming it is correct, what can we do to prevent restarts of slaves from
running into it?


> I think you could actually hack xlogdump to ignore this condition and
> keep outputting and you'll see whether the records that follow appear
> to be old wal log data.  I haven't actually tried this though.
>
> --
> greg
>



-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Funny WAL corruption issue

2017-08-11 Thread Greg Stark
On 10 August 2017 at 15:26, Chris Travers  wrote:
>
>
> The bitwise comparison is interesting.  Remember the error was:
>
> pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
> 1146880
...
> Since this didn't throw a checksum error (we have data checksums disabled but 
> wal records ISTR have a separate CRC check), would this perhaps indicate that 
> the checksum operated over incorrect data?

No checksum error and this "unexpected pageaddr" doesn't necessarily
mean data corruption. It could mean that when the database stopped logging
it was reusing a wal file and the old wal stream had a record boundary
on the same byte position. So the previous record checksum passed and
the following record checksum passes but the record header is for a
different wal stream position.
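
(To see why the CRC doesn't catch this: the per-record CRC covers only the
record's own bytes, so nothing in it ties the record to the WAL position it
currently occupies.  Roughly, paraphrasing the 9.6-era xlogreader.c from
memory, so details may differ:)

/* Record-level CRC check, as in ValidXLogRecord().  The CRC is computed
 * over the record body plus the header up to xl_crc; the surrounding page
 * header (and hence xlp_pageaddr) is validated separately, so a stale but
 * intact record from a recycled segment still passes this test. */
pg_crc32c   crc;

INIT_CRC32C(crc);
COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord,
            record->xl_tot_len - SizeOfXLogRecord);
COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
FIN_CRC32C(crc);

if (!EQ_CRC32C(record->xl_crc, crc))
    report_invalid_record(state,
                          "incorrect resource manager data checksum in record at %X/%X",
                          (uint32) (recptr >> 32), (uint32) recptr);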

I think you could actually hack xlogdump to ignore this condition and
keep outputting and you'll see whether the records that follow appear
to be old wal log data.  I haven't actually tried this though.

-- 
greg




Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
On Thu, Aug 10, 2017 at 3:17 PM, Vladimir Rusinov 
wrote:

>
>
> On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <
> a.aleks...@postgrespro.ru> wrote:
>
>> I just wanted to point out that a hardware issue or third party software
>> issues (bugs in FS, software RAID, ...) could not be fully excluded from
>> the list of suspects. According to the talk by Christophe Pettus [1]
>> it's not as uncommon as most people think.
>
>
> This still might be the case of hardware corruption, but it does not look
> like one.
>

Yeah, I don't think so either.  The systems were not restarted, only the
service, so I don't think this is a lie-on-write case.  We have ECC with
full checks, etc.  It really looks like something I initiated caused it, but
I'm not sure what, and I'm really not interested in trying to reproduce it on
a db of this size.

>
> The likelihood of two different people seeing a similar error message just a
> year apart is low. In our experience, hardware corruption usually looks like
> a random single-bit flip (most commonly bad CPU or memory), a bunch of zeroes
> (bad storage), or a bunch of complete garbage (usually indicating in-memory
> pointer corruption).
>
> Chris, if you still have the original WAL segment from the master and its
> corrupt copy from the standby, can you do a bit-by-bit comparison to see how
> they differ? Also, if you can, please share some hardware details.
> Specifically, do you use ECC? If so, are there any ECC errors logged? Do
> you use physical disks/SSDs or some form of storage virtualization?
>

Straight on bare metal, ECC with no errors logged.  SSD for both data and
WAL.

The bitwise comparison is interesting.  Remember the error was:

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
1146880


Starting with the good segment:

Good WAL segment; I think the record starts at 003b:


0117fb0     003b   

0117fc0 7f28 e111 e39c 0001 0940  cb88 db01

0117fd0 0200  067f  4000  2249 0195

0117fe0 0001  8001  b5c3  05ff 

0117ff0  0003   008c   

0118000 d093 0005 0001  8000 e111 e39c 0001

0118010 0084    7fb8 e111 e39c 0001

0118020 0910  ccac 2eba 2000 0056 067f 

0118030 4000  2249 0195 b5c4  08ff 0001

0118040 0002 0003 0004 0005 0006 0007 0008 0009

0118050 000a 000b 000c 000d 000e 000f 0010 0011

0118060 0012 0013 0014 0015 0016 0017 0018 0019

0118070 001a 001b 001c 001d 001e 001f 0020 0021

The same region of the corrupt segment from the standby:

0117fb0     003b   

0117fc0 7f28 e111 e39c 0001 0940  cb88 db01

0117fd0 0200  067f  4000  2249 0195

0117fe0 0001  8001  b5c3  05ff 

0117ff0  0003   4079 ce05 1cce ecf9

0118000 d093 0005 0001  8000 6111 e375 0001

0118010 119d    cfd4 00cc ca00 0410

0118020 1800 7c00 5923 544b dc20 914c 7a5c afec

0118030 db45 0060 b700 1910 1800 7c00 791f 2ede

0118040 c573 a110 5a88 e1e6 ab48 0034 9c00 2210

0118050 1800 7c00 4415 400d 2c7e b5e3 7c88 bcef

0118060 4666 00db 9900 0a10 1800 7c00 7d1d b355
0118070 d432 8365 de99 4dba 87c7 00ed 6200 2210

I think the divergence is interesting here.  Up through 0117ff8, they are
identical.  Then the last half of the line differs.
The first half of the next line is the same (but up through 011800a this
time), but the last 6 bytes differ (those six hold what appears to be the
page address causing the problem), and we only have a few bits different in
the rest of the line.

It looks like some data and some flags were overwritten, perhaps while the
process exited.  Very interesting.


> Also, in the absolute majority of cases corruption is caught by checksums. I
> am not familiar with the WAL protocol - do we have enough checksums when
> writing it out and on the wire? I suspect there are many more things
> PostgreSQL could do to be more resilient, and at least detect corruption
> earlier.
>

Since this didn't throw a checksum error (we have data checksums disabled
but wal records ISTR have a separate CRC check), would this perhaps
indicate that the checksum operated over incorrect data?

>
> --
> Vladimir Rusinov
> PostgreSQL SRE, Google Ireland
>
> Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
> Registered in Dublin, Ireland
> Registration Number: 368047
>



-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Vladimir Rusinov
On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <
a.aleks...@postgrespro.ru> wrote:

> I just wanted to point out that a hardware issue or third party software
> issues (bugs in FS, software RAID, ...) could not be fully excluded from
> the list of suspects. According to the talk by Christophe Pettus [1]
> it's not as uncommon as most people think.


This still might be the case of hardware corruption, but it does not look
like one.

The likelihood of two different people seeing a similar error message just a
year apart is low. In our experience, hardware corruption usually looks like
a random single-bit flip (most commonly bad CPU or memory), a bunch of zeroes
(bad storage), or a bunch of complete garbage (usually indicating in-memory
pointer corruption).

Chris, if you still have the original WAL segment from the master and its
corrupt copy from the standby, can you do a bit-by-bit comparison to see how
they differ? Also, if you can, please share some hardware details.
Specifically, do you use ECC? If so, are there any ECC errors logged? Do
you use physical disks/SSDs or some form of storage virtualization?
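
Something as simple as the following would do (file names are illustrative):

# byte-level summary of where the two copies diverge
cmp -l wal_segment_from_master wal_segment_from_standby | head

# or a hexdump diff for a bit more context
od -A x -t x2 wal_segment_from_master  > master.hex
od -A x -t x2 wal_segment_from_standby > standby.hex
diff master.hex standby.hex | head -n 40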

Also, in the absolute majority of cases corruption is caught by checksums. I
am not familiar with the WAL protocol - do we have enough checksums when
writing it out and on the wire? I suspect there are many more things
PostgreSQL could do to be more resilient, and at least detect corruption
earlier.

-- 
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047




Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Aleksander Alekseev
Hi Chris,

> I ran into a funny situation today regarding PostgreSQL replication and
> wal corruption and wanted to go over what I think happened and what I
> wonder about as a possible solution.

Sad story. Unfortunately, I have no idea what the reason could be, nor can I
suggest a good way to find it unless there is an already-known sequence of
steps that reproduces the issue.

I just wanted to point out that a hardware issue or third party software
issues (bugs in FS, software RAID, ...) could not be fully excluded from
the list of suspects. According to the talk by Christophe Pettus [1]
it's not as uncommon as most people think.

If the issue reproduces from time to time on one replica and doesn't on a
second identical replica, there is a good chance that you've hit a hardware
issue. Another thing worth checking is the SMART status of the drive.
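
For example (the device name is only a placeholder):

smartctl -H /dev/sda    # quick overall health verdict
smartctl -a /dev/sda    # full attributes plus the device error log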

[1]: 
http://www.pgcon.org/2017/schedule/attachments/453_corruption-pgcon-2017.pdf

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
> Yes.  Exactly the same output until a certain point where pg_xlogdump dies
> with an error.  That is the same LSN where PostgreSQL died with an error on
> restart.
>

One other thing that is possibly relevant after reading through the last
bug report is the error pg_xlogdump gives:

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
1146880



--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
Sorry, meant to reply all.

On Thu, Aug 10, 2017 at 2:19 PM, Vladimir Borodin  wrote:

> Hi, Chris.
>
> On 10 Aug 2017, at 15:09, Chris Travers wrote:
>
> Hi;
>
> I ran into a funny situation today regarding PostgreSQL replication and
> wal corruption and wanted to go over what I think happened and what I
> wonder about as a possible solution.
>
> Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB
> database with variable load.  Master database has two slaves and generates
> 10-20MB of WAL traffic a second.  The data_checksum option is off.
>
>
> The problem occurred when I attempted to restart the service on the slave
> using pg_ctl (I believe the service had been started with sys V init
> scripts).  On trying to restart, it gave me a nice "Invalid memory
> allocation request" error and promptly stopped.
>
> The main logs showed a lot of messages like before the restart:
> 2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match
> any process
>
> On restart, the following was logged to stderr:
> LOG:  entering standby mode
> LOG:  redo starts at 1E39C/8B77B458
> LOG:  consistent recovery state reached at 1E39C/E1117FF8
> FATAL:  invalid memory alloc request size 3456458752
> LOG:  startup process (PID 18167) exited with exit code 1
> LOG:  terminating any other active server processes
>
> LOG:  database system is shut down
>
> After some troubleshooting I found that the wal segment had become
> corrupt, I copied the correct one from the master and everything came up to
> present.
>
> So It seems like somewhere something crashed big time on the back-end and
> when we tried to restart, the wal ended in an invalid way.
>
>
> We have reported the same thing [1] nearly a year ago. Could you please
> check with pg_xlogdump that both WALs (normal from master and corrupted)
> are exactly the same up to a certain LSN?
>
> [1] https://www.postgresql.org/message-id/20160614103415.
> 5796.6885%40wrigleys.postgresql.org
>

Yes.  Exactly the same output until a certain point where pg_xlogdump dies
with an error.  That is the same LSN where PostgreSQL died with an error on
restart.

>
>
> I am wondering what can be done to prevent these sorts of things from
> happening in the future if, for example, a replica dies in the middle of a
> wal fsync.
> --
> Best Wishes,
> Chris Travers
>
> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
> lock-in.
> http://www.efficito.com/learn_more
>
>
>
> --
> May the force be with you…
> https://simply.name
>
>


-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Vladimir Borodin
Hi, Chris.

> On 10 Aug 2017, at 15:09, Chris Travers wrote:
> 
> Hi;
> 
> I ran into a funny situation today regarding PostgreSQL replication and wal 
> corruption and wanted to go over what I think happened and what I wonder 
> about as a possible solution.
> 
> Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB 
> database with variable load.  Master database has two slaves and generates 
> 10-20MB of WAL traffic a second.  The data_checksum option is off.
> 
> 
> The problem occurred when I attempted to restart the service on the slave 
> using pg_ctl (I believe the service had been started with sys V init 
> scripts).  On trying to restart, it gave me a nice "Invalid memory allocation 
> request" error and promptly stopped.
> 
> The main logs showed a lot of messages like before the restart:
> 2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any 
> process
> 
> On restart, the following was logged to stderr:
> LOG:  entering standby mode
> LOG:  redo starts at 1E39C/8B77B458
> LOG:  consistent recovery state reached at 1E39C/E1117FF8
> FATAL:  invalid memory alloc request size 3456458752
> LOG:  startup process (PID 18167) exited with exit code 1
> LOG:  terminating any other active server processes
> LOG:  database system is shut down
> 
> After some troubleshooting I found that the wal segment had become corrupt, I 
> copied the correct one from the master and everything came up to present.
> 
> So It seems like somewhere something crashed big time on the back-end and 
> when we tried to restart, the wal ended in an invalid way.

We have reported the same thing [1] nearly a year ago. Could you please check 
with pg_xlogdump that both WALs (normal from master and corrupted) are exactly 
the same up to a certain LSN?

[1] 
https://www.postgresql.org/message-id/20160614103415.5796.6885%40wrigleys.postgresql.org
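
Something like the following would do; the paths are illustrative, and the
segment file name is whichever segment the error points at:

pg_xlogdump /path/to/master_copy/SEGMENT  > master.dump
pg_xlogdump /path/to/standby_copy/SEGMENT > standby.dump
diff master.dump standby.dump | head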

> 
> I am wondering what can be done to prevent these sorts of things from 
> happening in the future if, for example, a replica dies in the middle of a 
> wal fsync. 
> -- 
> Best Wishes,
> Chris Travers
> 
> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor 
> lock-in.
> http://www.efficito.com/learn_more 

--
May the force be with you…
https://simply.name



[HACKERS] Funny WAL corruption issue

2017-08-10 Thread Chris Travers
Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal
corruption and wanted to go over what I think happened and what I wonder
about as a possible solution.

Basic information is custom-built PostgreSQL 9.6.3 on Gentoo, on a ~5TB
database with variable load.  Master database has two slaves and generates
10-20MB of WAL traffic a second.  The data_checksum option is off.


The problem occurred when I attempted to restart the service on the slave
using pg_ctl (I believe the service had been started with sys V init
scripts).  On trying to restart, it gave me a nice "Invalid memory
allocation request" error and promptly stopped.

The main logs showed a lot of messages like the following before the restart:

2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any
process


On restart, the following was logged to stderr:

LOG:  entering standby mode

LOG:  redo starts at 1E39C/8B77B458

LOG:  consistent recovery state reached at 1E39C/E1117FF8

FATAL:  invalid memory alloc request size 3456458752

LOG:  startup process (PID 18167) exited with exit code 1

LOG:  terminating any other active server processes

LOG:  database system is shut down

After some troubleshooting, I found that the WAL segment had become corrupt.
I copied the correct one from the master and everything came up to the present.
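
Roughly speaking, the fix amounted to something like this (paths, host, and
segment name here are illustrative rather than the exact commands):

pg_ctl -D "$PGDATA" stop -m fast      # if it was still half up
scp master:/var/lib/postgresql/9.6/data/pg_xlog/SEGMENT "$PGDATA"/pg_xlog/
pg_ctl -D "$PGDATA" start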

So it seems like somewhere something crashed big time on the back-end, and
when we tried to restart, the WAL ended in an invalid way.

I am wondering what can be done to prevent these sorts of things from
happening in the future if, for example, a replica dies in the middle of a
wal fsync.
-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more