Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-28 Thread Greg Stark
On 28 Feb 2014 06:19, Andres Freund and...@2ndquadrant.com wrote: On 2014-02-27 23:41:08 +, Greg Stark wrote: Though I notice something I can't understand here. After activating the new clone subsequent attempts to select rows from the page bump the LSN, presumably due to touching

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-28 Thread Andres Freund
On 2014-02-28 10:44:14 +, Greg Stark wrote: On 28 Feb 2014 06:19, Andres Freund and...@2ndquadrant.com wrote: Generally the LSN is computed when writing, not when a buffer is modified, so that's not particularly surprising. It'd be interesting to see what the records are that end on

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Alvaro Herrera
Andres Freund wrote: On 2014-02-26 18:18:05 -0300, Alvaro Herrera wrote: Andres Freund wrote: static void heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) { ... HeapTupleHeaderClearHotUpdated(htup); HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Greg Stark
On Thu, Feb 27, 2014 at 2:34 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Greg, Peter, if you could update your standbys to the current HEAD of REL9_3_STABLE for the affected apps and verify the problem no longer shows up in a reasonable timeframe, it would be great. (I'm assuming you

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Greg Stark
Though I notice something I can't understand here. After activating the new clone subsequent attempts to select rows from the page bump the LSN, presumably due to touching hint bits (since the prune xid hasn't changed). But the checksum hasn't changed even after running CHECKPOINT. How is it

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Andres Freund
On 2014-02-27 23:41:08 +, Greg Stark wrote: Though I notice something I can't understand here. After activating the new clone subsequent attempts to select rows from the page bump the LSN, presumably due to touching hint bits (since the prune xid hasn't changed). But the checksum hasn't

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-26 Thread Alvaro Herrera
Andres Freund wrote: static void heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) { ... HeapTupleHeaderClearHotUpdated(htup); HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); HeapTupleHeaderSetCmax(htup, FirstCommandId, false); /* Make sure there is no forward chain link

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-26 Thread Alvaro Herrera
I forgot to mention that the bug can be reproduced in a hot-standby setup with the attached isolation spec. Note that full_page_writes must be turned off (otherwise, the updates use full-page writes and then the bogus code is not run). Once the spec is executed, in the replica run SET

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-26 Thread Andres Freund
On 2014-02-26 18:18:05 -0300, Alvaro Herrera wrote: Andres Freund wrote: static void heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) { ... HeapTupleHeaderClearHotUpdated(htup); HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); HeapTupleHeaderSetCmax(htup,

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-20 13:25:35 +, Greg Stark wrote: rmgr: Heap len (rec/tot): 235/267, tx: 5943845, lsn: FD/2F0A3640, prev FD/2F0A3600, bkp: , desc: insert: rel 1663/16385/212653; tid 13065/2 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid |

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Alvaro Herrera
Here's a reformatted copy. I think this is the same bug as Peter G. reported in http://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com I have a hunch that this is related to the heap_lock_updated business. I haven't investigated yet. Greg Stark

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
Hi, On 2014-02-24 17:55:14 -0300, Alvaro Herrera wrote: Greg Stark wrote: I have a database where a couple rows don't appear in index scans but do appear in sequential scans. It looks like the same problem as Peter reported but this is a different database. I've extracted all the

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-24 22:17:31 +0100, Andres Freund wrote: Those together explain the story. Note this bit: static void heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) { ... HeapTupleHeaderClearHotUpdated(htup); HeapTupleHeaderSetXmax(htup, xlrec->locking_xid);

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Peter Geoghegan
On Mon, Feb 24, 2014 at 1:17 PM, Andres Freund and...@2ndquadrant.com wrote: We somehow need to have a policy of testing changes to the WAL format without full_page_writes. They hide bugs in replay far, far too often. What's the easiest way to get atomic page writes at the FS level on your

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-24 15:05:37 -0800, Peter Geoghegan wrote: On Mon, Feb 24, 2014 at 1:17 PM, Andres Freund and...@2ndquadrant.com wrote: We somehow need to have a policy of testing changes to the WAL format without full_page_writes. They hide bugs in replay far, far too often. What's the easiest

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Peter Geoghegan
On Mon, Feb 24, 2014 at 3:17 PM, Andres Freund and...@2ndquadrant.com wrote: TBH I don't care about torn pages during normal testing. I don't want to suggest disabling it for real workloads with real data, just that it's important to do so during development/testing of WAL related code,

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-24 15:20:13 -0800, Peter Geoghegan wrote: On Mon, Feb 24, 2014 at 3:17 PM, Andres Freund and...@2ndquadrant.com wrote: TBH I don't care about torn pages during normal testing. I don't want to suggest disabling it for real workloads with real data, just that it's important to do

[HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-20 Thread Greg Stark
I have a database where a couple rows don't appear in index scans but do appear in sequential scans. It looks like the same problem as Peter reported but this is a different database. I've extracted all the xlogdump records and below are the ones I think are relevant. You can see that lp 2 gets

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-20 Thread Andres Freund
Hi, On 2014-02-20 13:25:35 +, Greg Stark wrote: I have a database where a couple rows don't appear in index scans but do appear in sequential scans. It looks like the same problem as Peter reported but this is a different database. I've extracted all the xlogdump records and below are