Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-28 Thread Andres Freund
On 2014-02-28 10:44:14 +, Greg Stark wrote: > On 28 Feb 2014 06:19, "Andres Freund" wrote: > > Generally the LSN is computed when writing, not when a buffer is > > modified, so that's not particularly surprising. It'd be interesting to > > see what the records are that end on those LSNs. > >

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-28 Thread Greg Stark
On 28 Feb 2014 06:19, "Andres Freund" wrote: > > On 2014-02-27 23:41:08 +, Greg Stark wrote: > > Though I notice something I can't understand here. > > > > After activating the new clone subsequent attempts to select rows from > > the page bump the LSN, presumably due to touching hint bits (si

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Andres Freund
On 2014-02-27 23:41:08 +, Greg Stark wrote: > Though I notice something I can't understand here. > > After activating the new clone subsequent attempts to select rows from > the page bump the LSN, presumably due to touching hint bits (since the > prune xid hasn't changed). But the checksum has

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Greg Stark
Though I notice something I can't understand here. After activating the new clone subsequent attempts to select rows from the page bump the LSN, presumably due to touching hint bits (since the prune xid hasn't changed). But the checksum hasn't changed even after running CHECKPOINT. How is it poss

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Greg Stark
On Thu, Feb 27, 2014 at 2:34 PM, Alvaro Herrera wrote: > Greg, Peter, if you could update your standbys to the current HEAD of > REL9_3_STABLE for the affected apps and verify the problem no longer > shows up in a reasonable timeframe, it would be great. (I'm assuming > you saw this happen repeat

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-27 Thread Alvaro Herrera
Andres Freund wrote: > On 2014-02-26 18:18:05 -0300, Alvaro Herrera wrote: > > Andres Freund wrote: > > > > > static void > > > heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) > > > { > > > ... > > > HeapTupleHeaderClearHotUpdated(htup); > > > HeapTupleHeaderSetXmax(htup, xlrec->locking

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-26 Thread Andres Freund
On 2014-02-26 18:18:05 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > static void > > heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) > > { > > ... > > HeapTupleHeaderClearHotUpdated(htup); > > HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); > > HeapTupleHeaderSetCmax(h

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-26 Thread Andres Freund
On 2014-02-26 19:03:42 -0300, Alvaro Herrera wrote: > I forgot to mention that the bug can be reproduced in a hot-standby > setup with the attached isolation spec. Note that full_page_writes must > be turned off (otherwise, the updates use full-page writes and then the > bogus code is not run). O

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-26 Thread Alvaro Herrera
I forgot to mention that the bug can be reproduced in a hot-standby setup with the attached isolation spec. Note that full_page_writes must be turned off (otherwise, the updates use full-page writes and then the bogus code is not run). Once the spec is executed, in the replica run SET enable_seq

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-26 Thread Alvaro Herrera
Andres Freund wrote: > static void > heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) > { > ... > HeapTupleHeaderClearHotUpdated(htup); > HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); > HeapTupleHeaderSetCmax(htup, FirstCommandId, false); > /* Make sure there is no forward ch

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-24 15:20:13 -0800, Peter Geoghegan wrote: > On Mon, Feb 24, 2014 at 3:17 PM, Andres Freund wrote: > > TBH I don't care about torn pages during normal testing. I don't want to > > suggest disabling it for real workloads with real data, just that it's > > important to do so during develop

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Peter Geoghegan
On Mon, Feb 24, 2014 at 3:17 PM, Andres Freund wrote: > TBH I don't care about torn pages during normal testing. I don't want to > suggest disabling it for real workloads with real data, just that it's > important to do so during development/testing of WAL related code, > because otherwise it will

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-24 15:05:37 -0800, Peter Geoghegan wrote: > On Mon, Feb 24, 2014 at 1:17 PM, Andres Freund wrote: > > We somehow need to have a policy of testing changes to the WAL format > > without full_page_writes. They hide bugs in replay far, far too often. > > What's the easiest way to get atomi

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Peter Geoghegan
On Mon, Feb 24, 2014 at 1:17 PM, Andres Freund wrote: > We somehow need to have a policy of testing changes to the WAL format > without full_page_writes. They hide bugs in replay far, far too often. What's the easiest way to get atomic page writes at the FS level on your laptop? ZFS or some data

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-24 22:17:31 +0100, Andres Freund wrote: > Those together explain the story. Note this bit: > > static void > heap_xlog_lock(XLogRecPtr lsn, XLogRecord *record) > { > ... > HeapTupleHeaderClearHotUpdated(htup); > HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); > HeapTupleHe

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
Hi, On 2014-02-24 17:55:14 -0300, Alvaro Herrera wrote: > Greg Stark wrote: > > I have a database where a couple rows don't appear in index scans > > but do appear in sequential scans. It looks like the same problem as > > Peter reported but this is a different database. I've extracted all > > t

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Alvaro Herrera
Here's a reformatted copy. I think this is the same bug as Peter G. reported in http://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com I have a hunch that this is related to the heap_lock_updated business. I haven't investigated yet. Greg Stark

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-24 Thread Andres Freund
On 2014-02-20 13:25:35 +, Greg Stark wrote: > rmgr: Heap len (rec/tot): 235/267, tx: 5943845, lsn: > FD/2F0A3640, prev FD/2F0A3600, bkp: , desc: insert: rel > 1663/16385/212653; tid 13065/2 > lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | > t_ctid | t_i

Re: [HACKERS] Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?

2014-02-20 Thread Andres Freund
Hi, On 2014-02-20 13:25:35 +, Greg Stark wrote: > I have a database where a couple rows don't appear in index scans > but do appear in sequential scans. It looks like the same problem as > Peter reported but this is a different database. I've extracted all > the xlogdump records and below ar