subject:"\[HACKERS\] Data corruption issues using streaming replication on 9.0.14\/9.2.5\/9.3.1"

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-22 Thread Heikki Linnakangas


On 19.11.2013 16:20, Andres Freund wrote:

On 2013-11-18 23:15:59 +0100, Andres Freund wrote:

Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597


Yes, the combination of those is guilty.

Man, this is (to a good part my) bad.


But that'd mean nobody noticed it during 9.3's beta...


It's fairly hard to reproduce artificially since a) there have to be
enough transactions starting and committing from the start of the
checkpoint the standby is starting from to the point it does
LogStandbySnapshot() to cross a 32768 boundary b) hint bits often save
the game by not accessing clog at all anymore and thus not noticing the
corruption.
I've reproduced the issue by having an INSERT ONLY table that's never
read from. It's helpful to disable autovacuum.


For the archive, here's what I used to reproduce this. It creates master 
and a standby, and also uses an INSERT only table. To make it trigger 
more easily, it helps to insert sleeps in CreateCheckpoint(), around the 
LogStandbySnapshot() call.


- Heikki


test-hot-standby-bug.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-22 Thread Heikki Linnakangas


On 21.11.2013 22:55, Andres Freund wrote:

On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:

On 19.11.2013 16:22, Andres Freund wrote:

On 2013-11-19 15:20:01 +0100, Andres Freund wrote:

Imo something the attached patch should be done. The description I came

g up with is:


 Fix Hot-Standby initialization of clog and subtrans.


Looks ok for a back-patchable fix.


Do you plan to commit this? Or who is going to?


Ok, committed.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-22 Thread Andres Freund

On 2013-11-22 15:01:10 +0200, Heikki Linnakangas wrote:
 On 21.11.2013 22:55, Andres Freund wrote:
 On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
 Looks ok for a back-patchable fix.
 
 Do you plan to commit this? Or who is going to?
 
 Ok, committed.

Thanks!

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Andres Freund

On 2013-11-20 17:46:22 +0100, Andres Freund wrote:
 On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote:
  Isn't it possible that the standby has already incorrectly set
  HEAP_XMIN_INVALID hint bit on a page? The full page images generated by
  VACUUM FREEZE will correct the damage, but if not, e.g. because
  full_page_writes=off, strange things will happen.
 
 The xlog_heap_freeze records should repair that afaics.

Nope, it wouldn't, it just uses heap_freeze_tuple() again which *only*
sets HEAP_XMIN_COMMITTED but does *not* clear HEAP_XMIN_INVALID. And if
we were to change that now, it wouldn't help already frozen tuples that
wouldn't get frozen again.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Joshua D. Drake



Hello,

This is turning into a rather large thread and I have a simple question:

Is a work-around to this problem as simple as disabling streaming 
replication and enabling log shipping instead?


Joshua D. Drake


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Andres Freund

On 2013-11-21 12:39:54 -0800, Josh Berkus wrote:
 On 11/21/2013 12:36 PM, Joshua D. Drake wrote:
  
  Hello,
  
  This is turning into a rather large thread and I have a simple question:
  
  Is a work-around to this problem as simple as disabling streaming
  replication and enabling log shipping instead?

No. Check
http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de

The problem is starting with hot_standby=on on a system with
recovery.conf present. It is independent of whether you use streaming
replication, archive based recovery, or just shutdown the server and
manually copy xlog segments there.
As long as hot_standby=on, and recovery.conf is present you can hit the
bug.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Tatsuo Ishii

 Well, I happen to have some pieces of such a framework: the parts which
 can automate spinning up arbitrarily complex groups of replicas and
 doing failover between them.  What we'd still need is:
 
 a) a slightly better workload than pgbench
 b) a way to compare and test databases for data corruption of several kinds
 
 Can someone else kick in to help with this?

Recently pgpool-II introduces its own regression test framework
written in shell script which can create a streaming replication
primary and arbitrary number of standbys in single server. The test
scenarios are written in a shell script, thus is reasonably
flexible. I have read through all the thread but I am not sure how to
reproduce the problem reliably. If I would know it, probably I can
integrate into the regression test.

 I think this last issue shows that it's critical as a community to have
 such a testing framework in place, otherwise we really need to halt all
 work on replication until we have such a thing.

Agreed. I think this is very critical for us.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Andres Freund

On 2013-11-21 10:25:20 -0500, J Smith wrote:
 On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
  Imo something the attached patch should be done. The description I came
  up with is:
 
  Fix Hot-Standby initialization of clog and subtrans.
 
 
 G'day Andres.
 
 This wouldn't happen to be related to the issue I was having with
 missing pg_subtrans files, would it, as discussed here
 http://www.postgresql.org/message-id/cadfupgc5bmtv-yg9znxv-vcfkb+jprqs7m2oesqxam_4z1j...@mail.gmail.com
 ? Based on my reading of the commit message and what I could glean
 from the patch itself, they seem to be related.

I don't think so - for one, pg_subtrans isn't really the problems with
that bug, for another, it won't cause missing files. Also, you're not
using replication, right?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Andres Freund

On 2013-11-21 23:02:29 +0200, Heikki Linnakangas wrote:
 On 21.11.2013 22:53, Andres Freund wrote:
 On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
 On 11/21/2013 12:46 PM, Andres Freund wrote:
 The problem is starting with hot_standby=on on a system with
 recovery.conf present. It is independent of whether you use streaming
 replication, archive based recovery, or just shutdown the server and
 manually copy xlog segments there.
 As long as hot_standby=on, and recovery.conf is present you can hit the
 bug.
 
 Oh, aha.  There have to be some transactions which are awaiting
 checkpoint, though, correct?  As in, if there's no activity on the
 master, you can't trigger the bug?
 
 Correct. Also, if you *start* at such a checkpoint you're not vulnerable
 until the standby is restarted.
 
 Keep in mind that autovacuum counts as activity in this case. If you're
 unlucky, that is. It's next to impossible to determine after-the-fact if
 there has been activity in the master that might've caused problems.

Well, in that case you're going to just loose the
pg_database/pg_class/stats updates from analyze/vacuum. That's annoying,
but not too bad.

 I wouldn't try to narrow it down any further than that, it gets too
 complicated.

Yes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread J Smith

On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
 Imo something the attached patch should be done. The description I came
 up with is:

 Fix Hot-Standby initialization of clog and subtrans.


G'day Andres.

This wouldn't happen to be related to the issue I was having with
missing pg_subtrans files, would it, as discussed here
http://www.postgresql.org/message-id/cadfupgc5bmtv-yg9znxv-vcfkb+jprqs7m2oesqxam_4z1j...@mail.gmail.com
? Based on my reading of the commit message and what I could glean
from the patch itself, they seem to be related.

If there's anything I can do to help test, just say the word and I can
fit in some time. Cheers.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Heikki Linnakangas


On 21.11.2013 22:53, Andres Freund wrote:

On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:

On 11/21/2013 12:46 PM, Andres Freund wrote:

The problem is starting with hot_standby=on on a system with
recovery.conf present. It is independent of whether you use streaming
replication, archive based recovery, or just shutdown the server and
manually copy xlog segments there.
As long as hot_standby=on, and recovery.conf is present you can hit the
bug.


Oh, aha.  There have to be some transactions which are awaiting
checkpoint, though, correct?  As in, if there's no activity on the
master, you can't trigger the bug?


Correct. Also, if you *start* at such a checkpoint you're not vulnerable
until the standby is restarted.


Keep in mind that autovacuum counts as activity in this case. If 
you're unlucky, that is. It's next to impossible to determine 
after-the-fact if there has been activity in the master that might've 
caused problems.


If you have ever set hot_standby=on in your standby server, running one 
of the affected versions, you're at risk. The standby might be corrupt, 
and should be rebuild from a base backup. The higher the transaction 
rate in the master, the higher the risk.


I wouldn't try to narrow it down any further than that, it gets too 
complicated.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread J Smith

On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund and...@2ndquadrant.com wrote:

 I don't think so - for one, pg_subtrans isn't really the problems with
 that bug, for another, it won't cause missing files. Also, you're not
 using replication, right?


Actually, this server is a master being replicated to a standby. We
haven't experienced the problem on the standby, but we are indeed
replicating the server experiencing the issue.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Heikki Linnakangas


On 21.11.2013 22:55, Andres Freund wrote:

Hi,

On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:

On 19.11.2013 16:22, Andres Freund wrote:

On 2013-11-19 15:20:01 +0100, Andres Freund wrote:

Imo something the attached patch should be done. The description I came

g up with is:


 Fix Hot-Standby initialization of clog and subtrans.


Looks ok for a back-patchable fix.


Do you plan to commit this? Or who is going to?


I can commit it tomorrow, unless someone beats me to it.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread J Smith

On Thu, Nov 21, 2013 at 12:23 PM, Andres Freund and...@2ndquadrant.com wrote:

 It's still not this issue in that case, but I might have an idea... Do
 you have hot_standby_feedback enabled?


Nope, hot_standby_feedback is set to its default setting which is off.

At any rate, I'm going to try and capture a backtrace from the error
I'm experiencing as suggested in the other thread by Robert Haas and
we'll see if that provides any insight.

Cheers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Andres Freund

On 2013-11-21 11:54:58 -0500, J Smith wrote:
 On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 
  I don't think so - for one, pg_subtrans isn't really the problems with
  that bug, for another, it won't cause missing files. Also, you're not
  using replication, right?
 
 
 Actually, this server is a master being replicated to a standby. We
 haven't experienced the problem on the standby, but we are indeed
 replicating the server experiencing the issue.

It's still not this issue in that case, but I might have an idea... Do
you have hot_standby_feedback enabled?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Josh Berkus

On 11/21/2013 12:46 PM, Andres Freund wrote:
 No. Check
 http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de
 
 The problem is starting with hot_standby=on on a system with
 recovery.conf present. It is independent of whether you use streaming
 replication, archive based recovery, or just shutdown the server and
 manually copy xlog segments there.
 As long as hot_standby=on, and recovery.conf is present you can hit the
 bug.

Oh, aha.  There have to be some transactions which are awaiting
checkpoint, though, correct?  As in, if there's no activity on the
master, you can't trigger the bug?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Andres Freund

On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
 On 11/21/2013 12:46 PM, Andres Freund wrote:
  No. Check
  http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de
  
  The problem is starting with hot_standby=on on a system with
  recovery.conf present. It is independent of whether you use streaming
  replication, archive based recovery, or just shutdown the server and
  manually copy xlog segments there.
  As long as hot_standby=on, and recovery.conf is present you can hit the
  bug.
 
 Oh, aha.  There have to be some transactions which are awaiting
 checkpoint, though, correct?  As in, if there's no activity on the
 master, you can't trigger the bug?

Correct. Also, if you *start* at such a checkpoint you're not vulnerable
until the standby is restarted.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Josh Berkus

On 11/21/2013 12:36 PM, Joshua D. Drake wrote:
 
 Hello,
 
 This is turning into a rather large thread and I have a simple question:
 
 Is a work-around to this problem as simple as disabling streaming
 replication and enabling log shipping instead?

Yes, and re-cloning the replica, in case it already has corruption.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-21 Thread Andres Freund

Hi,

On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
 On 19.11.2013 16:22, Andres Freund wrote:
 On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
 Imo something the attached patch should be done. The description I came
g up with is:
 
  Fix Hot-Standby initialization of clog and subtrans.
 
 Looks ok for a back-patchable fix.

Do you plan to commit this? Or who is going to?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Heikki Linnakangas


On 19.11.2013 16:22, Andres Freund wrote:

On 2013-11-19 15:20:01 +0100, Andres Freund wrote:

Imo something the attached patch should be done. The description I came
up with is:

 Fix Hot-Standby initialization of clog and subtrans.


Looks ok for a back-patchable fix.

It's a bit bizarre that the ExtendSUBTRANS loop in 
ProcArrayApplyRecoveryInfo looks different from the one in 
RecordKnownAssignedTransactionIds, but both look correct to me.


In master, it'd be nice to do some further cleanup. Some gripes:

In ProcArrayApplyXidAssignment, I wonder if it would be best to just 
remove the (standbyState == STANDBY_INITIALIZED) check altogether. The 
KnownAssignedXidsRemoveTree() that follows is harmless if there is 
nothing in the known-assigned-xids array (xact_redo_commit does it in 
STANDBY_INITIALIZED state too). The other thing that's done after that 
check is updating lastOverflowedXid, and AFAICS it would be harmless to 
update that, too. It will be overwritten by the 
ProcArrayApplyRecoveryInfo() call that comes later.


Clog, subtrans and multixact are all handled differently. Extensions of 
clog and multixact are WAL-logged, but extensions of subtrans are not. 
They all have a Startup function, but it has a slightly different 
purpose. StartupCLOG only sets latest_page_number, but StartupSUBTRANS 
and StartupMultiXact zero out the current page. For CLOG, the TrimCLOG() 
function does that. Why is clog handled differently from multixact?


StartupCLOG() and StartupMultiXact set latest_page_number, but 
StartupSUBTRANS does not. Is that a problem for subtrans? StartupCLOG() 
and StartupMultiXact() are called at different stages in hot standby - 
StartupCLOG() is called at the beginning of recovery, but 
StartupMultiXact() is only called at end of recovery. Why the 
discrepancy, does latest_page_number need to be set during hot standby 
or not?


I think we should bite the bullet, and WAL-log the extension of 
subtrans, too. Then make the startup and extension procedure for all the 
SLRUs the same.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
 On 19.11.2013 16:22, Andres Freund wrote:
 On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
 Imo something the attached patch should be done. The description I came
 up with is:
 
  Fix Hot-Standby initialization of clog and subtrans.
 
 Looks ok for a back-patchable fix.
 
 It's a bit bizarre that the ExtendSUBTRANS loop in
 ProcArrayApplyRecoveryInfo looks different from the one in
 RecordKnownAssignedTransactionIds, but both look correct to me.

That's because of the different upper bounds (nextxid) vs xid]), but
yea, I wondered about that as well.

 In master, it'd be nice to do some further cleanup. Some gripes:
 
 In ProcArrayApplyXidAssignment, I wonder if it would be best to just remove
 the (standbyState == STANDBY_INITIALIZED) check altogether. The
 KnownAssignedXidsRemoveTree() that follows is harmless if there is nothing
 in the known-assigned-xids array (xact_redo_commit does it in
 STANDBY_INITIALIZED state too). The other thing that's done after that check
 is updating lastOverflowedXid, and AFAICS it would be harmless to update
 that, too. It will be overwritten by the ProcArrayApplyRecoveryInfo() call
 that comes later.

I was thinking about removing it entirely in the patch, but chose not to
do so. I don't really care which way we go.

 Clog, subtrans and multixact are all handled differently. Extensions of clog
 and multixact are WAL-logged, but extensions of subtrans are not. They all
 have a Startup function, but it has a slightly different purpose.
 StartupCLOG only sets latest_page_number, but StartupSUBTRANS and
 StartupMultiXact zero out the current page. For CLOG, the TrimCLOG()
 function does that. Why is clog handled differently from multixact?

I'd guess clog and multixact are handled differently because multixact
supposedly is never queried during recovery. But I don't that's actually
still true, thinking of 9.3's changes around fkey locks and
HeapTupleGetUpdateXid().
So it's probably time to split StartupMultiXact similar to clog's routines.

 StartupCLOG() and StartupMultiXact set latest_page_number, but
 StartupSUBTRANS does not. Is that a problem for subtrans?

I don't think it is, the difference is that StartupSUBTRANS() zeroes out
the current subtrans page which will set latest_page_number, the other's
access the pages normally, which doesn't set it.

 StartupCLOG() and
 StartupMultiXact() are called at different stages in hot standby -
 StartupCLOG() is called at the beginning of recovery, but StartupMultiXact()
 is only called at end of recovery. Why the discrepancy, does
 latest_page_number need to be set during hot standby or not?

latest_page_number primarily is an optimization, isn't it? Except for a
safeguard check in SimpleLruTruncate() it should only influence victim
buffer initialization. But: slru.c explicitly doesn't initialize
-latest_page_number, which means it's zeroed from a memset slightly
above. Which seems we might cause problems when performing truncations
during recovery, before the first page is zeroed (which'd set
latest_page_number again).
...
Hm. Do we actually *ever* truncate the multixact slru during recovery?
clog.c's truncations are WAL logged, TruncateSUBTRANS() is performed by
restartpoints, but there's no callers to TruncateMultiXact but
vac_truncate_clog and it's not logged? That doesn't seem right.

 I think we should bite the bullet, and WAL-log the extension of subtrans,
 too. Then make the startup and extension procedure for all the SLRUs the
 same.

Hm. I don't really see a big advantage in that? I am all for trying to
bring more symetry to the startup routines, but I don't think forcing
WAL logging for something scrapped every restart is necessary for that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Kevin Grittner

Andres Freund and...@2ndquadrant.com wrote:
 On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote:

 On 11/19/2013 09:20 AM, Andres Freund wrote:
 Imo this warrants and expedited point release :(

+1

 I presume anyone who is vulnerable to it would need to recreate
 their secondary servers to get rid of potential problems?

 Yes. There's less expensive ways to do it, but those seem to
 complicated to suggest.

Wouldn't a database VACUUM FREEZE fix it, with WAL-logged writing
of everything that doesn't yet have hint bits set?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:
  Yes. There's less expensive ways to do it, but those seem to
  complicated to suggest.
 
 Wouldn't a database VACUUM FREEZE fix it, with WAL-logged writing
 of everything that doesn't yet have hint bits set?

Besides also being pretty expensive it still wouldn't correct the clog -
and we don't always rely on hint bits.

I was thinking about just copying over the clog from the primary, but
it's not trivial if the standby isn't cought up, since the primary's
clog could have been truncated ahead of what the standby needs.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Kevin Grittner

Andres Freund and...@2ndquadrant.com wrote:
 On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:

 Wouldn't a database VACUUM FREEZE fix it, with WAL-logged
 writing of everything that doesn't yet have hint bits set?

 Besides also being pretty expensive it still wouldn't correct the
 clog - and we don't always rely on hint bits.

I'm talking about after a fix is deployed, fixing up the possible
corruption.  Can you explain where VACUUM FREEZE would not suffice?
I don't know of anywhere that we have hint bits set for a tuple and
we go fetch the clog bits in spite of that.  I don't understand
where that would make sense; especially since I thought that a
database FREEZE followed by a checkpoint releases old clog space
anyway.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
  On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:
 
  Wouldn't a database VACUUM FREEZE fix it, with WAL-logged
  writing of everything that doesn't yet have hint bits set?
 
  Besides also being pretty expensive it still wouldn't correct the
  clog - and we don't always rely on hint bits.
 
 I'm talking about after a fix is deployed, fixing up the possible
 corruption.  Can you explain where VACUUM FREEZE would not suffice?
 I don't know of anywhere that we have hint bits set for a tuple and
 we go fetch the clog bits in spite of that.

There's several places. Grep for TransactionIdDidCommit() and ignore the
bits in tqual.c. Many of the remaining ones do not look at hint bits.

 I don't understand
 where that would make sense; especially since I thought that a
 database FREEZE followed by a checkpoint releases old clog space
 anyway.

It only releases them up to the (cluster wide) xmin horizon. So if there
are older snapshots or prepared xacts around...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Kevin Grittner

Andres Freund and...@2ndquadrant.com wrote:
 On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:

 I don't understand where that would make sense; especially since
 I thought that a database FREEZE followed by a checkpoint
 releases old clog space anyway.

 It only releases them up to the (cluster wide) xmin horizon. So
 if there are older snapshots or prepared xacts around...

So as long as there are no open transactions or prepared
transactions on the master which started before the release with
the fix is applied, VACUUM FREEZE would be guaranteed to work? 
Since I don't see how a non-prepared transaction would be running
from before a minor release upgrade, that just means we have to
make sure there are no prepared transactions from before the
upgrade?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
  On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
 
  I don't understand where that would make sense; especially since
  I thought that a database FREEZE followed by a checkpoint
  releases old clog space anyway.
 
  It only releases them up to the (cluster wide) xmin horizon. So
  if there are older snapshots or prepared xacts around...
 
 So as long as there are no open transactions or prepared
 transactions on the master which started before the release with
 the fix is applied, VACUUM FREEZE would be guaranteed to work? 
 Since I don't see how a non-prepared transaction would be running
 from before a minor release upgrade, that just means we have to
 make sure there are no prepared transactions from before the
 upgrade?

That's not a bad point. So the way to fix it would be:

1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Acquire enough new xids to make sure we cross a clog page (?)
4) Jot down a new xid:  SELECT txid_current()::bigint % (1::bigint33-1)
5) vacuumdb -z -a
6) Ensure that there are no prepared xacts older than 3) around
SELECT *
FROM pg_prepared_xacts
ORDER BY age(transaction) DESC LIMIT 1;
7) Ensure the xmin horizon is above the one from: 3:
SELECT datname, datfrozenxid
FROM pg_database
WHERE datname != 'template0'
ORDER BY age(datfrozenxid) DESC LIMIT 1;
8) Get the current lsn: SELECT pg_current_xlog_location();
9) verify on each standby that SELECT pg_last_xlog_receive_location() is
   past 7)
10) be happy

I am not sure how we can easily compute that 6) and 7) are past 3) in
the presence of xid wraparounds.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Kevin Grittner

Andres Freund and...@2ndquadrant.com wrote:
 On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:


  So as long as there are no open transactions or prepared
  transactions on the master which started before the release with
  the fix is applied, VACUUM FREEZE would be guaranteed to work? 
  Since I don't see how a non-prepared transaction would be running
  from before a minor release upgrade, that just means we have to
  make sure there are no prepared transactions from before the
  upgrade?
 
 That's not a bad point. So the way to fix it would be:
 
 1) Restart the standby to the new minor release, wait for catchup
 2) Restart the primary (fast or smart) to the new minor release
 3) Acquire enough new xids to make sure we cross a clog page (?)
 4) Jot down a new xid:  SELECT txid_current()::bigint % (1::bigint33-1)
 5) vacuumdb -z -a
 6) Ensure that there are no prepared xacts older than 3) around
 SELECT *
 FROM pg_prepared_xacts
 ORDER BY age(transaction) DESC LIMIT 1;
 7) Ensure the xmin horizon is above the one from: 3:
 SELECT datname, datfrozenxid
 FROM pg_database
 WHERE datname != 'template0'
 ORDER BY age(datfrozenxid) DESC LIMIT 1;
 8) Get the current lsn: SELECT pg_current_xlog_location();
 9) verify on each standby that SELECT pg_last_xlog_receive_location() is
    past 7)
 10) be happy
 
 I am not sure how we can easily compute that 6) and 7) are past 3) in
 the presence of xid wraparounds.


I may well be missing something here, but wouldn't it be sufficient to?:
1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
4) Run CHECKPOINT command on primary, or just wait for one to run
5) Wait for standby to process to the checkpoint
6) Be happy

--
Kevin Grittner 
EDB: http://www.enterprisedb.com 
The Enterprise PostgreSQL Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 07:06:04 -0800, Kevin Grittner wrote:
  That's not a bad point. So the way to fix it would be:
  
  1) Restart the standby to the new minor release, wait for catchup
  2) Restart the primary (fast or smart) to the new minor release
  3) Acquire enough new xids to make sure we cross a clog page (?)
  4) Jot down a new xid:  SELECT txid_current()::bigint % (1::bigint33-1)
  5) vacuumdb -z -a
  6) Ensure that there are no prepared xacts older than 3) around
  SELECT *
  FROM pg_prepared_xacts
  ORDER BY age(transaction) DESC LIMIT 1;
  7) Ensure the xmin horizon is above the one from: 3:
  SELECT datname, datfrozenxid
  FROM pg_database
  WHERE datname != 'template0'
  ORDER BY age(datfrozenxid) DESC LIMIT 1;
  8) Get the current lsn: SELECT pg_current_xlog_location();
  9) verify on each standby that SELECT pg_last_xlog_receive_location() is
     past 7)
  10) be happy
  
  I am not sure how we can easily compute that 6) and 7) are past 3) in
  the presence of xid wraparounds.
 
 
 I may well be missing something here, but wouldn't it be sufficient to?:
 1) Restart the standby to the new minor release, wait for catchup
 2) Restart the primary (fast or smart) to the new minor release
 3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
 4) Run CHECKPOINT command on primary, or just wait for one to run
 5) Wait for standby to process to the checkpoint
 6) Be happy

Well, in some cases it might. But what if there's a prepared xact
around? Or a transaction started directly after 2) preventing
FreezeLimit to go up? Or vacuum_defer_cleanup_age is set? Or there's
another bug like 4c697d8f4845823a8af67788b219ffa4516ad14c?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Heikki Linnakangas


On 20.11.2013 17:06, Kevin Grittner wrote:

Andres Freund and...@2ndquadrant.com wrote:

On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:




  So as long as there are no open transactions or prepared
  transactions on the master which started before the release with
  the fix is applied, VACUUM FREEZE would be guaranteed to work?
  Since I don't see how a non-prepared transaction would be running
  from before a minor release upgrade, that just means we have to
  make sure there are no prepared transactions from before the
  upgrade?


That's not a bad point. So the way to fix it would be:

1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Acquire enough new xids to make sure we cross a clog page (?)
4) Jot down a new xid:  SELECT txid_current()::bigint % (1::bigint33-1)
5) vacuumdb -z -a
6) Ensure that there are no prepared xacts older than 3) around
SELECT *
FROM pg_prepared_xacts
ORDER BY age(transaction) DESC LIMIT 1;
7) Ensure the xmin horizon is above the one from: 3:
SELECT datname, datfrozenxid
FROM pg_database
WHERE datname != 'template0'
ORDER BY age(datfrozenxid) DESC LIMIT 1;
8) Get the current lsn: SELECT pg_current_xlog_location();
9) verify on each standby that SELECT pg_last_xlog_receive_location() is
past 7)
10) be happy

I am not sure how we can easily compute that 6) and 7) are past 3) in
the presence of xid wraparounds.



I may well be missing something here, but wouldn't it be sufficient to?:
1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
4) Run CHECKPOINT command on primary, or just wait for one to run
5) Wait for standby to process to the checkpoint
6) Be happy


Isn't it possible that the standby has already incorrectly set 
HEAP_XMIN_INVALID hint bit on a page? The full page images generated by 
VACUUM FREEZE will correct the damage, but if not, e.g. because 
full_page_writes=off, strange things will happen.


Personally, I wouldn't trust anything less than a new base backup.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote:
 Isn't it possible that the standby has already incorrectly set
 HEAP_XMIN_INVALID hint bit on a page? The full page images generated by
 VACUUM FREEZE will correct the damage, but if not, e.g. because
 full_page_writes=off, strange things will happen.

The xlog_heap_freeze records should repair that afaics.

 Personally, I wouldn't trust anything less than a new base backup.

But I still tend to agree.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Josh Berkus

Andrews, Kevin:

Presumably a replica created while all traffic was halted on the master
would be clean, correct?  This bug can only be triggered if there's
heavy write load on the master, right?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Josh Berkus

On 11/20/2013 10:30 AM, Josh Berkus wrote:
 Andrews, Kevin:

Andres, that is.

 
 Presumably a replica created while all traffic was halted on the master
 would be clean, correct?  This bug can only be triggered if there's
 heavy write load on the master, right?
 

Also, just to verify:

If someone is doing PITR based on a snapshot taken with pg_basebackup,
that will only trip this corruption bug if the user has hot_standby=on
in their config *while restoring*?  Or is it critical if they have
hot_standby=on while backing up?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 10:48:41 -0800, Josh Berkus wrote:
  Presumably a replica created while all traffic was halted on the master
  would be clean, correct?  This bug can only be triggered if there's
  heavy write load on the master, right?

Kinda. It's unfortunately necessary to understand how HS works to some
degree:
Everytime a server is (re-)started with a recovery.conf present and
hot_standby=on (be it streaming, archive based replication or PITR) the
Hot Standby code is used.
(Crash|Replication)-Recovery starts by reading the last checkpoint (from
pg_control or, if present, backup.label) and then replays WAL from the
'redo' point included in the checkpoint. The bug then occurs when it
first (or, in some case second time) replays a 'xl_running_xacts'
record. That's used to reconstruct information needed to allow queries.

Everytime the server in HS mode allows connections (consistent recovery state
reached at ... and database system is ready to accept read only
connections in the log), the bug can be triggered. If there weren't too
many transactions at that point, the problem won't occur until the
standby is restarted.

 If someone is doing PITR based on a snapshot taken with pg_basebackup,
 that will only trip this corruption bug if the user has hot_standby=on
 in their config *while restoring*?  Or is it critical if they have
 hot_standby=on while backing up?

hot_standby=on only has an effect while starting up with a recovery.conf
present. So, if you have an old base backup around and all WAL files,
you can start from that.

Does that answer your questsions?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Josh Berkus

Andres,

 Everytime the server in HS mode allows connections (consistent recovery state
 reached at ... and database system is ready to accept read only
 connections in the log), the bug can be triggered. If there weren't too
 many transactions at that point, the problem won't occur until the
 standby is restarted.

Oh, so this doesn't just happen when the base backup is first taken;
*any* time the standby is restarted, it can happen. (!!!)

 If someone is doing PITR based on a snapshot taken with pg_basebackup,
 that will only trip this corruption bug if the user has hot_standby=on
 in their config *while restoring*?  Or is it critical if they have
 hot_standby=on while backing up?
 
 hot_standby=on only has an effect while starting up with a recovery.conf
 present. So, if you have an old base backup around and all WAL files,
 you can start from that.
 
 Does that answer your questsions?

Yeah, thanks.

If you have any ideas for how we'd write code to scan for this kind of
corruption, please post them.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
 Andres,
 
  Everytime the server in HS mode allows connections (consistent recovery 
  state
  reached at ... and database system is ready to accept read only
  connections in the log), the bug can be triggered. If there weren't too
  many transactions at that point, the problem won't occur until the
  standby is restarted.
 
 Oh, so this doesn't just happen when the base backup is first taken;
 *any* time the standby is restarted, it can happen. (!!!)

Yes.

 If you have any ideas for how we'd write code to scan for this kind of
 corruption, please post them.

I don't really have one. Current corruption would be somewhat easy to
detect (walk through the clog, check if all commit bits match), but that
doesn't detect wether already truncated clog was corrupted.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Christophe Pettus


On Nov 20, 2013, at 3:57 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
 Oh, so this doesn't just happen when the base backup is first taken;
 *any* time the standby is restarted, it can happen. (!!!)
 
 Yes.

So, to be completely clear, any secondary running the affected versions which 
is started with hot_standby=on could potentially be corrupted even if it never 
connects to a primary?

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-20 Thread Andres Freund

On 2013-11-20 16:36:46 -0800, Christophe Pettus wrote:
 
 On Nov 20, 2013, at 3:57 PM, Andres Freund and...@2ndquadrant.com wrote:
 
  On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
  Oh, so this doesn't just happen when the base backup is first taken;
  *any* time the standby is restarted, it can happen. (!!!)
  
  Yes.
 
 So, to be completely clear, any secondary running the affected
 versions which is started with hot_standby=on could potentially be
 corrupted even if it never connects to a primary?

Yes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

Hi,

On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
 Afaics it's likely a combination/interaction of bugs and fixes between:
 * the initial HS code
 * 5a031a5556ff83b8a9646892715d7fef415b83c3
 * f44eedc3f0f347a856eea8590730769125964597

Yes, the combination of those is guilty.

Man, this is (to a good part my) bad.

 But that'd mean nobody noticed it during 9.3's beta...

It's fairly hard to reproduce artificially since a) there have to be
enough transactions starting and committing from the start of the
checkpoint the standby is starting from to the point it does
LogStandbySnapshot() to cross a 32768 boundary b) hint bits often save
the game by not accessing clog at all anymore and thus not noticing the
corruption.
I've reproduced the issue by having an INSERT ONLY table that's never
read from. It's helpful to disable autovacuum.

Imo something the attached patch should be done. The description I came
up with is:

Fix Hot-Standby initialization of clog and subtrans.

These bugs can cause data loss on standbys started with hot_standby=on
at the moment they start to accept read only queries by marking
committed transactions as uncommited. The likelihood of such
corruptions is small unless the primary has a high transaction rate.

5a031a5556ff83b8a9646892715d7fef415b83c3 fixed bugs in HS's startup
logic by maintaining less state until at least
STANDBY_SNAPSHOT_PENDING state was reached, missing the fact that both
clog and subtrans are written to before that. This only failed to fail
in common cases because the usage of ExtendCLOG in procarray.c was
superflous since clog extensions are actually WAL logged.

f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the
missing extensions of pg_subtrans due to the former commit's changes -
which are not WAL logged - by performing the extensions when switching
to a state  STANDBY_INITIALIZED and not performing xid assignments
before that - again missing the fact that ExtendCLOG is unneccessary -
but screwed up twice: Once because latestObservedXid wasn't updated
anymore in that state due to the earlier commit and once by having an
off-by-one error in the loop performing extensions.
This means that whenever a CLOG_XACTS_PER_PAGE (32768 with default
settings) boundary was crossed between the start of the checkpoint
recovery started from and the first xl_running_xact record old
transactions commit bits in pg_clog could be overwritten if they
started and committed in that window.

Fix this mess by not performing ExtendCLOG() in HS at all anymore
since it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.

Imo this warrants and expedited point release :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
 Imo something the attached patch should be done. The description I came
 up with is:
 
 Fix Hot-Standby initialization of clog and subtrans.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From db5fc631102e2bf3be1034981bedb0b99871a2bb Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 19 Nov 2013 13:12:31 +0100
Subject: [PATCH] Fix Hot-Standby initialization of clog and subtrans.

These bugs can cause data loss on standbys started with hot_standby=on
at the moment they start to accept read only queries by marking
committed transactions as uncommited. The likelihood of such
corruptions is small unless the primary has a high transaction rate.

5a031a5556ff83b8a9646892715d7fef415b83c3 fixed bugs in HS's startup
logic by maintaining less state until at least
STANDBY_SNAPSHOT_PENDING state was reached, missing the fact that both
clog and subtrans are written to before that. This only failed to fail
in common cases because the usage of ExtendCLOG in procarray.c was
superflous since clog extensions are actually WAL logged.

f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the
missing extensions of pg_subtrans due to the former commit's changes -
which are not WAL logged - by performing the extensions when switching
to a state  STANDBY_INITIALIZED and not performing xid assignments
before that - again missing the fact that ExtendCLOG is unneccessary -
but screwed up twice: Once because latestObservedXid wasn't updated
anymore in that state due to the earlier commit and once by having an
off-by-one error in the loop performing extensions.
This means that whenever a CLOG_XACTS_PER_PAGE (32768 with default
settings) boundary was crossed between the start of the checkpoint
recovery started from and the first xl_running_xact record old
transactions commit bits in pg_clog could be overwritten if they
started and committed in that window.

Fix this mess by not performing ExtendCLOG() in HS at all anymore
since it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.
---
 src/backend/access/transam/clog.c   |  2 +-
 src/backend/storage/ipc/procarray.c | 64 ++---
 2 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..6a963b6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -622,7 +622,7 @@ ExtendCLOG(TransactionId newestXact)
 	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
 
 	/* Zero the page and make an XLOG entry about it */
-	ZeroCLOGPage(pageno, !InRecovery);
+	ZeroCLOGPage(pageno, true);
 
 	LWLockRelease(CLogControlLock);
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index c2f86ff..4d0bfb8 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -484,8 +484,9 @@ ProcArrayInitRecovery(TransactionId initializedUptoXID)
 
 	/*
 	 * we set latestObservedXid to the xid SUBTRANS has been initialized upto
-	 * so we can extend it from that point onwards when we reach a consistent
-	 * state in ProcArrayApplyRecoveryInfo().
+	 * so we can extend it from that point onwards in
+	 * RecordKnownAssignedTransactionIds and when we get consistent in
+	 * ProcArrayApplyRecoveryInfo().
 	 */
 	latestObservedXid = initializedUptoXID;
 	TransactionIdRetreat(latestObservedXid);
@@ -661,17 +662,23 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	pfree(xids);
 
 	/*
-	 * latestObservedXid is set to the the point where SUBTRANS was started up
-	 * to, initialize subtrans from thereon, up to nextXid - 1.
+	 * latestObservedXid is at least set to the the point where SUBTRANS was
+	 * started up to (c.f. ProcArrayInitRecovery()) or to the biggest xid
+	 * RecordKnownAssignedTransactionIds() was called for.  Initialize
+	 * subtrans from thereon, up to nextXid - 1.
+	 *
+	 * We need to duplicate parts of RecordKnownAssignedTransactionId() here,
+	 * because we've just added xids to the known assigned xids machinery that
+	 * haven't gone through RecordKnownAssignedTransactionId().
 	 */
 	Assert(TransactionIdIsNormal(latestObservedXid));
+	TransactionIdAdvance(latestObservedXid);
 	while (TransactionIdPrecedes(latestObservedXid, running-nextXid))
 	{
-		ExtendCLOG(latestObservedXid);
 		ExtendSUBTRANS(latestObservedXid);
-
 		TransactionIdAdvance(latestObservedXid);
 	}
+	TransactionIdRetreat(latestObservedXid);  /* = running-nextXid - 1 */
 
 	/* --
 	 * Now we've got the running xids we need to set the global values that
@@ -756,10 +763,6 @@ ProcArrayApplyXidAssignment(TransactionId topxid,
 
 	Assert(standbyState = STANDBY_INITIALIZED);
 
-	/* can't do anything useful unless we have more state setup */
-	if

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andrew Dunstan



On 11/19/2013 09:20 AM, Andres Freund wrote:

Imo this warrants and expedited point release :(



I presume anyone who is vulnerable to it would need to recreate their 
secondary servers to get rid of potential problems?


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote:
 
 On 11/19/2013 09:20 AM, Andres Freund wrote:
 Imo this warrants and expedited point release :(
 
 
 I presume anyone who is vulnerable to it would need to recreate their
 secondary servers to get rid of potential problems?

Yes. There's less expensive ways to do it, but those seem to complicated
to suggest.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Christophe Pettus


On Nov 19, 2013, at 6:59 AM, Andres Freund and...@2ndquadrant.com wrote:

 Yes. There's less expensive ways to do it, but those seem to complicated
 to suggest.

If this is something that could be built into to a tool, acknowledging the 
complexity, I'd be happy to see about building it.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Josh Berkus

On 11/19/2013 06:20 AM, Andres Freund wrote:
 Hi,
 
 On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
 Afaics it's likely a combination/interaction of bugs and fixes between:
 * the initial HS code
 * 5a031a5556ff83b8a9646892715d7fef415b83c3
 * f44eedc3f0f347a856eea8590730769125964597
 
 Yes, the combination of those is guilty.
 
 Man, this is (to a good part my) bad.
 
 But that'd mean nobody noticed it during 9.3's beta...

Ah, so this affected 9.3.0 as well?

Maybe it's worth it now to devise some automated replication testing?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
 On 11/19/2013 06:20 AM, Andres Freund wrote:
  Hi,
  
  On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
  Afaics it's likely a combination/interaction of bugs and fixes between:
  * the initial HS code
  * 5a031a5556ff83b8a9646892715d7fef415b83c3
  * f44eedc3f0f347a856eea8590730769125964597
  
  Yes, the combination of those is guilty.
  
  Man, this is (to a good part my) bad.
  
  But that'd mean nobody noticed it during 9.3's beta...
 
 Ah, so this affected 9.3.0 as well?

Yes.

 Maybe it's worth it now to devise some automated replication testing?

It'd be a good idea, but I am not sure where to get resources for it
from.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Josh Berkus

On 11/19/2013 09:58 AM, Andres Freund wrote:
 On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
 Maybe it's worth it now to devise some automated replication testing?
 
 It'd be a good idea, but I am not sure where to get resources for it
 from.

Well, servers isn't hard ... various cloud companies will donate them.

What would be hard is writing the tests.  I know how they should work,
and the tools to use, but there's about 1,000 to 2,000 lines of code
between here and there.

(This is why we should have encouraged the folks from SFDC and Huawei
who wanted to do testing, instead of blowing them off.)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 10:16:26 -0800, Josh Berkus wrote:
 On 11/19/2013 09:58 AM, Andres Freund wrote:
  On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
  Maybe it's worth it now to devise some automated replication testing?
  
  It'd be a good idea, but I am not sure where to get resources for it
  from.
 
 Well, servers isn't hard ... various cloud companies will donate them.

Yea, I was talking about human/time/financial ones.

 What would be hard is writing the tests.  I know how they should work,
 and the tools to use, but there's about 1,000 to 2,000 lines of code
 between here and there.

I am not sure how much code it's going to take (I'd hope somewhat less),
but it certainly will take some time to agree how it should be built and
then building and integrating it.

 (This is why we should have encouraged the folks from SFDC and Huawei
 who wanted to do testing, instead of blowing them off.)

Haven't noticed/followed anything around that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Christophe Pettus


On Nov 19, 2013, at 10:25 AM, Andres Freund and...@2ndquadrant.com wrote:
 I am not sure how much code it's going to take (I'd hope somewhat less),
 but it certainly will take some time to agree how it should be built and
 then building and integrating it.

Given that the situation we're in right now is that we have an unknown number 
of silently corrupt secondaries out there which will only be discovered when 
someone promotes them to being a primary (possibly because the current primary 
died without a backup), I'd say that this is something pretty urgent.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 10:27:30 -0800, Christophe Pettus wrote:
 
 On Nov 19, 2013, at 10:25 AM, Andres Freund and...@2ndquadrant.com wrote:
  I am not sure how much code it's going to take (I'd hope somewhat less),
  but it certainly will take some time to agree how it should be built and
  then building and integrating it.
 
 Given that the situation we're in right now is that we have an unknown
 number of silently corrupt secondaries out there which will only be
 discovered when someone promotes them to being a primary (possibly
 because the current primary died without a backup), I'd say that this
 is something pretty urgent.

It's pretty unlikely that any automated testing would have cought this,
the required conditions are too unlikely for that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Christophe Pettus


On Nov 19, 2013, at 10:29 AM, Andres Freund and...@2ndquadrant.com wrote:

 It's pretty unlikely that any automated testing would have cought this,
 the required conditions are too unlikely for that.

I would expect that promote secondary while primary is under heavy load is 
clear-cut test case.  What concerns me more is that we don't seem to have a 
framework to put in a regression test on the bug you just found (and thank you 
for finding it so quickly!).


--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Josh Berkus

On 11/19/2013 10:29 AM, Andres Freund wrote:
 It's pretty unlikely that any automated testing would have cought this,
 the required conditions are too unlikely for that.

Given our almost total lack of automated testing for replication, how
would you (or anyone else) possibly know that?  We certainly do know
that we won't catch these kinds of bugs *without* testing.

Heck, even MySQL AB had automated testing for replication.  They largely
ignored the results, but they had it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote:
 
 On Nov 19, 2013, at 10:29 AM, Andres Freund and...@2ndquadrant.com wrote:
 
  It's pretty unlikely that any automated testing would have cought this,
  the required conditions are too unlikely for that.
 
 I would expect that promote secondary while primary is under heavy
 load is clear-cut test case.

That's not sufficient though. It's e.g. very hard to reproduce the issue
using the standard pgbench workload (not enough xids generated, too many
hint bits).

Note that the bug isn't caused by promotion, the problem occurs during
the initial startup of a Hot-Standby standby. If the bug wasn't hit
there, it won't be a problem at promotion.

 What concerns me more is that we don't seem to have a framework to put
 in a regression test on the bug you just found (and thank you for
 finding it so quickly!).

Agreed. But regarding it as a bad situation isn't fixing it
unfortunately.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 10:34:09 -0800, Josh Berkus wrote:
 On 11/19/2013 10:29 AM, Andres Freund wrote:
  It's pretty unlikely that any automated testing would have cought this,
  the required conditions are too unlikely for that.
 
 Given our almost total lack of automated testing for replication, how
 would you (or anyone else) possibly know that?

I've tried reproducing it, and it's not easy.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Josh Berkus

On 11/19/2013 10:40 AM, Andres Freund wrote:
 On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote:
 What concerns me more is that we don't seem to have a framework to put
 in a regression test on the bug you just found (and thank you for
 finding it so quickly!).
 
 Agreed. But regarding it as a bad situation isn't fixing it
 unfortunately.

Well, I happen to have some pieces of such a framework: the parts which
can automate spinning up arbitrarily complex groups of replicas and
doing failover between them.  What we'd still need is:

a) a slightly better workload than pgbench
b) a way to compare and test databases for data corruption of several kinds

Can someone else kick in to help with this?

I think this last issue shows that it's critical as a community to have
such a testing framework in place, otherwise we really need to halt all
work on replication until we have such a thing.  I can't see how you
expect to complete streaming logical replication without a replication
testing framework.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 10:43:14 -0800, Josh Berkus wrote:
 I think this last issue shows that it's critical as a community to have
 such a testing framework in place, otherwise we really need to halt all
 work on replication until we have such a thing.

 I can't see how you expect to complete streaming logical replication
 without a replication testing framework.

That's actually easier to test since you can relatively easily integrate
it into pg_regress and isolationtester - thus the patchset actually
contains tests.
You seem to imply that I/we should do that work? That's a bit onesided,
isn't it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Josh Berkus

On 11/19/2013 10:51 AM, Andres Freund wrote:
 That's actually easier to test since you can relatively easily integrate
 it into pg_regress and isolationtester - thus the patchset actually
 contains tests.
 You seem to imply that I/we should do that work? That's a bit onesided,
 isn't it?

Nope, just that you should be worried about it, and maybe shouting for help.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Christophe Pettus


On Nov 19, 2013, at 10:51 AM, Andres Freund and...@2ndquadrant.com wrote:

 You seem to imply that I/we should do that work?

No, just that it be done.  Of course, the more support from the professional PG 
community that is given to it, the better.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Josh Berkus

On 11/19/2013 06:20 AM, Andres Freund wrote:
 Imo this warrants and expedited point release :(

Yes, I agree.  I'd go further and say it's worth releasing source as
soon as we have it on this one.

What amount of testing were you able to give your patch?  We're trying
to arrange some testing on our side.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Christophe Pettus

Hi, Andres,

From my understanding, the problem only occurs over streaming replication; if 
the secondary was never a hot standby, and only used the archived WAL 
segments, that would be safe.  Is that correct?
--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

Hi,

On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote:
 From my understanding, the problem only occurs over streaming
 replication; if the secondary was never a hot standby, and only used
 the archived WAL segments, that would be safe.  Is that correct?

Not entirely.

It's related to a standby running with hot_standby=on. Both archive
based and streaming replication can be used with hot_standby=on or off.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Andres Freund

On 2013-11-19 15:41:58 -0800, Josh Berkus wrote:
 On 11/19/2013 06:20 AM, Andres Freund wrote:
  Imo this warrants and expedited point release :(
 
 Yes, I agree.  I'd go further and say it's worth releasing source as
 soon as we have it on this one.
 
 What amount of testing were you able to give your patch?  We're trying
 to arrange some testing on our side.

So far I've only tested whether it fixes the corruption I could
reproduce with some effort and that pgbench tables are consistent on the
primary and the standby.
This certainly warrants another look from somebody that knows that
code. Which afair is only Simon and Heikki? I'll also have another look
after sleeping on it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-19 Thread Jeff Frost


On Nov 19, 2013, at 4:05 PM, Andres Freund and...@2ndquadrant.com wrote:

 Hi,
 
 On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote:
 From my understanding, the problem only occurs over streaming
 replication; if the secondary was never a hot standby, and only used
 the archived WAL segments, that would be safe.  Is that correct?
 
 Not entirely.
 
 It's related to a standby running with hot_standby=on. Both archive
 based and streaming replication can be used with hot_standby=on or off.
 

So, does that mean that restoring from PITR based backups from tools like 
barman and wal-e could exhibit the same issue if hot_standby=on was in the 
postgresql.conf?



signature.asc
Description: Message signed with OpenPGP using GPGMail

[HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus

Three times in the last two weeks, we have experience data corruption secondary 
servers using streaming replication on client systems.  The versions involved 
are 9.0.14, 9.2.5, and 9.3.1.  Each incident was separate; two cases they were 
for the same client (9.0.14 and 9.3.1), one for a different client (9.2.5).

The details of each incident are similar, but not identical.

The details of each incident are:

INCDIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using rsync off of 
an existing, correct primary (P1) for the base backup, and using WAL-E for WAL 
segment shipping.  Both the primary and secondary were running 9.0.14.  S1 
properly connected to the primary once the it was caught up on WAL segments, 
and S1 was then promoted as a primary using the trigger file.

No errors in the log files on either system.

After promotion, it was discovered that there was significant data loss on S1.  
Rows that were present on P1 were missing on S1, and some rows were duplicated 
(including duplicates that violated primary key and other unique constraints).  
The indexes were corrupt, in that they seemed to think that the duplicates were 
not duplicated, and that the missing rows were still present.

Because the client's schema included a last_updated field, we were able to 
determine that all of the rows that were either missing or duplicated had been 
updated on P1 shortly (3-5 minutes) before S1 was promoted.  It's possible, but 
not confirmed, that there were active queries (including updates) running on P1 
at the moment of S1's promotion.

As a note, P1 was created from another system (let's call it P0) using just WAL 
shipping (no streaming replication), and no data corruption was observed.

P1 and S1 were both AWS instances running Ubuntu 12.04, using EBS (with xfs as 
the file system) as the data volume.

P1 and S1 have been destroyed at this point.


INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of 
S1y, after having dropped the primary and unique constraints, and restored into 
a new 9.3.1 server, P2.  Duplicate rows were purged, and missing rows were 
added again.  The database, a new primary, was then put back into production, 
and ran without incident.

A new secondary, S2 was created off of the primary.  This secondary was created 
using pg_basebackup using --xlog-method=stream, although the WAL-E archiving 
was still present.

S2 attached to P2 without incident and no errors in the logs, but 
nearly-identical corruption was discovered (although this time without the 
duplicated rows, just missing rows).  At this point, it's not clear if there 
was some clustering in the last_updated timestamp for the rows that are 
missing from S2.  No duplicated rows were observed.

P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as 
the file system) as the data volume.

No errors in the log files on either system.

P2 and S2 are still operational.


INCIDENT #3: 9.2.5 -- A client was migrating a large database from a 9.2.2 
system (P3) to a new 9.2.5 system (S3) using streaming replication.  As I 
personally didn't do the steps on this one, I don't have quite as much 
information, but the basics are close to incident #2: When S3 was promoted 
using the trigger file, no errors were observed and the database came up 
normally, but rows were missing from S3 that were present on P3.

P1 is running Centos 6.3 with ext4 as the file system.

P2 is running Centos 6.4 with ext3 as the file system.

Log shipping in this case was done via rsync.

P3 and S3 are still operational.

No errors in the log files on either system.

--

Obviously, we're very concerned that a bug was introduced in the latest minor 
release.  We're happy to gather data as required to assist in diagnosing this.
--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Josh Berkus

On 11/18/2013 10:58 AM, Christophe Pettus wrote:
 Three times in the last two weeks, we have experience data corruption 
 secondary servers using streaming replication on client systems.  The 
 versions involved are 9.0.14, 9.2.5, and 9.3.1.  Each incident was separate; 
 two cases they were for the same client (9.0.14 and 9.3.1), one for a 
 different client (9.2.5).

To emphasize a salient point: we have not previously seen data
corruption with these exact symptoms prior to the recent patch release.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 10:58 AM, Christophe Pettus x...@thebuild.com wrote:
 As a note, P1 was created from another system (let's call it P0) using just 
 WAL shipping (no streaming replication), and no data corruption was observed.

As another data point, P0 was running 9.0.13, rather than 9.0.14.
--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Andres Freund

Hi,

On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
 INCDIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using
 rsync off of an existing, correct primary (P1) for the base backup,
 and using WAL-E for WAL segment shipping.  Both the primary and
 secondary were running 9.0.14.  S1 properly connected to the primary
 once the it was caught up on WAL segments, and S1 was then promoted as
 a primary using the trigger file.

Could you detail how exactly the base backup was created? Including the
*exact* logic for copying?

 No errors in the log files on either system.

Do you have the log entries for the startup after the base backup?

 Because the client's schema included a last_updated field, we were
 able to determine that all of the rows that were either missing or
 duplicated had been updated on P1 shortly (3-5 minutes) before S1 was
 promoted.  It's possible, but not confirmed, that there were active
 queries (including updates) running on P1 at the moment of S1's
 promotion.

Any chance you have log_checkpoints enabled? If so, could you check
whether there was a checkpoint around the time of the base backup?

This server is gone, right? If not, could you do:
SELECT ctid, xmin, xmax, * FROM whatever WHERE duplicate_row?

 INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of 
 S1y, after having dropped the primary and unique constraints, and restored 
 into a new 9.3.1 server, P2.  Duplicate rows were purged, and missing rows 
 were added again.  The database, a new primary, was then put back into 
 production, and ran without incident.
 
 A new secondary, S2 was created off of the primary.  This secondary was 
 created using pg_basebackup using --xlog-method=stream, although the WAL-E 
 archiving was still present.
 
 S2 attached to P2 without incident and no errors in the logs, but 
 nearly-identical corruption was discovered (although this time without the 
 duplicated rows, just missing rows).  At this point, it's not clear if there 
 was some clustering in the last_updated timestamp for the rows that are 
 missing from S2.  No duplicated rows were observed.
 
 P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as 
 the file system) as the data volume.
 
 No errors in the log files on either system.

Could you list the *exact* steps you did to startup the cluster?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 11:28 AM, Andres Freund and...@2ndquadrant.com wrote:
 Could you detail how exactly the base backup was created? Including the
 *exact* logic for copying?

0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
1. pg_start_backup('', true) on P1.
2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to S2.
3. Once the rsync was complete, pg_stop_backup() on P1.
4. Create appropriate recovery.conf on S1.
5. Bring up PostgreSQL on S1.
6. PostgreSQL recovers normally (pulling WAL segments from WAL-E), and 
eventually connects to P1.
 
 Do you have the log entries for the startup after the base backup?

Sadly, not anymore.

 This server is gone, right?

Correct.

 Could you list the *exact* steps you did to startup the cluster?

0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
1. Initial (empty) data directory deleted on S2.
2. New data directory created with:

/usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress 
--xlog-method=stream --host=ip --user=repluser --pgdata=/data/9.3/main

3. Once the pg_basebackup completed, create appropriate recovery.conf on S1.
4. Bring up PostgreSQL on S2.
5. PostgreSQL recovers normally (pulling a small number of WAL segments from 
WAL-E), and eventually connects to P2.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Andres Freund

On 2013-11-18 11:38:43 -0800, Christophe Pettus wrote:
 
 On Nov 18, 2013, at 11:28 AM, Andres Freund and...@2ndquadrant.com wrote:
  Could you detail how exactly the base backup was created? Including the
  *exact* logic for copying?
 
 0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
 1. pg_start_backup('', true) on P1.
 2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to 
 S2.

Without deleting any data, including pg_xlog/, backup.label, anything?

Did you have hot_standby enabled on all of those machines? Even on the
9.0.13 cluster?

  Could you list the *exact* steps you did to startup the cluster?
 
 0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
 1. Initial (empty) data directory deleted on S2.
 2. New data directory created with:
 
   /usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress 
 --xlog-method=stream --host=ip --user=repluser --pgdata=/data/9.3/main
 
 3. Once the pg_basebackup completed, create appropriate recovery.conf on S1.

That was just recovery command and primary conninfo?


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 11:47 AM, Andres Freund and...@2ndquadrant.com wrote:

 Without deleting any data, including pg_xlog/, backup.label, anything?

Correct.

 Did you have hot_standby enabled on all of those machines? Even on the
 9.0.13 cluster?

Yes.

 That was just recovery command and primary conninfo?

Correct.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 11:47 AM, Andres Freund and...@2ndquadrant.com wrote:

 Did you have hot_standby enabled on all of those machines? Even on the
 9.0.13 cluster?

Actually, it's a bit more complex than this:

1. We don't know about P0, the 9.0.13 machine.  I assume it was, but it was 
managed elsewhere.
2. P1 never had hot_standby = 'on', as it was never intended to be a hot 
stand_by.
3. S1 did have hot_standby = 'on.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 11:47 AM, Andres Freund and...@2ndquadrant.com wrote:
 Without deleting any data, including pg_xlog/, backup.label, anything?

One more correction: After rsync finished and the pg_base_backup() was issued, 
the contents of pg_xlog/ on S1 were deleted.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 12:00 PM, Christophe Pettus x...@thebuild.com wrote:

 One more correction: After rsync finished and the pg_base_backup() was 
 issued, the contents of pg_xlog/ on S1 were deleted.

pg_stop_backup(), sorry.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Andres Freund

On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
 After promotion, it was discovered that there was significant data
 loss on S1.  Rows that were present on P1 were missing on S1, and some
 rows were duplicated (including duplicates that violated primary key
 and other unique constraints).  The indexes were corrupt, in that they
 seemed to think that the duplicates were not duplicated, and that the
 missing rows were still present.

Were there any kind of patterns in the lost data? What kind of workload
are they running? I have an idea what the issue might be...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote:

 Were there any kind of patterns in the lost data? What kind of workload
 are they running? I have an idea what the issue might be...

On the P1  S1 case, the data corrupted was data modified in the last few 
minutes before the switchover.  I don't want to over-analyze, but it was within 
the checkpoint_timeout value for that sever.

On the P2  S2 case, it's less obvious what the pattern is, since there was no 
cutover.

Insufficient information on the P3  S3 case.

Each of them is a reasonably high-volume OLTP-style workload.  The P1/P2 client 
has a very high level of writes; the P3 more read-heavy, but still a fair 
number of writes. 

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Andres Freund

Hi,

Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597

But that'd mean nobody noticed it during 9.3's beta...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus

Great!  If there's any further data I can supply to help, let me know.

On Nov 18, 2013, at 2:15 PM, Andres Freund and...@2ndquadrant.com wrote:

 Hi,
 
 Afaics it's likely a combination/interaction of bugs and fixes between:
 * the initial HS code
 * 5a031a5556ff83b8a9646892715d7fef415b83c3
 * f44eedc3f0f347a856eea8590730769125964597
 
 But that'd mean nobody noticed it during 9.3's beta...
 
 Greetings,
 
 Andres Freund
 
 -- 
 Andres Freundhttp://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Andres Freund

On 2013-11-18 14:25:58 -0800, Christophe Pettus wrote:
 Great!  If there's any further data I can supply to help, let me know.

Trying to reproduce the issue with and without hot_standby=on would be
very helpful, but I guess that's time consuming?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1

2013-11-18 Thread Christophe Pettus


On Nov 18, 2013, at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote:

 Trying to reproduce the issue with and without hot_standby=on would be
 very helpful, but I guess that's time consuming?

I've been working on it, but I haven't gotten it to fail yet.  I'll keep at it.

--
-- Christophe Pettus
   x...@thebuild.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

78 matches

Mail list logo