On 2013-11-22 15:01:10 +0200, Heikki Linnakangas wrote:
> On 21.11.2013 22:55, Andres Freund wrote:
> >On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
> >>Looks ok for a back-patchable fix.
> >
> >Do you plan to commit this? Or who is going to?
>
> Ok, committed.
Thanks!
Greetings,
Andres Freund
On 21.11.2013 22:55, Andres Freund wrote:
On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
On 19.11.2013 16:22, Andres Freund wrote:
On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
Imo something like the attached patch should be done. The description I came
up with is:
Fix Hot-Standby initialization of clog and subtrans.
On 19.11.2013 16:20, Andres Freund wrote:
On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597
Yes, the combination of those is guilty.
On 2013-11-21 23:02:29 +0200, Heikki Linnakangas wrote:
> On 21.11.2013 22:53, Andres Freund wrote:
> >On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
> >>On 11/21/2013 12:46 PM, Andres Freund wrote:
> >>>The problem is starting with hot_standby=on on a system with
> >>>recovery.conf present. It is independent of whether you use streaming
> >>>replication or archive based recovery.
On 21.11.2013 22:55, Andres Freund wrote:
Hi,
On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
On 19.11.2013 16:22, Andres Freund wrote:
On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
Imo something like the attached patch should be done. The description I came
up with is:
Fix Hot-Standby initialization of clog and subtrans.
On 21.11.2013 22:53, Andres Freund wrote:
On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
On 11/21/2013 12:46 PM, Andres Freund wrote:
The problem is starting with hot_standby=on on a system with
recovery.conf present. It is independent of whether you use streaming
replication or archive based recovery.
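For reference, whether a given node is in that state can be checked once it
accepts read-only connections; a minimal sketch, assuming a 9.0+ standby and
the stock function/GUC names:

    -- Run on the suspect node once it accepts read-only connections.
    -- pg_is_in_recovery() = true means it is still replaying WAL
    -- (i.e. it was started with a recovery.conf present); hot_standby
    -- must be 'on' for read-only connections to work at all.
    SELECT pg_is_in_recovery() AS in_recovery,
           current_setting('hot_standby') AS hot_standby;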
Hi,
On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
> On 19.11.2013 16:22, Andres Freund wrote:
> >On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> >>Imo something like the attached patch should be done. The description I came
> >>up with is:
> >>
> >> Fix Hot-Standby initialization of clog and subtrans.
On 2013-11-21 12:51:17 -0800, Josh Berkus wrote:
> On 11/21/2013 12:46 PM, Andres Freund wrote:
> > No. Check
> > http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de
> >
> > The problem is starting with hot_standby=on on a system with
> > recovery.conf present. It is independent of whether you use streaming
> > replication or archive based recovery.
On 11/21/2013 12:46 PM, Andres Freund wrote:
> No. Check
> http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de
>
> The problem is starting with hot_standby=on on a system with
> recovery.conf present. It is independent of whether you use streaming
> replication or archive based recovery.
On 2013-11-21 12:39:54 -0800, Josh Berkus wrote:
> On 11/21/2013 12:36 PM, Joshua D. Drake wrote:
> >
> > Hello,
> >
> > This is turning into a rather large thread and I have a simple question:
> >
> > Is a work-around to this problem as simple as disabling streaming
> > replication and enabling log shipping instead?
Hello,
This is turning into a rather large thread and I have a simple question:
Is a work-around to this problem as simple as disabling streaming
replication and enabling log shipping instead?
Joshua D. Drake
--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL
On 11/21/2013 12:36 PM, Joshua D. Drake wrote:
>
> Hello,
>
> This is turning into a rather large thread and I have a simple question:
>
> Is a work-around to this problem as simple as disabling streaming
> replication and enabling log shipping instead?
Yes, and re-cloning the replica, in case
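For reference, on 9.1 and later the standbys currently using streaming (and
thus the ones that would need to be switched to log shipping and re-cloned)
can be listed on the primary; a sketch, assuming the stock pg_stat_replication
view (not available on 9.0):

    -- Run on the primary: each row is a walsender feeding a streaming standby.
    SELECT client_addr, state, sync_state
    FROM pg_stat_replication;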
On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund wrote:
>
> I don't think so - for one, pg_subtrans isn't really the problem with
> that bug, for another, it won't cause missing files. Also, you're not
> using replication, right?
>
Actually, this server is a master being replicated to a standby.
> Well, I happen to have some pieces of such a framework: the parts which
> can automate spinning up arbitrarily complex groups of replicas and
> doing failover between them. What we'd still need is:
>
> a) a slightly better workload than pgbench
> b) a way to compare and test databases for data
On 2013-11-21 11:54:58 -0500, J Smith wrote:
> On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund
> wrote:
> >
> > I don't think so - for one, pg_subtrans isn't really the problem with
> > that bug, for another, it won't cause missing files. Also, you're not
> > using replication, right?
> >
>
> Actually, this server is a master being replicated to a standby.
On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund wrote:
> On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
>> Imo something like the attached patch should be done. The description I came
>> up with is:
>>
>> Fix Hot-Standby initialization of clog and subtrans.
>
G'day Andres.
This wouldn't happen
On 2013-11-21 10:25:20 -0500, J Smith wrote:
> On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund wrote:
> > On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> >> Imo something like the attached patch should be done. The description I came
> >> up with is:
> >>
> >> Fix Hot-Standby initialization of clog and subtrans.
On Thu, Nov 21, 2013 at 12:23 PM, Andres Freund wrote:
>
> It's still not this issue in that case, but I might have an idea... Do
> you have hot_standby_feedback enabled?
>
Nope, hot_standby_feedback is set to its default setting which is off.
At any rate, I'm going to try and capture a backtrace.
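For reference, the setting in question can be confirmed on the standby itself;
a trivial sketch using the stock GUC name (9.1 or later; the parameter does
not exist in 9.0):

    -- Run on the standby; 'off' is the default.
    SHOW hot_standby_feedback;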
On 2013-11-20 17:46:22 +0100, Andres Freund wrote:
> On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote:
> > Isn't it possible that the standby has already incorrectly set
> > HEAP_XMIN_INVALID hint bit on a page? The full page images generated by
> > VACUUM FREEZE will correct the damage, but if not, e.g. because
> > full_page_writes=off, strange things could happen.
On 2013-11-20 16:36:46 -0800, Christophe Pettus wrote:
>
> On Nov 20, 2013, at 3:57 PM, Andres Freund wrote:
>
> > On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
> >> Oh, so this doesn't just happen when the base backup is first taken;
> >> *any* time the standby is restarted, it can happen. (!!!)
On Nov 20, 2013, at 3:57 PM, Andres Freund wrote:
> On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
>> Oh, so this doesn't just happen when the base backup is first taken;
>> *any* time the standby is restarted, it can happen. (!!!)
>
> Yes.
So, to be completely clear, any secondary running t
On 2013-11-20 15:52:22 -0800, Josh Berkus wrote:
> Andres,
>
> > Every time the server in HS mode allows connections ("consistent recovery
> > state reached at ..." and "database system is ready to accept read only
> > connections" in the log), the bug can be triggered. If there weren't too
>
Andres,
> Every time the server in HS mode allows connections ("consistent recovery state
> reached at ..." and "database system is ready to accept read only
> connections" in the log), the bug can be triggered. If there weren't too
> many transactions at that point, the problem won't occur until t
On 2013-11-20 10:48:41 -0800, Josh Berkus wrote:
> > Presumably a replica created while all traffic was halted on the master
> > would be clean, correct? This bug can only be triggered if there's
> > heavy write load on the master, right?
Kinda. It's unfortunately necessary to understand how HS works.
On 11/20/2013 10:30 AM, Josh Berkus wrote:
> Andrews, Kevin:
Andres, that is.
>
> Presumably a replica created while all traffic was halted on the master
> would be clean, correct? This bug can only be triggered if there's
> heavy write load on the master, right?
>
Also, just to verify:
If s
Andrews, Kevin:
Presumably a replica created while all traffic was halted on the master
would be clean, correct? This bug can only be triggered if there's
heavy write load on the master, right?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote:
> Isn't it possible that the standby has already incorrectly set
> HEAP_XMIN_INVALID hint bit on a page? The full page images generated by
> VACUUM FREEZE will correct the damage, but if not, e.g. because
> full_page_writes=off, strange things could happen.
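Since the repair-by-full-page-image argument above only holds when full page
writes are being emitted, a quick check on the primary (stock GUC name) looks
like this:

    -- 'on' is the default; with 'off' the FPIs from VACUUM FREEZE
    -- cannot be relied on to overwrite damaged standby pages.
    SHOW full_page_writes;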
On 20.11.2013 17:06, Kevin Grittner wrote:
Andres Freund wrote:
On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
So as long as there are no open transactions or prepared
transactions on the master which started before the release with
the fix is applied, VACUUM FREEZE would be guaranteed to work?
On 2013-11-20 07:06:04 -0800, Kevin Grittner wrote:
> > That's not a bad point. So the way to fix it would be:
> >
> > 1) Restart the standby to the new minor release, wait for catchup
> > 2) Restart the primary (fast or smart) to the new minor release
> > 3) Acquire enough new xids to make sure w
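A rough sketch of how steps 1 and 3 above could be verified from SQL, assuming
9.0-9.3 era function names (how many xids need to be consumed depends on the
situation and is not spelled out here):

    -- Step 1, on the standby: what has been received and replayed so far.
    SELECT pg_last_xlog_receive_location() AS received,
           pg_last_xlog_replay_location()  AS replayed;
    -- Compare against the primary's current position:
    --   SELECT pg_current_xlog_location();   -- run on the primary

    -- Step 3, on the primary: each call, in its own transaction,
    -- consumes one transaction id.
    SELECT txid_current();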
Andres Freund wrote:
> On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
>> So as long as there are no open transactions or prepared
>> transactions on the master which started before the release with
>> the fix is applied, VACUUM FREEZE would be guaranteed to work?
>> Since I don't see h
On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
> Andres Freund wrote:
> > On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
>
> >> I don't understand where that would make sense; especially since
> >> I thought that a database FREEZE followed by a checkpoint
>> releases old clog space anyway.
Andres Freund wrote:
> On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
>> I don't understand where that would make sense; especially since
>> I thought that a database FREEZE followed by a checkpoint
>> releases old clog space anyway.
>
> It only releases them up to the (cluster wide) xmin horizon.
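A sketch of the sequence under discussion, run on the primary with the stock
catalog views; the point is to confirm that nothing old is holding the xmin
horizon back before freezing:

    -- Long-running transactions hold the cluster-wide xmin horizon back:
    SELECT datname, usename, xact_start
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start;

    -- So do prepared transactions:
    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts;

    -- Then, in each database:
    VACUUM FREEZE;
    CHECKPOINT;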
On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
> Andres Freund wrote:
> > On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:
>
> >> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged
> >> writing of everything that doesn't yet have hint bits set?
> >
> > Besides also being pretty expensive it still wouldn't correct the clog.
Andres Freund wrote:
> On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:
>> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged
>> writing of everything that doesn't yet have hint bits set?
>
> Besides also being pretty expensive it still wouldn't correct the
> clog - and we don't always
On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote:
> > Yes. There's less expensive ways to do it, but those seem too
> > complicated to suggest.
>
> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged writing
> of everything that doesn't yet have hint bits set?
Besides also being pretty expensive it still wouldn't correct the clog.
Andres Freund wrote:
> On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote:
>>
>> On 11/19/2013 09:20 AM, Andres Freund wrote:
>> >Imo this warrants an expedited point release :(
+1
>> I presume anyone who is vulnerable to it would need to recreate
>> their secondary servers to get rid of potential problems?
On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
> On 19.11.2013 16:22, Andres Freund wrote:
> >On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> >>Imo something like the attached patch should be done. The description I came
> >>up with is:
> >>
> >> Fix Hot-Standby initialization of clog and subtrans.
On 19.11.2013 16:22, Andres Freund wrote:
On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
Imo something like the attached patch should be done. The description I came
up with is:
Fix Hot-Standby initialization of clog and subtrans.
Looks ok for a back-patchable fix.
It's a bit bizarre tha
On Nov 19, 2013, at 4:05 PM, Andres Freund wrote:
> Hi,
>
> On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote:
>> From my understanding, the problem only occurs over streaming
>> replication; if the secondary was never a hot standby, and only used
>> the archived WAL segments, that would be safe. Is that correct?
On 2013-11-19 15:41:58 -0800, Josh Berkus wrote:
> On 11/19/2013 06:20 AM, Andres Freund wrote:
> > Imo this warrants an expedited point release :(
>
> Yes, I agree. I'd go further and say it's worth releasing source as
> soon as we have it on this one.
>
> What amount of testing were you able to give your patch?
Hi,
On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote:
> From my understanding, the problem only occurs over streaming
> replication; if the secondary was never a hot standby, and only used
> the archived WAL segments, that would be safe. Is that correct?
Not entirely.
It's related to a sta
Hi, Andres,
From my understanding, the problem only occurs over streaming replication; if
the secondary was never a hot standby, and only used the archived WAL
segments, that would be safe. Is that correct?
--
-- Christophe Pettus
x...@thebuild.com
On 11/19/2013 06:20 AM, Andres Freund wrote:
> Imo this warrants an expedited point release :(
Yes, I agree. I'd go further and say it's worth releasing source as
soon as we have it on this one.
What amount of testing were you able to give your patch? We're trying
to arrange some testing on ou
On Nov 19, 2013, at 10:51 AM, Andres Freund wrote:
> You seem to imply that I/we should do that work?
No, just that it be done. Of course, the more support from the professional PG
community that is given to it, the better.
--
-- Christophe Pettus
x...@thebuild.com
On 11/19/2013 10:51 AM, Andres Freund wrote:
> That's actually easier to test since you can relatively easily integrate
> it into pg_regress and isolationtester - thus the patchset actually
> contains tests.
> You seem to imply that I/we should do that work? That's a bit one-sided,
> isn't it?
Nope
On 2013-11-19 10:43:14 -0800, Josh Berkus wrote:
> I think this last issue shows that it's critical as a community to have
> such a testing framework in place, otherwise we really need to halt all
> work on replication until we have such a thing.
> I can't see how you expect to complete streaming
On 11/19/2013 10:40 AM, Andres Freund wrote:
> On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote:
>> What concerns me more is that we don't seem to have a framework to put
>> in a regression test on the bug you just found (and thank you for
>> finding it so quickly!).
>
> Agreed. But regarding
On 2013-11-19 10:34:09 -0800, Josh Berkus wrote:
> On 11/19/2013 10:29 AM, Andres Freund wrote:
> > It's pretty unlikely that any automated testing would have caught this,
> > the required conditions are too unlikely for that.
>
> Given our almost total lack of automated testing for replication, how
> would you (or anyone else) possibly know that?
On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote:
>
> On Nov 19, 2013, at 10:29 AM, Andres Freund wrote:
>
> > It's pretty unlikely that any automated testing would have caught this,
> > the required conditions are too unlikely for that.
>
> I would expect that "promote secondary while pri
On 11/19/2013 10:29 AM, Andres Freund wrote:
> It's pretty unlikely that any automated testing would have caught this,
> the required conditions are too unlikely for that.
Given our almost total lack of automated testing for replication, how
would you (or anyone else) possibly know that? We certa
On Nov 19, 2013, at 10:29 AM, Andres Freund wrote:
> It's pretty unlikely that any automated testing would have caught this,
> the required conditions are too unlikely for that.
I would expect that "promote secondary while primary is under heavy load" is
clear-cut test case. What concerns me
On 2013-11-19 10:27:30 -0800, Christophe Pettus wrote:
>
> On Nov 19, 2013, at 10:25 AM, Andres Freund wrote:
> > I am not sure how much code it's going to take (I'd hope somewhat less),
> > but it certainly will take some time to agree how it should be built and
> > then building and integrating it.
On Nov 19, 2013, at 10:25 AM, Andres Freund wrote:
> I am not sure how much code it's going to take (I'd hope somewhat less),
> but it certainly will take some time to agree how it should be built and
> then building and integrating it.
Given that the situation we're in right now is that we have
On 2013-11-19 10:16:26 -0800, Josh Berkus wrote:
> On 11/19/2013 09:58 AM, Andres Freund wrote:
> > On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
> >> Maybe it's worth it now to devise some automated replication testing?
> >
> > It'd be a good idea, but I am not sure where to get resources for it from.
On 11/19/2013 09:58 AM, Andres Freund wrote:
> On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
>> Maybe it's worth it now to devise some automated replication testing?
>
> It'd be a good idea, but I am not sure where to get resources for it
> from.
Well, servers isn't hard ... various cloud comp
On 2013-11-19 09:51:28 -0800, Josh Berkus wrote:
> On 11/19/2013 06:20 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
> >> Afaics it's likely a combination/interaction of bugs and fixes between:
> >> * the initial HS code
> >> * 5a031a5556ff83b8a9646892715d7fef415b83c3
On 11/19/2013 06:20 AM, Andres Freund wrote:
> Hi,
>
> On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
>> Afaics it's likely a combination/interaction of bugs and fixes between:
>> * the initial HS code
>> * 5a031a5556ff83b8a9646892715d7fef415b83c3
>> * f44eedc3f0f347a856eea8590730769125964597
On Nov 19, 2013, at 6:59 AM, Andres Freund wrote:
> Yes. There's less expensive ways to do it, but those seem too complicated
> to suggest.
If this is something that could be built into a tool, acknowledging the
complexity, I'd be happy to see about building it.
--
-- Christophe Pettus
x...@thebuild.com
On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote:
>
> On 11/19/2013 09:20 AM, Andres Freund wrote:
> >Imo this warrants an expedited point release :(
>
>
> I presume anyone who is vulnerable to it would need to recreate their
> secondary servers to get rid of potential problems?
Yes. There's less expensive ways to do it, but those seem too complicated to suggest.
On 11/19/2013 09:20 AM, Andres Freund wrote:
Imo this warrants an expedited point release :(
I presume anyone who is vulnerable to it would need to recreate their
secondary servers to get rid of potential problems?
cheers
andrew
On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> Imo something like the attached patch should be done. The description I came
> up with is:
>
> Fix Hot-Standby initialization of clog and subtrans.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
> Afaics it's likely a combination/interaction of bugs and fixes between:
> * the initial HS code
> * 5a031a5556ff83b8a9646892715d7fef415b83c3
> * f44eedc3f0f347a856eea8590730769125964597
Yes, the combination of those is guilty.
Man, this is
On Nov 18, 2013, at 2:26 PM, Andres Freund wrote:
> Trying to reproduce the issue with and without hot_standby=on would be
> very helpful, but I guess that's time consuming?
I've been working on it, but I haven't gotten it to fail yet. I'll keep at it.
--
-- Christophe Pettus
x...@thebuild.com
On 2013-11-18 14:25:58 -0800, Christophe Pettus wrote:
> Great! If there's any further data I can supply to help, let me know.
Trying to reproduce the issue with and without hot_standby=on would be
very helpful, but I guess that's time consuming?
Greetings,
Andres Freund
--
Andres Freund
Great! If there's any further data I can supply to help, let me know.
On Nov 18, 2013, at 2:15 PM, Andres Freund wrote:
> Hi,
>
> Afaics it's likely a combination/interaction of bugs and fixes between:
> * the initial HS code
> * 5a031a5556ff83b8a9646892715d7fef415b83c3
> * f44eedc3f0f347a856eea8590730769125964597
Hi,
Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597
But that'd mean nobody noticed it during 9.3's beta...
Greetings,
Andres Freund
--
Andres Freund
On Nov 18, 2013, at 12:57 PM, Andres Freund wrote:
> Were there any kind of patterns in the lost data? What kind of workload
> are they running? I have an idea what the issue might be...
On the P1 > S1 case, the data corrupted was data modified in the last few
minutes before the switchover. I
On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
> After promotion, it was discovered that there was significant data
> loss on S1. Rows that were present on P1 were missing on S1, and some
> rows were duplicated (including duplicates that violated primary key
> and other unique constraints)
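For what it's worth, duplicates of the kind described (rows violating a
supposedly unique key) can be enumerated with a query along these lines; the
table and column names here are placeholders, not from the affected system:

    -- Hypothetical names; substitute the real table and key column(s).
    SELECT id, count(*)
    FROM some_table
    GROUP BY id
    HAVING count(*) > 1;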
On Nov 18, 2013, at 12:00 PM, Christophe Pettus wrote:
> One more correction: After rsync finished and the pg_base_backup() was
> issued, the contents of pg_xlog/ on S1 were deleted.
pg_stop_backup(), sorry.
--
-- Christophe Pettus
x...@thebuild.com
On Nov 18, 2013, at 11:47 AM, Andres Freund wrote:
> Without deleting any data, including pg_xlog/, backup.label, anything?
One more correction: After rsync finished and the pg_base_backup() was issued,
the contents of pg_xlog/ on S1 were deleted.
--
-- Christophe Pettus
x...@thebuild.com
On Nov 18, 2013, at 11:47 AM, Andres Freund wrote:
> Did you have hot_standby enabled on all of those machines? Even on the
> 9.0.13 cluster?
Actually, it's a bit more complex than this:
1. We don't know about P0, the 9.0.13 machine. I assume it was, but it was
managed elsewhere.
2. P1 never
On Nov 18, 2013, at 11:47 AM, Andres Freund wrote:
> Without deleting any data, including pg_xlog/, backup.label, anything?
Correct.
> Did you have hot_standby enabled on all of those machines? Even on the
> 9.0.13 cluster?
Yes.
> That was just recovery command and primary conninfo?
Correct
On 2013-11-18 11:38:43 -0800, Christophe Pettus wrote:
>
> On Nov 18, 2013, at 11:28 AM, Andres Freund wrote:
> > Could you detail how exactly the base backup was created? Including the
> > *exact* logic for copying?
>
> 0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
> 1. pg_start_backup('', true) on P1.
On Nov 18, 2013, at 11:28 AM, Andres Freund wrote:
> Could you detail how exactly the base backup was created? Including the
> *exact* logic for copying?
0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
1. pg_start_backup('', true) on P1.
2. Using rsync -av on P1, the entire data directory was copied to S1.
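For reference, the low-level sequence described boils down to roughly the
following; a sketch of the 9.0-era exclusive backup protocol, with the copy
step and the backup label as placeholders:

    -- On P1, before copying the data directory:
    SELECT pg_start_backup('clone S1', true);
    -- (file-level copy of the data directory to S1 happens here,
    --  e.g. rsync -av, outside of SQL)
    -- On P1, once the copy has finished:
    SELECT pg_stop_backup();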
Hi,
On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
> INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using
> rsync off of an existing, correct primary (P1) for the base backup,
> and using WAL-E for WAL segment shipping. Both the primary and
> secondary were running 9.0.14.
On Nov 18, 2013, at 10:58 AM, Christophe Pettus wrote:
> As a note, P1 was created from another system (let's call it P0) using just
> WAL shipping (no streaming replication), and no data corruption was observed.
As another data point, P0 was running 9.0.13, rather than 9.0.14.
--
-- Christophe
On 11/18/2013 10:58 AM, Christophe Pettus wrote:
> Three times in the last two weeks, we have experienced data corruption on
> secondary servers using streaming replication on client systems. The
> versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate;
> in two cases they were for the same client (9.0.14 and 9.3.1), one for a
> different client.
Three times in the last two weeks, we have experienced data corruption on
secondary servers using streaming replication on client systems. The versions
involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; in two
cases they were for the same client (9.0.14 and 9.3.1), one for a different
client.