[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Thomas Munro
On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Thomas Munro wrote: Thanks. As mentioned elsewhere in the thread, I discovered that the same problem exists for page boundaries, with a different error message. I've tried the attached repro scripts on 9.3.0,

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Noah Misch
On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: Here's a new version with some more fixes and improvements: I read through this version and found nothing to change. I encourage other hackers to study the patch, though. The surrounding code is challenging. With this version, I'm

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Thomas Munro
On Fri, Jun 5, 2015 at 1:47 PM, Thomas Munro thomas.mu...@enterprisedb.com wrote: On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro thomas.mu...@enterprisedb.com wrote: On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas robertmh...@gmail.com wrote: Here's a new version with some more fixes and

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch n...@leadboat.com wrote: On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: Here's a new version with some more fixes and improvements: I read through this version and found nothing to change. I encourage other hackers to study the patch,

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Alvaro Herrera
Robert Haas wrote: On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch n...@leadboat.com wrote: On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: Here's a new version with some more fixes and improvements: I read through this version and found nothing to change. I encourage other

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro thomas.mu...@enterprisedb.com wrote: On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas robertmh...@gmail.com wrote: Here's a new version with some more fixes and improvements: - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset when

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 5:29 PM, Robert Haas robertmh...@gmail.com wrote: - Forces aggressive autovacuuming when the control file's oldestMultiXid doesn't point to a valid MultiXact and enables member wraparound at the next checkpoint following the correction of that problem. Err, enables

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas robertmh...@gmail.com wrote: Thanks for the review. Here's a new version. I've fixed the things Alvaro and Noah noted, and some compiler warnings about set but unused

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas robertmh...@gmail.com wrote: Here's a new version with some more fixes and improvements: - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset when the oldest offset became known if the now-known value happened to be zero. Fixed.

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Andres Freund
Hi, On 2015-06-04 12:57:42 -0400, Robert Haas wrote: + /* + * Do we need an emergency autovacuum? If we're not sure, assume yes. + */ + return !oldestOffsetKnown || + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD); I think without teaching
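
The check quoted in this message compares a distance between two member offsets against a threshold. A minimal C sketch of that idea follows. It assumes the offsets are 32-bit unsigned counters that may wrap and that the threshold is half of that space. The typedef, the macro `SAFE_THRESHOLD_SKETCH`, and the function `needs_emergency_autovacuum` are illustrative names, not the patch's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for a 32-bit unsigned member offset counter. */
typedef uint32_t MultiXactOffset;

/* Assumed threshold for this sketch: half of the 32-bit offset space. */
#define SAFE_THRESHOLD_SKETCH (UINT32_MAX / 2)

/*
 * Hypothetical helper mirroring the quoted condition: if the oldest offset
 * is not known, err on the side of vacuuming; otherwise compare the
 * distance between the two offsets against the threshold.
 */
static bool
needs_emergency_autovacuum(bool oldest_offset_known,
                           MultiXactOffset next_offset,
                           MultiXactOffset oldest_offset)
{
    /* Unsigned subtraction is modulo 2^32, so it stays correct across wrap. */
    MultiXactOffset used = next_offset - oldest_offset;

    return !oldest_offset_known || used > SAFE_THRESHOLD_SKETCH;
}
```

With these assumptions, a next offset of 5 and an oldest offset of `UINT32_MAX - 4` are only 10 entries apart, so the helper reports no emergency even though the counter has wrapped.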

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-04 12:57:42 -0400, Robert Haas wrote: + /* + * Do we need an emergency autovacuum? If we're not sure, assume yes. + */ + return !oldestOffsetKnown || + (nextOffset - oldestOffset

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Alvaro Herrera
Alvaro Herrera wrote: Robert Haas wrote: So here's a patch taking a different approach. I tried to apply this to 9.3 but it's messy because of pgindent. Anyone would have a problem with me backpatching a pgindent run of multixact.c? Done.

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas robertmh...@gmail.com wrote: Thanks for the review. Here's a new version. I've fixed the things Alvaro and Noah noted, and some compiler warnings about set but unused variables. I also tested it, and it doesn't quite work as hoped. If started on a

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Noah Misch
On Wed, Jun 03, 2015 at 04:53:46PM -0400, Robert Haas wrote: So here's a patch taking a different approach. In this approach, if the multixact whose members we want to look up doesn't exist, we don't use a later one (that might or might not be valid). Instead, we attempt to cope with the

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 2:42 AM, Noah Misch n...@leadboat.com wrote: I like that change a lot. It's much easier to seek forgiveness for wasting <= 28 GiB of disk than for deleting visibility information wrongly. I'm glad you like it. I concur. 2. If setting the offset stop limit (the point

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote: On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote: One idea I had was: what if the oldestMulti pointed to another multi earlier in the same 0046 file, so that it is read-as-zeroes (and the file is created), and then a subsequent multixact truncate tries to read a

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund and...@anarazel.de wrote: Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning the disk it'll always get one at a segment boundary, right? I'm not sure

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch n...@leadboat.com wrote: While testing this (with inconsistent-multixact-fix-master.patch applied, FWIW), I noticed a nearby bug with a similar symptom. TruncateMultiXact() omits the nextMXact==oldestMXact special case found in each other

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Robert Haas wrote: So here's a patch taking a different approach. I tried to apply this to 9.3 but it's messy because of pgindent. Anyone would have a problem with me backpatching a pgindent run of multixact.c? Also, you have a new function SlruPageExists, but we already have

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Alvaro Herrera wrote: Really, the whole question of how this code goes past the open() failure in SlruPhysicalReadPage baffles me. I don't see any possible way for the file to be created ... Hmm, the checkpointer can call TruncateMultiXact when in recovery, on restartpoints. I wonder if in

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote: Andres Freund wrote: That's not necessarily the case though, given how the code currently works. In a bunch of places the SLRUs are accessed *before* having been made consistent by WAL replay. Especially if several checkpoints/vacuums

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote: Thomas Munro wrote: On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: My guess is that the file existed, and perhaps had one or more pages, but the wanted page doesn't exist, so we tried to read but got 0

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Thomas Munro wrote: I have finally reproduced that error! See attached repro shell script. The conditions are: 1. next multixact == oldest multixact (no active multixacts, pointing past the end) 2. next multixact would be the first item on a new page (multixact % 2048 == 0) 3. the
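
Conditions 1 and 2 quoted in this message come down to simple arithmetic on the multixact counters. The sketch below assumes 8192-byte pages and 4-byte offset entries, which gives the 2048 entries per page implied by the `% 2048` test. All names in it are hypothetical, and the third condition is truncated in this excerpt, so it is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 32-bit multixact identifier. */
typedef uint32_t MultiXactId;

#define PAGE_SIZE_SKETCH  8192u  /* assumed block size */
#define ENTRY_SIZE_SKETCH 4u     /* assumed size of one offsets entry */
#define ENTRIES_PER_PAGE  (PAGE_SIZE_SKETCH / ENTRY_SIZE_SKETCH)  /* 2048 */

/* Page of the offsets area that would hold the entry for this multixact. */
static uint32_t
offsets_page_of(MultiXactId multi)
{
    return multi / ENTRIES_PER_PAGE;
}

/* Condition 2: the multixact would be the first item on a new page. */
static bool
is_first_on_page(MultiXactId multi)
{
    return multi % ENTRIES_PER_PAGE == 0;
}

/* Conditions 1 and 2 together: no active multixacts, next one starts a page. */
static bool
matches_repro_conditions(MultiXactId next_multi, MultiXactId oldest_multi)
{
    return next_multi == oldest_multi && is_first_on_page(next_multi);
}
```

Under these assumptions, `next == oldest == 4096` satisfies both conditions, because the page that would hold entry 4096 has not necessarily been written yet.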

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote: On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote: Thomas Munro wrote: On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: My guess is that the file existed, and perhaps had one or more pages, but the wanted page doesn't exist, so

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Thomas Munro
On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: My guess is that the file existed, and perhaps had one or more pages, but the wanted page doesn't exist, so we tried to read but got 0 bytes back. read() returns 0 in this case but doesn't set errno. I didn't find

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Alvaro Herrera
Thomas Munro wrote: On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: My guess is that the file existed, and perhaps had one or more pages, but the wanted page doesn't exist, so we tried to read but got 0 bytes back. read() returns 0 in this case but doesn't

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote: On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund and...@anarazel.de wrote: The lack of WAL logging actually has caused problems in the 9.3.3 (?) era, where we didn't do any truncation during recovery... Right, but now we're piggybacking on the

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch n...@leadboat.com wrote: On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch n...@leadboat.com wrote: On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: SetMultiXactIdLimit() bracketed

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:29:24 -0400, Robert Haas wrote: On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund and...@anarazel.de wrote: But what *definitely* looks wrong to me is that a TruncateMultiXact() in this scenario now (since a couple weeks ago) does a SimpleLruReadPage_ReadOnly() in the members

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-02 11:16:22 -0400, Robert Haas wrote: I'm having trouble figuring out what to do about this. I mean, the essential principle of this patch is that if we can't count on relminmxid, datminmxid, or the control

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:16:22 -0400, Robert Haas wrote: I'm having trouble figuring out what to do about this. I mean, the essential principle of this patch is that if we can't count on relminmxid, datminmxid, or the control file to be accurate, we can at least look at what is present on the disk.

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund and...@anarazel.de wrote: But what *definitely* looks wrong to me is that a TruncateMultiXact() in this scenario now (since a couple weeks ago) does a SimpleLruReadPage_ReadOnly() in the members slru via find_multixact_start(). That just won't work

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund and...@anarazel.de wrote: That would be a departure from the behavior of every existing release that includes this code based on, to my knowledge, zero trouble reports. On the other hand we're now at about bug #5 attributable to the odd way

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Noah Misch
On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote: On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch n...@leadboat.com wrote: On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: Granted. Would it be better to update both functions at the same time, and perhaps to make that a

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:37:02 -0400, Robert Haas wrote: The exact circumstances under which we're willing to replace a relminmxid with a newly-computed one that differs are not altogether clear to me, but there's an if statement protecting that logic, so there are some circumstances in which we'll

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-02 11:37:02 -0400, Robert Haas wrote: The exact circumstances under which we're willing to replace a relminmxid with a newly-computed one that differs are not altogether clear to me, but there's an if statement

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Andres Freund
On 2015-05-31 07:51:59 -0400, Robert Haas wrote: 1) We continue determining the oldest SlruScanDirectory(SlruScanDirCbFindEarliest) on the master to find the oldest offsets segment to truncate. Alternatively, if we determine it to be safe, we could use oldestMulti to find that.

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch n...@leadboat.com wrote: Incomplete review, done in a relative rush: Thanks. On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: OK, here's a patch. Actually two patches, differing only in whitespace, for 9.3 and for master (ha!). I now

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund and...@anarazel.de wrote: I'm probably biased here, but I think we should finish reviewing, testing, and committing my patch before we embark on designing this. Probably, yes. I am wondering whether doing this immediately won't end up making some

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Thomas Munro wrote: - There's a third possible problem related to boundary cases in SlruScanDirCbRemoveMembers, but I don't understand that one well enough to explain it. Maybe Thomas can jump in here and explain the concern. I noticed something in passing which is probably not

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote: Robert Haas wrote: In the process of investigating this, we found a few other things that seem like they may also be bugs: - As noted upthread, replaying an older checkpoint after a newer checkpoint has already happened may lead to similar problems. This may

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote: Anyway here's a quick script to almost-reproduce the problem. Meh. Really attached now. I also wanted to post the error messages we got: 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG:

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Noah Misch
On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch n...@leadboat.com wrote: On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: SetMultiXactIdLimit() bracketed certain parts of its logic with if (!InRecovery), but those guards

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch
On Fri, May 29, 2015 at 10:37:57AM +1200, Thomas Munro wrote: On Fri, May 29, 2015 at 7:56 AM, Robert Haas robertmh...@gmail.com wrote: - There's a third possible problem related to boundary cases in SlruScanDirCbRemoveMembers, but I don't understand that one well enough to explain it.

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch
Incomplete review, done in a relative rush: On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: OK, here's a patch. Actually two patches, differing only in whitespace, for 9.3 and for master (ha!). I now think that the root of the problem here is that DetermineSafeOldestOffset() and

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Robert Haas
On Sat, May 30, 2015 at 8:55 PM, Andres Freund and...@anarazel.de wrote: Is oldestMulti, nextMulti - 1 really suitable for this? Are both actually guaranteed to exist in the offsets slru and be valid? Hm. I guess you intend to simply truncate everything else, but just in offsets? oldestMulti

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Christoph Berg
Re: Robert Haas 2015-05-29 ca+tgmozzdjn38tfqydgagj-ap+zkrqsrgbq4eu_zrefryk+...@mail.gmail.com FTR: Robert, you have been a Samurai on this issue. Our many thanks. Thanks! I really appreciate the kind words. I'm still watching with admiration. This list of steps-to-reproduce is the longest

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Fri, May 29, 2015 at 11:24 AM, Robert Haas robertmh...@gmail.com wrote: A. Most obviously, we should fix pg_upgrade so that it installs chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so that we stop creating new instances of this problem. That won't get us out of the hole

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Tom Lane
Thomas Munro thomas.mu...@enterprisedb.com writes: On Fri, May 29, 2015 at 11:24 AM, Robert Haas robertmh...@gmail.com wrote: B. We need to change find_multixact_start() to fail softly. Here is an experimental WIP patch that changes StartupMultiXact and SetMultiXactIdLimit to find the oldest

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 10:17 AM, Tom Lane t...@sss.pgh.pa.us wrote: Thomas Munro thomas.mu...@enterprisedb.com writes: On Fri, May 29, 2015 at 11:24 AM, Robert Haas robertmh...@gmail.com wrote: B. We need to change find_multixact_start() to fail softly. Here is an experimental WIP patch that

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 10:48 AM, Andres Freund and...@anarazel.de wrote: On 2015-05-30 10:41:01 +1200, Thomas Munro wrote: On Sat, May 30, 2015 at 10:29 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote: Just saw what looks

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote: Just saw what looks like a report of this issue on 9.2. https://github.com/wal-e/wal-e/issues/177 Urk. That looks awfully similar, but I don't think any of the code that is affected here exists in 9.2, or that any of the

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-30 10:55:30 +1200, Thomas Munro wrote: That's the error message, but then further down: Ooops. I have confirmed that directory pg_multixact/members does not existing in the restored data directory. I can see this directory and the file if i restore a few days old backup. I have

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-30 10:41:01 +1200, Thomas Munro wrote: On Sat, May 30, 2015 at 10:29 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote: Just saw what looks like a report of this issue on 9.2.

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 10:29 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote: Just saw what looks like a report of this issue on 9.2. https://github.com/wal-e/wal-e/issues/177 Urk. That looks awfully similar, but I don't

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:49:53 -0400, Bruce Momjian wrote: I think we need to step back and look at the brain power required to unravel the mess we have made regarding multi-xact and fixes. (I bet few people can even remember which multi-xact fixes went into which releases --- I can't.) Instead of

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 3:08 PM, Robert Haas robertmh...@gmail.com wrote: It won't fix the fact that pg_upgrade is putting a wrong value into everybody's datminmxid field, which should really be addressed too, but I've been working on this for about three days virtually non-stop and I don't

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Andres Freund wrote: I considered for a second whether the solution for that could be to not truncate while inconsistent - but I think that doesn't solve anything as then we can end up with directories where every single offsets/member file exists. Hang on a minute. We don't need to scan

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Bruce Momjian wrote: I think we need to step back and look at the brain power required to unravel the mess we have made regarding multi-xact and fixes. (I bet few people can even remember which multi-xact fixes went into which releases --- I can't.) Instead of working on actual features, we

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 9:46 PM, Andres Freund and...@anarazel.de wrote: On 2015-05-29 15:08:11 -0400, Robert Haas wrote: It seems pretty clear that we can't effectively determine anything about member wraparound until the cluster is consistent. I wonder if this doesn't actually hint at a

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:08:11 -0400, Robert Haas wrote: It seems pretty clear that we can't effectively determine anything about member wraparound until the cluster is consistent. I wonder if this doesn't actually hint at a bigger problem. Currently, to determine where we need to truncate

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 1:46 PM, Andres Freund and...@anarazel.de wrote: On 2015-05-29 15:08:11 -0400, Robert Haas wrote: It seems pretty clear that we can't effectively determine anything about member wraparound until the cluster is consistent. I wonder if this doesn't actually hint at a

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote: Working on that now. OK, here's a patch. Actually two patches, differing only in whitespace, for 9.3 and for master (ha!). I now think that the root of the problem here is that DetermineSafeOldestOffset() and

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Josh Berkus
All, Just saw what looks like a report of this issue on 9.2. https://github.com/wal-e/wal-e/issues/177

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Steve Kehlet
On Fri, May 29, 2015 at 12:08 PM Robert Haas robertmh...@gmail.com wrote: OK, here's a patch. I grabbed branch REL9_4_STABLE from git, and Robert got me a 9.4-specific patch. I rebuilt, installed, and postgres started up successfully! I did a bunch of checks, had our app run several thousand

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Thomas Munro
On Fri, May 29, 2015 at 7:56 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, May 28, 2015 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote: [ speculation ] [...] However, since the vacuum did advance relfrozenxid, it will call vac_truncate_clog, which will call SetMultiXactIdLimit,

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Robert Haas wrote: On Thu, May 28, 2015 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote: [ speculation ] OK, I finally managed to reproduce this, after some off-list help from Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to do it: It's a long list of steps, but

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Joshua D. Drake
On 05/28/2015 12:56 PM, Robert Haas wrote: FTR: Robert, you have been a Samurai on this issue. Our many thanks. Sincerely, jD

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote: [ speculation ] OK, I finally managed to reproduce this, after some off-list help from Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to do it: 1. Install any pre-9.3 version of the server and generate

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid values which are equal to the next-mxid counter instead of the correct value; in other words, they are too new. [ discussion of how the control

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake j...@commandprompt.com wrote: FTR: Robert, you have been a Samurai on this issue. Our many thanks. Thanks! I really appreciate the kind words. So, in thinking through this situation further, it seems to me that the situation is pretty dire: 1.

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Robert Haas wrote: 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid values which are equal to the next-mxid counter instead of the correct value; in other words, they are too new. What you describe is what happens if you upgrade from 9.2 or earlier. For this case we use

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Alvaro Herrera wrote: Robert Haas wrote: 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid values which are equal to the next-mxid counter instead of the correct value; in other words, they are too new. What you describe is what happens if you upgrade from 9.2 or

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:03 AM, Robert Haas robertmh...@gmail.com wrote: Steve, is there any chance we can get your pg_controldata output and a list of all the files in pg_clog? Err, make that pg_multixact/members, which I assume is at issue here. You didn't show us the DETAIL line from this

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Alvaro Herrera
Robert Haas wrote: On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Steve Kehlet wrote: I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we just dropped new binaries in place) but it wouldn't start up. I found this in the logs:

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Well I'm not very clear on what's the problematic case. The scenario I actually saw this first reported was a pg_basebackup taken on a very large database, so the master could have truncated multixact and the

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Steve Kehlet wrote: I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we just dropped new binaries in place) but it wouldn't start up. I found this in the logs: waiting for server to

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Alvaro Herrera
Steve Kehlet wrote: On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera alvhe...@2ndquadrant.com wrote: I think a patch like this should be able to fix it ... not tested yet. Thanks Alvaro. I got a compile error, so looked for other uses of SimpleLruDoesPhysicalPageExist and added

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Steve Kehlet
On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera alvhe...@2ndquadrant.com wrote: I think a patch like this should be able to fix it ... not tested yet. Thanks Alvaro. I got a compile error, so looked for other uses of SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does this look

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Alvaro Herrera
Steve Kehlet wrote: I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we just dropped new binaries in place) but it wouldn't start up. I found this in the logs: waiting for server to start 2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut down at