Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-12-10 Thread Andres Freund
On 2013-12-10 11:12:03 -0800, Josh Berkus wrote:
> On 12/10/2013 10:48 AM, Andres Freund wrote:
> > On 2013-12-10 10:44:30 -0800, Josh Berkus wrote:
> >> On 12/10/2013 10:39 AM, Andres Freund wrote:
> >>> Hi,
> >>>
> >>> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
>  We've just run across a case of this exact issue on 9.2.4.  I thought it
>  was supposed to be 9.3-only?
> >>>
> >>> Could you please describe "this exact issue"?
> >>
> >> Fatal errors due to missing pg_subtrans files on creating a new replica.
> > 
> >> If this is fixed in 9.2.6, great, but I didn't get that impression from
> >> the commits ...
> > 
> > I am pretty sure this thread isn't about the bug you're hitting. You get
> > errors during xid assignment, right? If so, upgrade to 9.2.6; that's
> > fixed (it's the bug in whose wake the replication bug was introduced).
> 
> I thought that only affected 9.2.5?  This machine is 9.2.4, as was the
> prior master.

9.2.5 tried to fix the issue (errors because of missing pg_subtrans
files during xid assignment), but the fix had unintended consequences,
namely the hot-standby initialization bug. 9.2.6 removed those
unintended consequences.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-12-10 Thread Josh Berkus
On 12/10/2013 10:48 AM, Andres Freund wrote:
> On 2013-12-10 10:44:30 -0800, Josh Berkus wrote:
>> On 12/10/2013 10:39 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
 We've just run across a case of this exact issue on 9.2.4.  I thought it
 was supposed to be 9.3-only?
>>>
>>> Could you please describe "this exact issue"?
>>
>> Fatal errors due to missing pg_subtrans files on creating a new replica.
> 
>> If this is fixed in 9.2.6, great, but I didn't get that impression from
>> the commits ...
> 
> I am pretty sure this thread isn't about the bug you're hitting. You get
> errors during xid assignment, right? If so, upgrade to 9.2.6; that's
> fixed (it's the bug in whose wake the replication bug was introduced).

I thought that only affected 9.2.5?  This machine is 9.2.4, as was the
prior master.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-12-10 Thread Andres Freund
On 2013-12-10 10:44:30 -0800, Josh Berkus wrote:
> On 12/10/2013 10:39 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
> >> We've just run across a case of this exact issue on 9.2.4.  I thought it
> >> was supposed to be 9.3-only?
> > 
> > Could you please describe "this exact issue"?
> 
> Fatal errors due to missing pg_subtrans files on creating a new replica.

> If this is fixed in 9.2.6, great, but I didn't get that impression from
> the commits ...

I am pretty sure this thread isn't about the bug you're hitting. You get
errors during xid assignment, right? If so, upgrade to 9.2.6; that's
fixed (it's the bug in whose wake the replication bug was introduced).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-12-10 Thread Josh Berkus
On 12/10/2013 10:39 AM, Andres Freund wrote:
> Hi,
> 
> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
>> We've just run across a case of this exact issue on 9.2.4.  I thought it
>> was supposed to be 9.3-only?
> 
> Could you please describe "this exact issue"?

Fatal errors due to missing pg_subtrans files on creating a new replica.

Sequence:

1. Failed over from master to replica #1

2. Remastered other replicas

3. Tried to create a new replica.

4. New replica started failing with errors similar to the original report.

If this is fixed in 9.2.6, great, but I didn't get that impression from
the commits ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-12-10 Thread Andres Freund
Hi,

On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
> We've just run across a case of this exact issue on 9.2.4.  I thought it
> was supposed to be 9.3-only?

Could you please describe "this exact issue"?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-12-10 Thread Josh Berkus
Andres, all:

We've just run across a case of this exact issue on 9.2.4.  I thought it
was supposed to be 9.3-only?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-12-05 Thread J Smith
On Thu, Nov 28, 2013 at 5:15 AM, Andres Freund  wrote:
> Hi,
>
> Do you still have the core file around? If so could you 'p
> *ShmemVariableCache' and 'p *ControlFile'?
>

So sorry, I didn't see this message until just today. Seems it was
accidentally archived before hitting my eyeballs.

I see that 9.3.2 was released today along with what appears to be some
fixes regarding this and similar issues. Sorry if my missing this
message held anything up.

We still have the core file kicking around, so here's the output at any rate.

(gdb) p *ShmemVariableCache
$1 = {nextOid = 8036795, oidCount = 6974, nextXid = 10260628,
  oldestXid = 1673, xidVacLimit = 21673, xidWarnLimit = 2136485320,
  xidStopLimit = 2146485320, xidWrapLimit = 2147485320, oldestXidDB = 1,
  latestCompletedXid = 10260621}

(gdb) p *ControlFile
$2 = {system_identifier = 5942827484423487452, pg_control_version = 937,
  catalog_version_no = 201306121, state = DB_IN_PRODUCTION,
  time = 1385234938, checkPoint = 381771294048,
  prevCheckPoint = 381771293888, checkPointCopy = {redo = 381771293992,
    ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = 1 '\001',
    nextXidEpoch = 0, nextXid = 10217377, nextOid = 8035577,
    nextMulti = 13448295, nextMultiOffset = 32161320, oldestXid = 1673,
    oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1,
    time = 1385234938, oldestActiveXid = 10217377}, unloggedLSN = 1,
  minRecoveryPoint = 0, minRecoveryPointTLI = 0, backupStartPoint = 0,
  backupEndPoint = 0, backupEndRequired = 0 '\000', wal_level = 2,
  MaxConnections = 200, max_prepared_xacts = 0, max_locks_per_xact = 64,
  maxAlign = 8, floatFormat = 1234567, blcksz = 8192,
  relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216,
  nameDataLen = 64, indexMaxKeys = 32, toast_max_chunk_size = 1996,
  enableIntTimes = 1 '\001', float4ByVal = 1 '\001',
  float8ByVal = 1 '\001', data_checksum_version = 0, crc = 517534097}

(gdb)

Cheers.




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-28 Thread Andres Freund
Hi,

On 2013-11-24 16:56:26 -0500, J Smith wrote:
> coredumper worked like a charm. Useful tool, that is... although as a
> bit of advice, I'd try not to run it on Postgres if your various
> memory settings are tweaked towards production use -- the core dump
> that was captured on my server weighed in at 16 GB.
> 
> Anyways, I've attached both the log entries that caused the error as
> well as a backtrace using `bt full`.

Do you still have the core file around? If so could you 'p
*ShmemVariableCache' and 'p *ControlFile'?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-27 Thread Andres Freund
On 2013-11-27 13:57:52 -0300, Alvaro Herrera wrote:
> Per bug report by J Smith in
> cadfupgc5bmtv-yg9znxv-vcfkb+jprqs7m2oesqxam_4z1j...@mail.gmail.com
> diagnosed by Andres Freund.

Alvaro, do you see a way this could actually have caused J.'s problems?
I thought about a few, but each turned out to not really be
possible...
I can easily see why we would uselessly wait or worse, but the
pg_subtrans errors I can't really explain with it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-27 Thread Alvaro Herrera
Alvaro Herrera escribió:
> Andres Freund escribió:

> This seems simple to handle by adding the check you propose to the loop.
> Basically if the xmax doesn't match the xmin, we reached the end,
> there's nothing more to lock and we can return success without any
> further work:

As mentioned in the thread for bug #8434, the complete working patch for
this is attached.

> > b) Check whether a chain element actually aborted - currently we're
> >only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
> >wrong (we can't check for committed tho!).
> 
> Let me point out that this is exactly the same code that would be
> affected by my proposed fix for #8434, which would have this check the
> updateXid in all cases, not only in KEYS_UPDATED as currently.

I posted a patch for this problem in the thread about #8434.

-- 
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
commit 1a4cadc729e724fa1aa6f260a4988b3615fccd1b
Author: Alvaro Herrera 
Date:   Wed Nov 27 12:12:42 2013 -0300

Compare new tuple's Xmin to previous Xmax while following an update chain

Not doing so causes us to traverse an update chain that has been broken
by concurrent page pruning.  All other code that traverses update chains
is careful to do this check, so replicate it here too.  Failure to do so
leads to erroneous CLOG, subtrans or multixact lookups.

Per bug report by J Smith in
cadfupgc5bmtv-yg9znxv-vcfkb+jprqs7m2oesqxam_4z1j...@mail.gmail.com
diagnosed by Andres Freund.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c3b2108..5eb45ff 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4819,6 +4819,7 @@ heap_lock_updated_tuple_rec(Relation rel, ItemPointer tid, TransactionId xid,
 old_infomask;
 	TransactionId xmax,
 new_xmax;
+	TransactionId priorXmax = InvalidTransactionId;
 
 	ItemPointerCopy(tid, &tupid);
 
@@ -4844,6 +4845,18 @@ l4:
 		CHECK_FOR_INTERRUPTS();
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
+		/*
+		 * Check the tuple XMIN against prior XMAX, if any.  If we reached
+		 * the end of the chain, we're done, so return success.
+		 */
+		if (TransactionIdIsValid(priorXmax) &&
+			!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data),
+ priorXmax))
+		{
+			UnlockReleaseBuffer(buf);
+			return HeapTupleMayBeUpdated;
+		}
+
 		old_infomask = mytup.t_data->t_infomask;
 		xmax = HeapTupleHeaderGetRawXmax(mytup.t_data);
 
@@ -4944,6 +4957,7 @@ l4:
 		}
 
 		/* tail recursion */
+		priorXmax = HeapTupleHeaderGetUpdateXid(mytup.t_data);
 		ItemPointerCopy(&(mytup.t_data->t_ctid), &tupid);
 		UnlockReleaseBuffer(buf);
 	}



Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread Andres Freund
On 2013-11-25 18:06:30 -0300, Alvaro Herrera wrote:
> > I mean that in the !KEYS_UPDATED case we don't need to abort if we're
> > only acquiring a key share...
> 
> Hm, I think that's correct -- we don't need to abort.  But we still need
> to wait until the updater completes.  So this proposed patch is not the
> full story.

Hm. Why do we need to wait in that case? Isn't the entire point of KEY
SHARE locks *not* having to wait for !KEYS_UPDATED? ISTM in that case we
should only check whether the creating transaction has aborted because
in that case we don't need to take out a lock?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread Alvaro Herrera
Andres Freund escribió:
> On 2013-11-25 17:10:39 -0300, Alvaro Herrera wrote:

> > Let me point out that this is exactly the same code that would be
> > affected by my proposed fix for #8434, which would have this check the
> > updateXid in all cases, not only in KEYS_UPDATED as currently.
> 
> Hm. I don't see a patch in any of the mails about #8434 although I seem
> to remember talking with you about one. Maybe that was on IM?

Ah, yeah, that's possible.  The patch I proposed back then is attached
here.  I haven't merged this to latest master; this applies cleanly on
top of 16a906f5350.

> > I don't understand your comment about "can't check for committed".  I
> > mean, if the updating transaction committed, then we need to return
> > failure to caller, HeapTupleUpdated.
> 
> I mean that in the !KEYS_UPDATED case we don't need to abort if we're
> only acquiring a key share...

Hm, I think that's correct -- we don't need to abort.  But we still need
to wait until the updater completes.  So this proposed patch is not the
full story.

-- 
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
commit 71aa5c689c153a03bc09e7f9f433ca55a6f00a23
Author: Alvaro Herrera 
Date:   Mon Oct 7 18:33:59 2013 -0300

Don't skip waiting on tuple updaters if key is unchanged

My commit 0ac5ad5134 contained a bug that allowed lockers to avoid
sleeping waiting for a future version of the tuple being locked to be
released.  This happens while following the update chain to lock the
updated versions of the row.  Prior to that commit, there was no such
behavior, so backpatch no further than 9.3.

The isolation test changes illustrate the difference in behavior.  Some
transactions which were previously allowed to continue must now sleep
waiting for the other transaction; but there is still more concurrency
than there was prior to the mentioned commit.

Per bug #8434 reported by Tomonari Katsumata

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ead3d69..530a001 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4832,15 +4832,12 @@ l4:
 		xmax = HeapTupleHeaderGetRawXmax(mytup.t_data);
 
 		/*
-		 * If this tuple is updated and the key has been modified (or
-		 * deleted), what we do depends on the status of the updating
-		 * transaction: if it's live, we sleep until it finishes; if it has
-		 * committed, we have to fail (i.e. return HeapTupleUpdated); if it
-		 * aborted, we ignore it. For updates that didn't touch the key, we
-		 * can just plough ahead.
+		 * If this tuple is updated, what we do depends on the status of the
+		 * updating transaction: if it's live, we sleep until it finishes; if
+		 * it has committed, we have to fail (i.e. return HeapTupleUpdated); if
+		 * it aborted, we ignore it.
 		 */
-		if (!(old_infomask & HEAP_XMAX_INVALID) &&
-			(mytup.t_data->t_infomask2 & HEAP_KEYS_UPDATED))
+		if (!(old_infomask & HEAP_XMAX_INVALID))
 		{
 			TransactionId update_xid;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 791f336..efe7d5a 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1970,7 +1970,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
 			 * heap_lock_tuple() will throw an error, and so would any later
 			 * attempt to update or delete the tuple.  (We need not check cmax
 			 * because HeapTupleSatisfiesDirty will consider a tuple deleted
-			 * by our transaction dead, regardless of cmax.) Wee just checked
+			 * by our transaction dead, regardless of cmax.) We just checked
 			 * that priorXmax == xmin, so we can test that variable instead of
 			 * doing HeapTupleHeaderGetXmin again.
 			 */
diff --git a/src/test/isolation/expected/fk-deadlock.out b/src/test/isolation/expected/fk-deadlock.out
index 69eac88..bbd1867 100644
--- a/src/test/isolation/expected/fk-deadlock.out
+++ b/src/test/isolation/expected/fk-deadlock.out
@@ -11,25 +11,22 @@ step s2c: COMMIT;
 starting permutation: s1i s1u s2i s1c s2u s2c
 step s1i: INSERT INTO child VALUES (1, 1);
 step s1u: UPDATE parent SET aux = 'bar';
-step s2i: INSERT INTO child VALUES (2, 1);
+step s2i: INSERT INTO child VALUES (2, 1); <waiting ...>
 step s1c: COMMIT;
+step s2i: <... completed>
 step s2u: UPDATE parent SET aux = 'baz';
 step s2c: COMMIT;
 
 starting permutation: s1i s1u s2i s2u s1c s2c
 step s1i: INSERT INTO child VALUES (1, 1);
 step s1u: UPDATE parent SET aux = 'bar';
-step s2i: INSERT INTO child VALUES (2, 1);
-step s2u: UPDATE parent SET aux = 'baz'; <waiting ...>
-step s1c: COMMIT;
-step s2u: <... completed>
-step s2c: COMMIT;
+step s2i: INSERT INTO child VALUES (2, 1); <waiting ...>
+invalid permutation detected
 
 starting permutation: s1i s1u s2i s2u s2c s1c
 step s1i: INSERT INTO child VALUES (1, 1);
 step s1u: UPDATE parent SET aux = 'bar';
-step s2i: INSERT INTO child VALUES (2, 1);
-step s

Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread Andres Freund
On 2013-11-25 17:10:39 -0300, Alvaro Herrera wrote:
> > I am not sure whether that's the origin of the problem but at the very
> > least it seems to me that heap_lock_updated_tuple_rec() is missing
> > several important pieces:
> > a) do the priorXmax==xmin dance to check we're still following the same
> >ctid chain. Currently we could easily stumble across completely
> >unrelated tuples if a chain element aborted and got vacuumed.
> 
> This seems simple to handle by adding the check you propose to the loop.
> Basically if the xmax doesn't match the xmin, we reached the end,
> there's nothing more to lock and we can return success without any
> further work:

Right, that's what we do in other places (notably EvalPlanQualFetch()).

> > b) Check whether a chain element actually aborted - currently we're
> >only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
> >wrong (we can't check for committed tho!).
> 
> Let me point out that this is exactly the same code that would be
> affected by my proposed fix for #8434, which would have this check the
> updateXid in all cases, not only in KEYS_UPDATED as currently.

Hm. I don't see a patch in any of the mails about #8434 although I seem
to remember talking with you about one. Maybe that was on IM?

> I don't understand your comment about "can't check for committed".  I
> mean, if the updating transaction committed, then we need to return
> failure to caller, HeapTupleUpdated.

I mean that in the !KEYS_UPDATED case we don't need to abort if we're
only acquiring a key share...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread Alvaro Herrera
Andres Freund escribió:

> Ok, this is helpful. Do you rather longrunning transactions? The
> transaction that does foreign key checks has an xid of 10260613, while
> the row that's getting checked has 13514992.

Thanks for the analysis.

> #4  0x00635dc7 in XactLockTableWait (xid=13514992) at lmgr.c:501
> tag = {locktag_field1 = 13514992, locktag_field2 = 0, locktag_field3 
> = 0, locktag_field4 = 0, locktag_type = 4 '\004', locktag_lockmethodid = 1 
> '\001'}
> #5  0x00482223 in heap_lock_updated_tuple_rec (rel=0x2b20f050a8d0, 
> tuple=, ctid=, xid=10260613, 
> mode=LockTupleKeyShare) at heapam.c:4847
> 
> I am not sure whether that's the origin of the problem but at the very
> least it seems to me that heap_lock_updated_tuple_rec() is missing
> several important pieces:
> a) do the priorXmax==xmin dance to check we're still following the same
>ctid chain. Currently we could easily stumble across completely
>unrelated tuples if a chain element aborted and got vacuumed.

This seems simple to handle by adding the check you propose to the loop.
Basically if the xmax doesn't match the xmin, we reached the end,
there's nothing more to lock and we can return success without any
further work:

/*
 * Check the tuple XMIN against prior XMAX, if any.  If we 
reached
 * the end of the chain, we're done, so return success.
 */
if (TransactionIdIsValid(priorXmax) &&
!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data),
 priorXmax))
{
UnlockReleaseBuffer(buf);
return HeapTupleMayBeUpdated;
}


> b) Check whether a chain element actually aborted - currently we're
>only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
>wrong (we can't check for committed tho!).

Let me point out that this is exactly the same code that would be
affected by my proposed fix for #8434, which would have this check the
updateXid in all cases, not only in KEYS_UPDATED as currently.

I don't understand your comment about "can't check for committed".  I
mean, if the updating transaction committed, then we need to return
failure to caller, HeapTupleUpdated.  This signals to the caller that
the future version of the tuple is locked and the whole thing needs to
be failed.  (In READ COMMITTED isolation level, this would cause
EvalPlanQual to fetch the updated version of the tuple and redo the
whole thing from there.  In REPEATABLE READ and above, the whole
operation would be aborted.)

In short I propose to fix part of this with the simple patch I proposed
for bug #8434.

> c) (reported separately as well) cope with failure of heap_fetch() to
>return anything.

Proposed patch for this was posted in the other thread.

-- 
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread J Smith
On Mon, Nov 25, 2013 at 11:46 AM, Alvaro Herrera
 wrote:
> J Smith escribió:
>
>> We did have some long-running transactions, yes. We refactored a bit
>> and removed them and the problem ceased on our end. We ended up
>> reverting our changes for the sake of running this experiment over the
>> weekend and the errors returned. We've since restored our fix and
>> haven't had any problems since, so yeah, long-running transactions
>> appear to be involved.
>>
>> We can continue to experiment if you have any additional tests you'd
>> like us to run. We may have to keep the experiments to running over
>> the weekend, but they're definitely do-able.
>
> I am working on patches to get these bugs fixed.  Would you be up for
> running a patched binary and see if the errors go away?

Sure, just give me a git commit hash and I can do a test build towards
the weekend. I'll include coredumper again just in case things go awry
and we can go from there.

Cheers




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread Alvaro Herrera
J Smith escribió:

> We did have some long-running transactions, yes. We refactored a bit
> and removed them and the problem ceased on our end. We ended up
> reverting our changes for the sake of running this experiment over the
> weekend and the errors returned. We've since restored our fix and
> haven't had any problems since, so yeah, long-running transactions
> appear to be involved.
> 
> We can continue to experiment if you have any additional tests you'd
> like us to run. We may have to keep the experiments to running over
> the weekend, but they're definitely do-able.

I am working on patches to get these bugs fixed.  Would you be up for
running a patched binary and see if the errors go away?

-- 
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread J Smith
On Mon, Nov 25, 2013 at 6:47 AM, Andres Freund  wrote:
> Hi,
>
> On 2013-11-24 16:56:26 -0500, J Smith wrote:
>
>> Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR:  could not 
>> access status of transaction 13514992
>> Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL:  Could 
>> not open file "pg_subtrans/00CE": Success.
>> Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT:  SQL 
>> statement "SELECT 1 FROM ONLY "dev"."collection_batches" x WHERE "id" 
>> OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
>
>
> Ok, this is helpful. Do you have rather long-running transactions? The
> transaction that does foreign key checks has an xid of 10260613, while
> the row that's getting checked has 13514992.

We did have some long-running transactions, yes. We refactored a bit
and removed them and the problem ceased on our end. We ended up
reverting our changes for the sake of running this experiment over the
weekend and the errors returned. We've since restored our fix and
haven't had any problems since, so yeah, long-running transactions
appear to be involved.

We can continue to experiment if you have any additional tests you'd
like us to run. We may have to keep the experiments to running over
the weekend, but they're definitely do-able.

Cheers




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-25 Thread Andres Freund
Hi,

On 2013-11-24 16:56:26 -0500, J Smith wrote:
> coredumper worked like a charm. Useful tool, that is... although as a
> bit of advice, I'd try not to run it on Postgres if your various
> memory settings are tweaked towards production use -- the core dump
> that was captured on my server weighed in at 16 GB.

> Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR:  could not 
> access status of transaction 13514992
> Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL:  Could not 
> open file "pg_subtrans/00CE": Success.
> Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT:  SQL 
> statement "SELECT 1 FROM ONLY "dev"."collection_batches" x WHERE "id" 
> OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"


Ok, this is helpful. Do you have rather long-running transactions? The
transaction that does foreign key checks has an xid of 10260613, while
the row that's getting checked has 13514992.

#4  0x00635dc7 in XactLockTableWait (xid=13514992) at lmgr.c:501
tag = {locktag_field1 = 13514992, locktag_field2 = 0, locktag_field3 = 
0, locktag_field4 = 0, locktag_type = 4 '\004', locktag_lockmethodid = 1 '\001'}
#5  0x00482223 in heap_lock_updated_tuple_rec (rel=0x2b20f050a8d0, 
tuple=, ctid=, xid=10260613, 
mode=LockTupleKeyShare) at heapam.c:4847

I am not sure whether that's the origin of the problem but at the very
least it seems to me that heap_lock_updated_tuple_rec() is missing
several important pieces:
a) do the priorXmax==xmin dance to check we're still following the same
   ctid chain. Currently we could easily stumble across completely
   unrelated tuples if a chain element aborted and got vacuumed.
b) Check whether a chain element actually aborted - currently we're
   only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
   wrong (we can't check for committed tho!).
c) (reported separately as well) cope with failure of heap_fetch() to
   return anything.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-24 Thread J Smith
coredumper worked like a charm. Useful tool, that is... although as a
bit of advice, I'd try not to run it on Postgres if your various
memory settings are tweaked towards production use -- the core dump
that was captured on my server weighed in at 16 GB.

Anyways, I've attached both the log entries that caused the error as
well as a backtrace using `bt full`.

Cheers, hope it helps.
Logfile:

Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR:  could not 
access status of transaction 13514992
Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL:  Could not 
open file "pg_subtrans/00CE": Success.
Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT:  SQL 
statement "SELECT 1 FROM ONLY "dev"."collection_batches" x WHERE "id" 
OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 23 14:38:32 dev postgres[23810]: [4-4] user=dev,db=dev STATEMENT:  update 
provider_listings set deactivated_at=$1, effective_provider_id=$2, 
first_collection_date=$3, last_modified=$4, latest_collection_batch_id=$5, 
latest_collection_date=$6, latest_notfound_collection_batch_id=$7, 
latest_notfound_date=$8, latest_raw_listing_id=$9, latest_raw_listing_hash=$10, 
latitude=$11, lock_version=$12, longitude=$13, process_state=$14, 
provider_id=$15, source_listing_key=$16, source_provider_key=$17, status=$18 
where id=$19 and lock_version=$20


Backtrace:

[jay@dev local]% gdb /usr/pgsql-9.3/bin/postgres 
core.23810.2013-11-23.14:35:28.EST
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /usr/pgsql-9.3/bin/postgres...Reading symbols from 
/usr/lib/debug/usr/pgsql-9.3/bin/postgres.debug...done.
done.
[New Thread 23810]

warning: .dynamic section for "/usr/lib64/libxslt.so.1" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/usr/lib64/libxml2.so.2" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libpam.so.0" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libssl.so.6" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libcrypto.so.6" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/usr/lib64/libgssapi_krb5.so.2" is not at the 
expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libcrypt.so.1" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libdl.so.2" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libm.so.6" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/usr/lib64/libldap-2.3.so.0" is not at the 
expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libc.so.6" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/usr/lib64/libkrb5.so.3" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libcom_err.so.2" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libz.so.1" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libaudit.so.0" is not at the expected 
address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/usr/lib64/libk5crypto.so.3" is not at the 
expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/usr/lib64/libkrb5support.so.0" is not at the 
expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libkeyutils.so.1" is not at the expected 
address

war

Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-23 Thread J Smith
On Tue, Nov 19, 2013 at 10:16 AM, J Smith  wrote:
> Alright, we'll look into doing that heading into the weekend.
> Interestingly, we haven't experienced the issue since our main Java
> developer made some modifications to our backend system. I'm not
> entirely sure what the changes entail except that it's a one-liner
> that involves re-SELECTing a table during a transaction. We'll
> rollback this change and re-compile Postgres with google-coredumper
> and let it run over the weekend and see where we stand.
>

Okay, I've patched our Postgres install and added in a call for
coredumper to output a core dump. We'll let our processes run for the
next few days and hopefully we'll get something we can look at. I've
attached the patch just in case anyone would like to make sure I've
done it in a sane way and that I've inserted the call in the correct
location.

Cheers


postgresql-coredumper.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-19 Thread J Smith
Alright, we'll look into doing that heading into the weekend.
Interestingly, we haven't experienced the issue since our main Java
developer made some modifications to our backend system. I'm not
entirely sure what the changes entail except that it's a one-liner
that involves re-SELECTing a table during a transaction. We'll
rollback this change and re-compile Postgres with google-coredumper
and let it run over the weekend and see where we stand.

Cheers

On Tue, Nov 19, 2013 at 9:14 AM, Robert Haas  wrote:
> On Fri, Nov 15, 2013 at 4:01 PM, J Smith  wrote:
>> On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas  wrote:
>>> I think what would help the most is if you could arrange to obtain a
>>> stack backtrace at the point when the error is thrown.  Maybe put a
>>> long sleep call in just before the error happens, and when it gets
>>> stuck there, attach gdb and run bt full.
>>>
>>
>> That could potentially be doable. Perhaps I could use something like
>> google-coredumper or something similar to have a core dump generated
>> if the error comes up? Part of the problem is that the error is so
>> sporadic that it's going to be tough to say when the next one will
>> occur. For instance, we haven't changed our load on the server, yet
>> the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid
>> blocking on the server with sleep or anything like that unless
>> absolutely necessary, as there are other services we have in
>> development that are using other databases on this cluster. (I can as
>> a matter of last resort, of course, but if google-coredumper can do
>> the job I'd like to give that a shot first.)
>>
>> Any hints on where I could insert something like this? Should I try
>> putting it into the section of elog.c dealing with ENOENT errors, or
>> try to find a spot closer to where the file itself is being opened? I
>> haven't looked at Postgres internals for a while now so I'm not quite
>> sure of the best location for this sort of thing.
>
> I'd look for the specific ereport() call that's firing, and put it
> just before that.
>
> (note that setting the error verbosity to 'verbose' will give you the
> file and line number where the error is happening, which is useful if
> the message can be generated from more than one place)
>
> I'm not familiar with google-coredumper but it sounds like a promising
> technique.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-19 Thread Robert Haas
On Fri, Nov 15, 2013 at 4:01 PM, J Smith  wrote:
> On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas  wrote:
>> I think what would help the most is if you could arrange to obtain a
>> stack backtrace at the point when the error is thrown.  Maybe put a
>> long sleep call in just before the error happens, and when it gets
>> stuck there, attach gdb and run bt full.
>>
>
> That could potentially be doable. Perhaps I could use something like
> google-coredumper or something similar to have a core dump generated
> if the error comes up? Part of the problem is that the error is so
> sporadic that it's going to be tough to say when the next one will
> occur. For instance, we haven't changed our load on the server, yet
> the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid
> blocking on the server with sleep or anything like that unless
> absolutely necessary, as there are other services we have in
> development that are using other databases on this cluster. (I can as
> a matter of last resort, of course, but if google-coredumper can do
> the job I'd like to give that a shot first.)
>
> Any hints on where I could insert something like this? Should I try
> putting it into the section of elog.c dealing with ENOENT errors, or
> try to find a spot closer to where the file itself is being opened? I
> haven't looked at Postgres internals for a while now so I'm not quite
> sure of the best location for this sort of thing.

I'd look for the specific ereport() call that's firing, and put it
just before that.

(note that setting the error verbosity to 'verbose' will give you the
file and line number where the error is happening, which is useful if
the message can be generated from more than one place)

I'm not familiar with google-coredumper but it sounds like a promising
technique.
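
(Editor's aside: the two suggestions above combine into a short recipe. This is a sketch only — the pid is the one from the log excerpt, the paths are placeholders, and in 9.3 log_error_verbosity has to be set in postgresql.conf:)

```shell
# Sketch of the workflow described above; pid and paths are placeholders.
# 1. Log file:line with every error message:
echo "log_error_verbosity = verbose" >> "$PGDATA/postgresql.conf"
pg_ctl reload -D "$PGDATA"
# 2. With a long sleep patched in just before the ereport(), attach to
#    the stuck backend and capture the backtrace non-interactively:
gdb --batch -p 23810 -ex 'bt full' /usr/pgsql-9.3/bin/postgres
```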

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-15 Thread J Smith
On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas  wrote:
>
> I think what would help the most is if you could arrange to obtain a
> stack backtrace at the point when the error is thrown.  Maybe put a
> long sleep call in just before the error happens, and when it gets
> stuck there, attach gdb and run bt full.
>

That could potentially be doable. Perhaps I could use something like
google-coredumper or something similar to have a core dump generated
if the error comes up? Part of the problem is that the error is so
sporadic that it's going to be tough to say when the next one will
occur. For instance, we haven't changed our load on the server, yet
the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid
blocking on the server with sleep or anything like that unless
absolutely necessary, as there are other services we have in
development that are using other databases on this cluster. (I can as
a matter of last resort, of course, but if google-coredumper can do
the job I'd like to give that a shot first.)

Any hints on where I could insert something like this? Should I try
putting it into the section of elog.c dealing with ENOENT errors, or
try to find a spot closer to where the file itself is being opened? I
haven't looked at Postgres internals for a while now so I'm not quite
sure of the best location for this sort of thing.

Cheers




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-15 Thread Robert Haas
On Wed, Nov 13, 2013 at 12:29 PM, J Smith  wrote:
> Looks like we got another set of errors overnight. Here's the log file
> from the errors. (Log file scrubbed slightly to remove private data,
> but still representative of the problem I believe.)
>
> Nov 13 05:34:34 dev postgres[6084]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6084]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6084]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6084]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6076]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6076]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6076]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6076]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6087]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6087]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6087]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6087]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6086]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6086]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6086]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6086]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6088]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6088]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6088]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6088]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6085]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6085]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6085]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6085]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
>
> Several processes all seemed to hit the problem at the same moment,
> and all of them refer to the same transaction ID. Again, the file
> pg_subtrans/0060 doesn't exist, and the only file that does exist is
> pg_subtrans/005A which appears to be a zeroed-out file 245760 bytes in
> length.
>
> Still don't have a clue as to how I can reproduce the problem. It
> seems that in all cases the error occurred during either an UPDATE to
> a table_X or an INSERT to table_Y. In all cases, the error occurred in
> a manner identical to those shown in the log above, the only
> difference being either an UPDATE on table_X or an INSERT on table_Y.
>
> Not sure what direction I should head to now. Perhaps some aggressive
> logging would help, so we can see the queries surrounding the
> problems? I could reconfigure things to capture all statements and set
> up monit or something to send an alert when the problem resurfaces,
> for instance.
>
> Cheers all.

I think what would help the most is if you could arrange to obtain a
stack backtrace at the point when the error is thrown.  Maybe put a
long sleep call in just before the error happens, and when it gets
stuck there, attach gdb and run bt full.

Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-13 Thread J Smith
Looks like we got another set of errors overnight. Here's the log file
from the errors. (Log file scrubbed slightly to remove private data,
but still representative of the problem I believe.)

Nov 13 05:34:34 dev postgres[6084]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6084]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6084]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6084]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6076]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6076]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6076]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6076]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6087]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6087]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6087]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6087]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6086]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6086]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6086]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6086]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6088]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6088]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6088]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6088]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6085]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6085]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6085]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6085]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3

Several processes all seemed to hit the problem at the same moment,
and all of them refer to the same transaction ID. Again, the file
pg_subtrans/0060 doesn't exist, and the only file that does exist is
pg_subtrans/005A which appears to be a zeroed-out file 245760 bytes in
length.
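
(Editor's aside: that file size is consistent with a partially filled SLRU segment — 245760 bytes is exactly 30 of the 32 8 kB pages a full segment would hold, assuming a default build. A quick check:)

```shell
# How many 8 kB SLRU pages fit in the 245760-byte pg_subtrans segment?
awk 'BEGIN { print 245760 / 8192 }'   # prints 30
```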

Still don't have a clue as to how I can reproduce the problem. It
seems that in all cases the error occurred during either an UPDATE to
a table_X or an INSERT to table_Y. In all cases, the error occurred in
a manner identical to those shown in the log above, the only
difference being either an UPDATE on table_X or an INSERT on table_Y.

Not sure what direction I should head to now. Perhaps some aggressive
logging would help, so we can see the queries surrounding the
problems? I could reconfigure things to capture all statements and set
up monit or something to send an alert when the problem resurfaces,
for instance.

Cheers all.




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread J Smith
On Tue, Nov 12, 2013 at 11:55 AM, Andres Freund  wrote:
> Hi,
>
> On 2013-11-12 11:46:19 -0500, J Smith wrote:
>> > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
>>
>> Yes it does.
>
> Could you show the output? Do you actually use prepared xacts actively?

jay:dev@jagger=# select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)

We don't actively use prepared transactions, no. We use a lot of
prepared statements, but no two-phase committing.

> Do you actively use row level locking?

Not to my knowledge, no. I've enabled statement logging for a little
while so we can get a snapshot of what our average queries look like,
but I'm not seeing any PREPARE TRANSACTION or explicit LOCK
statements.

> Is there high concurrency in that environment? In short, could you describe 
> your database usage a bit?

This particular database is used for an aggregation service. There are
a handful of Hibernate-based Java processes that manipulate and
massage the data from a number of data feeds we receive 24-7. From a
quick grepping of the statement log file we captured the following
over the course of approximately 13 minutes:

167452 SELECTs
72388 UPDATEs
6782 INSERTs
2 DELETEs
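
(Editor's aside: a breakdown like that can be pulled straight from a statement log. The log path and syslog-style line format here are assumptions, with log_statement = 'all':)

```shell
# Count statement types in a statement log; path and format are assumptions.
logfile=${1:-postgresql.log}
for kind in select update insert delete; do
    printf '%-7s %s\n' "$kind" \
        "$(grep -ciE "statement:[[:space:]]*$kind" "$logfile")"
done
```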

> Did you restart the database since getting the errors?

Yes, we've restarted several times.

>> Latest checkpoint's NextXID:  0/5138174
>> Latest checkpoint's NextOID:  3574036
>> Latest checkpoint's NextMultiXactId:  5762623
>> Latest checkpoint's NextMultiOffset:  11956824
>> Latest checkpoint's oldestXID:1673
>> Latest checkpoint's oldestXID's DB:   1
>> Latest checkpoint's oldestActiveXID:  5138174
>> Latest checkpoint's oldestMultiXid:   1
>
> Interesting. Pretty high multixact usage.
>

Part of our process involves collecting several records and
aggregating them within a single transaction. There will usually be a
handful of SELECTs to fetch the data, a bit of processing in the Java
backend, then a set of either UPDATEs or INSERTs into two tables to
store the results of the processing or in rarer cases a couple of
DELETEs.

Our Java guy would have more details, as I just have the output of the
log files and a high-level view of how our backend processes work, but
this is the general gist of things as I understand it.

Cheers




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread J Smith
On Tue, Nov 12, 2013 at 11:54 AM, Stephen Frost  wrote:
>
> Did you also upgrade to PostGIS 2.x as part of this..?  Seems like it'd
> be unrelated, but one never knows.  Any chance you could distill this
> down into a small test case which exhibits the problem?  I'm guessing
> 'no', but figured I'd ask anyway..
>

Yeah, we upgraded to PostGIS 2.1 as part of the upgrade. In the case
of the specific database that is exhibiting the problem, we only
really use PostGIS for storage and don't do a lot of spatial querying,
as this database is basically for an aggregation service we run. In
the case of the transactions that appear to be causing the problems,
we don't even use geometry fields at all.

Cheers




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread Andres Freund
Hi,

On 2013-11-12 11:46:19 -0500, J Smith wrote:
> > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
> 
> Yes it does.

Could you show the output? Do you actually use prepared xacts actively?

Do you actively use row level locking? Is there high concurrency in that
environment? In short, could you describe your database usage a bit?

Did you restart the database since getting the errors?

> Latest checkpoint's NextXID:  0/5138174
> Latest checkpoint's NextOID:  3574036
> Latest checkpoint's NextMultiXactId:  5762623
> Latest checkpoint's NextMultiOffset:  11956824
> Latest checkpoint's oldestXID:1673
> Latest checkpoint's oldestXID's DB:   1
> Latest checkpoint's oldestActiveXID:  5138174
> Latest checkpoint's oldestMultiXid:   1

Interesting. Pretty high multixact usage.
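
(Editor's aside: the observation can be quantified from the pg_controldata numbers quoted above — NextMultiXactId actually exceeds NextXID, i.e. on the order of one multixact per transaction, which points at very heavy row-level locking:)

```shell
# Ratio of multixacts to xids consumed, from the pg_controldata
# output quoted above (NextMultiXactId / NextXID).
awk 'BEGIN { printf "%.2f\n", 5762623 / 5138174 }'   # prints 1.12
```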

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread Stephen Frost
* J Smith (dark.panda+li...@gmail.com) wrote:
> We haven't been able to use pg_upgrade as we rely heavily on PostGIS
> and do hard upgrades via pg_dump and postgis_restore.pl when we
> upgrade.

Did you also upgrade to PostGIS 2.x as part of this..?  Seems like it'd
be unrelated, but one never knows.  Any chance you could distill this
down into a small test case which exhibits the problem?  I'm guessing
'no', but figured I'd ask anyway..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
> He referred to using pg_dumpall/pg_dump. But that bug was erroring out
> on pg_clog, not pg_subtrans, right?

Yeah, that was pg_clog.  Obviously responded before really looking at
it. :)

> My gut feeling is that it's related to foreign key locks doing
> XactLockTableWait()s on too old transactions or something like that.

Ugh, that wouldn't be good..

Thanks,

Stephen




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread J Smith
G'day Andres.

On Tue, Nov 12, 2013 at 11:13 AM, Andres Freund  wrote:
> Hi,
>
> On 2013-11-12 10:56:55 -0500, J Smith wrote:
>> G'day list. Didn't get any interest in pgsql-general, thought I'd try
>> my luck here, which perhaps would be more fitting in case I've
>> stumbled upon an edge case issue or something...
>
> Normally the bug report form/the -bugs mailing list is the right one for
> things like this.

Ah, okay. I can redirect there if that's the better location.
(Assuming it's a bug and not a misconfiguration on my part, although
the configuration on these servers has been consistent since 9.0, and
we've upgraded through each of the 9.x series while merging together
new settings accordingly.)

>> ERROR could not access status of transaction 4179979
>> Could not open file "pg_subtrans/003F": No such file or directory.
>
> A couple of questions:
> * Did you have any OS, hardware crashes?

Haven't had any issues on this machine.

> * Do you run with fsync=off or full_page_writes=off?

Both fsync and full_page_writes are in their default "on"
configurations. I'll post our full postgresql.conf below, as it is
quite short.

> * Could you post the pg_controldata output?

Sure, posted below.

> * Is it always the same xid?

Seems to be different every time.

> * Does SELECT count(*) FROM pg_prepared_xacts; return 0?

Yes it does.

Here's the postgresql.conf and pg_controldata dump.

# postgresql.conf

listen_addresses = '*'
wal_level = hot_standby
checkpoint_completion_target = 0.7
max_wal_senders = 3
wal_keep_segments = 64
logging_collector = off
datestyle = 'iso, mdy'
timezone = 'Canada/Eastern'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
effective_cache_size = 44GB
work_mem = 320MB
wal_buffers = 4MB
checkpoint_segments = 64
shared_buffers = 15GB
max_connections = 200

# pg_controldata

pg_control version number:            937
Catalog version number:   201306121
Database system identifier:   5942827484423487452
Database cluster state:   in production
pg_control last modified: Tue 12 Nov 2013 11:39:07 AM EST
Latest checkpoint location:   30/50F7A180
Prior checkpoint location:30/4914C428
Latest checkpoint's REDO location:30/4CB2C450
Latest checkpoint's REDO WAL file:    00010030004C
Latest checkpoint's TimeLineID:   1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:  0/5138174
Latest checkpoint's NextOID:  3574036
Latest checkpoint's NextMultiXactId:  5762623
Latest checkpoint's NextMultiOffset:  11956824
Latest checkpoint's oldestXID:        1673
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  5138174
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint:Tue 12 Nov 2013 11:35:37 AM EST
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline:   0
Backup start location:0/0
Backup end location:  0/0
End-of-backup record required:no
Current wal_level setting:hot_standby
Current max_connections setting:  200
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:   8
Database block size:  8192
Blocks per segment of large relation: 131072
WAL block size:   8192
Bytes per WAL segment:16777216
Maximum length of identifiers:64
Maximum columns in an index:  32
Maximum size of a TOAST chunk:1996
Date/time type storage:   64-bit integers
Float4 argument passing:  by value
Float8 argument passing:  by value
Data page checksum version:   0


Cheers and thanks for the help all.




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread J Smith
On Tue, Nov 12, 2013 at 11:25 AM, Stephen Frost  wrote:
>
> How was this upgrade done?  If you used pg_upgrade, what version of the
> pg_upgrade code did you use?  As I recall, there was a bug in older
> versions which could exhibit in this way..
>
> http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix
>

As mentioned in the original post...

"the new database cluster was created by dumping global settings
using pg_dumpall and then dumping each database individually."

We haven't been able to use pg_upgrade as we rely heavily on PostGIS
and do hard upgrades via pg_dump and postgis_restore.pl when we
upgrade.

Cheers




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread Andres Freund
On 2013-11-12 11:25:03 -0500, Stephen Frost wrote:
> * J Smith (dark.panda+li...@gmail.com) wrote:
> > I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> > 9.3.1 and have started getting the following errors every couple of
> > hours along with some failed transactions.
> 
> How was this upgrade done?  If you used pg_upgrade, what version of the
> pg_upgrade code did you use?  As I recall, there was a bug in older
> versions which could exhibit in this way..

He referred to using pg_dumpall/pg_dump. But that bug was erroring out
on pg_clog, not pg_subtrans, right?

Getting such errors on pg_subtrans indicates corruption or a bug
somewhere, we don't do them if the to-be-looked up xid is older than
TransactionXmin and we zero pg_subtrans at restarts.

My gut feeling is that it's related to foreign key locks doing
XactLockTableWait()s on too old transactions or something like that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread Stephen Frost
* J Smith (dark.panda+li...@gmail.com) wrote:
> I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> 9.3.1 and have started getting the following errors every couple of
> hours along with some failed transactions.

How was this upgrade done?  If you used pg_upgrade, what version of the
pg_upgrade code did you use?  As I recall, there was a bug in older
versions which could exhibit in this way..

http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix

Thanks,

Stephen




Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3

2013-11-12 Thread Andres Freund
Hi,

On 2013-11-12 10:56:55 -0500, J Smith wrote:
> G'day list. Didn't get any interest in pgsql-general, thought I'd try
> my luck here, which perhaps would be more fitting in case I've
> stumbled upon an edge case issue or something...

Normally the bug report form/the -bugs mailing list is the right one for
things like this.

> ERROR could not access status of transaction 4179979
> Could not open file "pg_subtrans/003F": No such file or directory.

A couple of questions:
* Did you have any OS, hardware crashes?
* Do you run with fsync=off or full_page_writes=off?
* Could you post the pg_controldata output?
* Is it always the same xid?
* Does SELECT count(*) FROM pg_prepared_xacts; return 0?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

