Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-17 Thread Steve Singer

On 11/16/2014 04:49 PM, Steve Singer wrote:



I installed things following the above steps on a different system 
than my usual development laptop and I have been unable to reproduce 
the error so far (on that system).  But I am still able to reproduce 
it on occasion on my normal development laptop.






After continuously running the test I was eventually able to reproduce 
the error on the other system as well.




Greetings,

Andres Freund









--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-17 Thread Andres Freund
On 2014-11-17 10:33:52 -0500, Steve Singer wrote:
 On 11/16/2014 04:49 PM, Steve Singer wrote:
 
 
 I installed things following the above steps on a different system than my
 usual development laptop and I have been unable to reproduce the error so
 far (on that system).  But I am still able to reproduce it on occasion on
 my normal development laptop.

Thanks for the testcase! I'll try to reproduce/fix.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-17 Thread Andres Freund
On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
 On 11/13/2014 02:44 PM, Andres Freund wrote:
 Hi Steve,
 
 I've pushed a fix for a bug that could possibly also cause
 this. Although it'd be odd that it always hits the user catalog
 table. Except if your tests mostly modify the slony tables, but do not
 do much DDL otherwise?
 
 The test I was running doesn't do DDL, so only the user catalog tables would
 have changes being tracked.
 
 I still sometimes get the error. When I get some time on the weekend I'll
 work on some detailed instructions on how to grab and set up the various
 components to replicate this test.
 
 Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure in
 a normal client connection, not the walsender
 
 #3  0x006b4978 in GetSerializableTransactionSnapshotInt (
 snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
 sourcexid=sourcexid@entry=0) at predicate.c:1738
 #4  0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
 at predicate.c:1517
 #5  GetSerializableTransactionSnapshot (
 snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
 #6  0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200
 #7  0x006c0e35 in exec_simple_query (
 query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
 ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
 \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
 coalesce(ev_provider_xid,\...)
 at postgres.c:959
 #8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
 
 
 I have no idea if this has anything to do with your recent changes or some
 other change. I haven't so far been able to replicate that since the first
 time I saw it.

That crash is decidedly odd. Any chance you still have the full
backtrace around?

This is in the SSI code... I'm not immediately seeing how this could be
related to logical decoding. It can't even be an imported snapshot,
because the exported snapshot is marked repeatable read.

Are you actually using serializable transactions? If so, how and why?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-17 Thread Steve Singer

On 11/17/2014 10:37 AM, Andres Freund wrote:

On 2014-11-13 22:23:02 -0500, Steve Singer wrote:


Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure in
a normal client connection, not the walsender

#3  0x006b4978 in GetSerializableTransactionSnapshotInt (
 snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
 sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
 at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
 snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x006c0e35 in exec_simple_query (
 query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
\"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
coalesce(ev_provider_xid,\...)
 at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,


I have no idea if this has anything to do with your recent changes or some
other change. I haven't so far been able to replicate that since the first
time I saw it.
That crash is decidedly odd. Any chance you still have the full
backtrace around?


Yes I still have the core file



This is in the SSI code... I'm not immediately seeing how this could be
related to logical decoding. It can't even be an imported snapshot,
because the exported snapshot is marked repeatable read.

Are you actually using serializable transactions? If so, how and why?


Yes, the test client that performs the 'simulated workload' does some 
serializable transactions.  It connects as a normal client via JDBC and 
does a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") 
and then does some DML. Why? Because it seemed like a good thing to 
include in the test suite.
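Roughly, each such client transaction amounts to the following (a sketch only; the actual DML the test suite runs is not shown here):

```sql
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... some DML against the replicated tables ...
COMMIT;
```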


You're right, this might have nothing to do with logical decoding.  I 
haven't been able to reproduce it again either; I can't even say
 if this problem was introduced to 9.4 in the past month or if it has 
been around much longer and I just haven't hit it before.







Greetings,

Andres Freund







Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-17 Thread Andres Freund
Hi,

Kevin: CCed you, because it doesn't really look like a logical decoding
related issue.

On 2014-11-17 11:25:40 -0500, Steve Singer wrote:
 On 11/17/2014 10:37 AM, Andres Freund wrote:
 On 2014-11-13 22:23:02 -0500, Steve Singer wrote:
 
 
 Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure in
 a normal client connection, not the walsender
 
 #3  0x006b4978 in GetSerializableTransactionSnapshotInt (
  snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
  sourcexid=sourcexid@entry=0) at predicate.c:1738
 #4  0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
  at predicate.c:1517
 #5  GetSerializableTransactionSnapshot (
  snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
 #6  0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200
 #7  0x006c0e35 in exec_simple_query (
  query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
 ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
 \"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
 coalesce(ev_provider_xid,\...)
  at postgres.c:959
 #8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
 
 
 I have no idea if this has anything to do with your recent changes or some
 other change. I haven't so far been able to replicate that since the first
 time I saw it.
 That crash is decidedly odd. Any chance you still have the full
 backtrace around?
 
 Yes I still have the core file

Cool, could you show the full thing? Or did you just snip it because
it's just the Assert/ExceptionalCondition thing?

Could you print *snapshot in frame #3?

 
 This is in the SSI code... I'm not immediately seeing how this could be
 related to logical decoding. It can't even be an imported snapshot,
 because the exported snapshot is marked repeatable read.
 
 Are you actually using serializable transactions? If so, how and why?
 
 Yes, the test client that performs the 'simulated workload' does some
 serializable transactions.  It connects as a normal client via JDBC and does
 a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") and then
 does some DML. Why? Because it seemed like a good thing to include in the
 test suite.

Yes, it's a good reason as the above backtrace proves ;). I'm just
trying to understand under which circumstances it happens.

The above backtrace looks like it can only be triggered if you do a
BEGIN TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; Is that something
you do?

 You're right, this might have nothing to do with logical decoding.  I haven't
 been able to reproduce it again either; I can't even say
 if this problem was introduced to 9.4 in the past month or if it has been
 around much longer and I just haven't hit it before.

It's not hard to imagine that the safe/deferred snapshot code isn't all
that well exercised. I don't think many people use it yet.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-17 Thread Steve Singer

On 11/17/2014 11:34 AM, Andres Freund wrote:

Hi,

Kevin: CCed you, because it doesn't really look like a logical decoding
related issue.

On 2014-11-17 11:25:40 -0500, Steve Singer wrote:

On 11/17/2014 10:37 AM, Andres Freund wrote:

On 2014-11-13 22:23:02 -0500, Steve Singer wrote:


Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure in
a normal client connection, not the walsender

#3  0x006b4978 in GetSerializableTransactionSnapshotInt (
 snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
 sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
 at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
 snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x006c0e35 in exec_simple_query (
 query_string=0x1fd01b8 "select ev_origin, ev_seqno, ev_timestamp,
ev_snapshot, \"pg_catalog\".txid_snapshot_xmin(ev_snapshot),
\"pg_catalog\".txid_snapshot_xmax(ev_snapshot),
coalesce(ev_provider_xid,\...)
 at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,


I have no idea if this has anything to do with your recent changes or some
other change. I haven't so far been able to replicate that since the first
time I saw it.
That crash is decidedly odd. Any chance you still have the full
backtrace around?

Yes I still have the core file

Cool, could you show the full thing? Or did you just snip it because
it's just the Assert/ExceptionalCondition thing?


I just snipped the exception stuff. Here is the full thing

(gdb) backtrace
#0  0x7f6fb22e8295 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f6fb22eb438 in __GI_abort () at abort.c:90
#2  0x007a08e7 in ExceptionalCondition (
conditionName=conditionName@entry=0x918e88 
"!(TransactionIdFollows(snapshot->xmin, PredXact->SxactGlobalXmin))",
errorType=errorType@entry=0x7da390 "FailedAssertion",
fileName=fileName@entry=0x9182c3 "predicate.c",
lineNumber=lineNumber@entry=1738) at assert.c:54
#3  0x006b4978 in GetSerializableTransactionSnapshotInt (
snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x006c0e35 in exec_simple_query (
query_string=0x1fd01b8 "select ev_origin, ev_seqno, 
ev_timestamp,ev_snapshot, 
\"pg_catalog\".txid_snapshot_xmin(ev_snapshot), 
\"pg_catalog\".txid_snapshot_xmax(ev_snapshot), 
coalesce(ev_provider_xid,\...)
at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,
dbname=0x1f5ab30 "test2", username=<optimized out>) at postgres.c:4016
#9  0x0046a08e in BackendRun (port=0x1f83010) at postmaster.c:4123
#10 BackendStartup (port=0x1f83010) at postmaster.c:3797
#11 ServerLoop () at postmaster.c:1576
#12 0x00665eae in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x1f59d00) at postmaster.c:1223
#13 0x0046ab58 in main (argc=3, argv=0x1f59d00) at main.c:227




Could you print *snapshot in frame #3?


(gdb) p *snapshot
$1 = {satisfies = 0x7d0330 <HeapTupleSatisfiesMVCC>, xmin = 412635,
  xmax = 412637, xip = 0x1f86e40, xcnt = 1, subxcnt = 0, subxip = 0x1fd2190,
  suboverflowed = 0 '\000', takenDuringRecovery = 0 '\000', copied = 0 '\000',
  curcid = 0, active_count = 0, regd_count = 0}
(gdb)



This is in the SSI code... I'm not immediately seeing how this could be
related to logical decoding. It can't even be a imported snapshot,
because the exported snapshot is marked repeatable read.

Are you actually using serializable transactions? If so, how and why?

Yes, the test client that performs the 'simulated workload' does some
serializable transactions.  It connects as a normal client via JDBC and does
a prepareStatement("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE") and then
does some DML. Why? Because it seemed like a good thing to include in the
test suite.

Yes, it's a good reason as the above backtrace proves ;). I'm just
trying to understand under which circumstances it happens.

The above backtrace looks like it can only be triggered if you do a
BEGIN TRANSACTION SERIALIZABLE DEFERRABLE READ ONLY; Is that something
you do?


The slony remote listener connection does this when it selects from 
sl_event.  We switched to this shortly after 9.1 came out.
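Concretely, the transaction setup in question looks roughly like this (a sketch, not the exact Slony code; the column list is abbreviated from the query in the backtrace):

```sql
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SELECT ev_origin, ev_seqno, ev_timestamp FROM sl_event;
COMMIT;
```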






Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-16 Thread Steve Singer

On 11/13/2014 02:44 PM, Andres Freund wrote:

Hi Steve,



If it still happens, could you send me instructions of how to reproduce
the problem after cloning the necessary source repositories? It's quite
hard to validate a possible fix otherwise.


1. Install PG 9.4

2. Perform an initdb, then set the following in postgresql.conf:
   max_connections = 200
   wal_level = logical
   max_wal_senders = 50
   wal_keep_segments = 100
   wal_sender_timeout = 600s
   max_replication_slots = 120
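With wal_level = logical and replication slots enabled, a quick way to confirm the settings took effect (independent of Slony; the slot name here is just for illustration, using the test_decoding plugin that ships with 9.4) is:

```sql
-- create a throwaway logical slot
SELECT * FROM pg_create_logical_replication_slot('sanity_slot', 'test_decoding');
-- consume any pending changes
SELECT * FROM pg_logical_slot_get_changes('sanity_slot', NULL, NULL);
-- clean up
SELECT pg_drop_replication_slot('sanity_slot');
```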



Create a superuser named slon and a regular user named test
(set passwords for them; you'll use them later)

Configure pg_hba.conf to allow slon to connect as a replication user


3. Download slony from github (https://github.com/ssinger/slony1-engine.git)
   check out the branch logical_remote_helper
./configure --with-pgconfigdir=/path_to_your_pgcbindir
make
make install

4. Download and compile clustertest from 
(https://github.com/clustertest/clustertest-framework).

5. In the slony source tree cd to the clustertest directory
6. cp conf/disorder.properties.sample to conf/disorder.properties and
edit the file to have the correct paths, ports, users, and passwords.

7. cp conf/java.properties.sample to conf/java.properties and edit the 
file to point at the JAR you downloaded earlier.


I run with all 5 databases on the same PG cluster. Performance will be 
much better with 5 different clusters, but I recommend emulating my 
configuration as closely as possible to replicate this.


To run the tests then do
./run_all_disorder_tests.sh


At the moment (commit df5acfd1a3) it is configured to run just the Failover 
node test cases where I am seeing this, not the entire suite.


It typically takes between 30 minutes and an hour to run, though.

I installed things following the above steps on a different system than 
my usual development laptop and I have been unable to reproduce the 
error so far (on that system).  But I am still able to reproduce it on 
occasion on my normal development laptop.





Greetings,

Andres Freund







Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-13 Thread Andres Freund
Hi Steve,

On 2014-10-28 22:38:29 -0400, Steve Singer wrote:
 On 10/28/2014 01:31 PM, Andres Freund wrote:
 On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
 My logical decoding plugin is occasionally getting this error:
 
 could not resolve cmin/cmax of catalog tuple
 
 I get this when my output plugin is trying to read one of the user-defined
 catalog tables (user_catalog_table=true)
 Hm. That should obviously not happen.

I've pushed a fix for a bug that could possibly also cause
this. Although it'd be odd that it always hits the user catalog
table. Except if your tests mostly modify the slony tables, but do not
do much DDL otherwise?

 I'll send you tar of the data directory off list with things in this state.
 
 Do you have a testcase that would allow me to easily reproduce the
 problem?
 
 I don't have an isolated test case that does this.  The test that I'm hitting
 this with does lots of stuff and doesn't even always hit this.

If it still happens, could you send me instructions of how to reproduce
the problem after cloning the necessary source repositories? It's quite
hard to validate a possible fix otherwise.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding - reading a user catalog table

2014-11-13 Thread Steve Singer

On 11/13/2014 02:44 PM, Andres Freund wrote:

Hi Steve,



I've pushed a fix for a bug that could possibly also cause
this. Although it'd be odd that it always hits the user catalog
table. Except if your tests mostly modify the slony tables, but do not
do much DDL otherwise?


The test I was running doesn't do DDL, so only the user catalog tables 
would have changes being tracked.


I still sometimes get the error. When I get some time on the weekend I'll 
work on some detailed instructions on how to grab and set up the various 
components to replicate this test.


Also since updating (to 2c267e47afa4f9a7c) I've seen an assertion failure 
in a normal client connection, not the walsender


#3  0x006b4978 in GetSerializableTransactionSnapshotInt (
snapshot=snapshot@entry=0xbfa8a0 <CurrentSnapshotData>,
sourcexid=sourcexid@entry=0) at predicate.c:1738
#4  0x006b66c3 in GetSafeSnapshot (origSnapshot=<optimized out>)
at predicate.c:1517
#5  GetSerializableTransactionSnapshot (
snapshot=0xbfa8a0 <CurrentSnapshotData>) at predicate.c:1598
#6  0x007d16dd in GetTransactionSnapshot () at snapmgr.c:200
#7  0x006c0e35 in exec_simple_query (
query_string=0x1fd01b8 "select ev_origin, ev_seqno, 
ev_timestamp,ev_snapshot, 
\"pg_catalog\".txid_snapshot_xmin(ev_snapshot), 
\"pg_catalog\".txid_snapshot_xmax(ev_snapshot), 
coalesce(ev_provider_xid,\...)
at postgres.c:959
#8  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1f5ab50,


I have no idea if this has anything to do with your recent changes or 
some other change. I haven't so far been able to replicate that since 
the first time I saw it.





I'll send you tar of the data directory off list with things in this state.


Do you have a testcase that would allow me to easily reproduce the
problem?

I don't have an isolated test case that does this.  The test that I'm hitting
this with does lots of stuff and doesn't even always hit this.

If it still happens, could you send me instructions of how to reproduce
the problem after cloning the necessary source repositories? It's quite
hard to validate a possible fix otherwise.

Greetings,

Andres Freund







Re: [HACKERS] logical decoding - reading a user catalog table

2014-10-28 Thread Andres Freund
On 2014-10-25 18:18:07 -0400, Steve Singer wrote:
 My logical decoding plugin is occasionally getting this error:
 
 could not resolve cmin/cmax of catalog tuple
 
 I get this when my output plugin is trying to read one of the user-defined
 catalog tables (user_catalog_table=true)

Hm. That should obviously not happen.

Could you describe how that table is modified? Does that bug happen
initially, or only after a while?

Do you have a testcase that would allow me to easily reproduce the
problem?

 I am not sure if this is a bug in the time-travel support in the logical
 decoding support or if I'm just using it wrong (i.e. not getting a sufficient
 lock on the relation or something).

I don't know yet...

 This is the interesting part of the stack trace
 
 #4  0x0091bbc8 in HeapTupleSatisfiesHistoricMVCC
 (htup=0x7fffcf42a900,
 snapshot=0x7f786ffe92d8, buffer=10568) at tqual.c:1631
 #5  0x004aedf3 in heapgetpage (scan=0x28d7080, page=0) at
 heapam.c:399
 #6  0x004b0182 in heapgettup_pagemode (scan=0x28d7080,
 dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:747
 #7  0x004b1ba6 in heap_getnext (scan=0x28d7080,
 direction=ForwardScanDirection) at heapam.c:1475
 #8  0x7f787002dbfb in lookupSlonyInfo (tableOid=91754, ctx=0x2826118,
 origin_id=0x7fffcf42ab8c, table_id=0x7fffcf42ab88,
 set_id=0x7fffcf42ab84)
 at slony_logical.c:663
 #9  0x7f787002b7a3 in pg_decode_change (ctx=0x2826118, txn=0x28cbec0,
 relation=0x7f787a3446a8, change=0x7f786ffe3268) at slony_logical.c:237
 #10 0x007497d4 in change_cb_wrapper (cache=0x28cbda8, txn=0x28cbec0,
 relation=0x7f787a3446a8, change=0x7f786ffe3268) at logical.c:704
 
 
 
 Here is what the code in lookupSlonyInfo is doing
 --
 
   sltable_oid = get_relname_relid("sl_table",slony_namespace);
 
   sltable_rel = relation_open(sltable_oid,AccessShareLock);
   tupdesc=RelationGetDescr(sltable_rel);
   scandesc=heap_beginscan(sltable_rel,
 GetCatalogSnapshot(sltable_oid),0,NULL);
   reloid_attnum = get_attnum(sltable_oid,"tab_reloid");
 
   if(reloid_attnum == InvalidAttrNumber)
 elog(ERROR,"sl_table does not have a tab_reloid column");
   set_attnum = get_attnum(sltable_oid,"tab_set");
 
   if(set_attnum == InvalidAttrNumber)
 elog(ERROR,"sl_table does not have a tab_set column");
   tableid_attnum = get_attnum(sltable_oid, "tab_id");
 
   if(tableid_attnum == InvalidAttrNumber)
 elog(ERROR,"sl_table does not have a tab_id column");
 
   while( (tuple = heap_getnext(scandesc,ForwardScanDirection) ))

(Except missing spaces ;)) I don't see anything obviously wrong with
this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical decoding - reading a user catalog table

2014-10-28 Thread Steve Singer

On 10/28/2014 01:31 PM, Andres Freund wrote:

On 2014-10-25 18:18:07 -0400, Steve Singer wrote:

My logical decoding plugin is occasionally getting this error:

could not resolve cmin/cmax of catalog tuple

I get this when my output plugin is trying to read one of the user-defined
catalog tables (user_catalog_table=true)

Hm. That should obviously not happen.

Could you describe how that table is modified? Does that bug happen
initially, or only after a while?


It doesn't happen right away; in this case it was maybe 4 minutes after 
creating the slot.
The error also doesn't always happen when I run this test workload, 
but it is reproducible with some trying.
I don't do anything special to that table: it gets created, then I do 
inserts on it.  I don't do an ALTER TABLE or anything fancy like that.  
I was running the slony failover test (all nodes under the same 
postmaster), which involves the occasional dropping and recreating of 
databases along with normal query load + replication.
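For reference, declaring such a table is just a plain CREATE TABLE with the user_catalog_table storage parameter; this is a hypothetical sketch (the table and column names below mirror the ones in the lookupSlonyInfo snippet, not the actual Slony DDL):

```sql
-- illustrative only: a user-defined catalog table readable from an
-- output plugin via time travel
CREATE TABLE sl_table (
    tab_id     int4 PRIMARY KEY,
    tab_reloid oid,
    tab_set    int4
) WITH (user_catalog_table = true);
```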


I'll send you tar of the data directory off list with things in this state.


Do you have a testcase that would allow me to easily reproduce the
problem?


I don't have an isolated test case that does this.  The test that I'm 
hitting this with does lots of stuff and doesn't even always hit this.



I am not sure if this is a bug in the time-travel support in the logical
decoding support or if I'm just using it wrong (i.e. not getting a sufficient
lock on the relation or something).

I don't know yet...


This is the interesting part of the stack trace

#4  0x0091bbc8 in HeapTupleSatisfiesHistoricMVCC
(htup=0x7fffcf42a900,
 snapshot=0x7f786ffe92d8, buffer=10568) at tqual.c:1631
#5  0x004aedf3 in heapgetpage (scan=0x28d7080, page=0) at
heapam.c:399
#6  0x004b0182 in heapgettup_pagemode (scan=0x28d7080,
 dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:747
#7  0x004b1ba6 in heap_getnext (scan=0x28d7080,
 direction=ForwardScanDirection) at heapam.c:1475
#8  0x7f787002dbfb in lookupSlonyInfo (tableOid=91754, ctx=0x2826118,
 origin_id=0x7fffcf42ab8c, table_id=0x7fffcf42ab88,
set_id=0x7fffcf42ab84)
 at slony_logical.c:663
#9  0x7f787002b7a3 in pg_decode_change (ctx=0x2826118, txn=0x28cbec0,
 relation=0x7f787a3446a8, change=0x7f786ffe3268) at slony_logical.c:237
#10 0x007497d4 in change_cb_wrapper (cache=0x28cbda8, txn=0x28cbec0,
 relation=0x7f787a3446a8, change=0x7f786ffe3268) at logical.c:704



Here is what the code in lookupSlonyInfo is doing
--

   sltable_oid = get_relname_relid("sl_table",slony_namespace);

   sltable_rel = relation_open(sltable_oid,AccessShareLock);
   tupdesc=RelationGetDescr(sltable_rel);
   scandesc=heap_beginscan(sltable_rel,
GetCatalogSnapshot(sltable_oid),0,NULL);
   reloid_attnum = get_attnum(sltable_oid,"tab_reloid");

   if(reloid_attnum == InvalidAttrNumber)
  elog(ERROR,"sl_table does not have a tab_reloid column");
   set_attnum = get_attnum(sltable_oid,"tab_set");

   if(set_attnum == InvalidAttrNumber)
  elog(ERROR,"sl_table does not have a tab_set column");
   tableid_attnum = get_attnum(sltable_oid, "tab_id");

   if(tableid_attnum == InvalidAttrNumber)
  elog(ERROR,"sl_table does not have a tab_id column");

   while( (tuple = heap_getnext(scandesc,ForwardScanDirection) ))

(Except missing spaces ;)) I don't see anything obviously wrong with
this.

Greetings,

Andres Freund




