Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-10 Thread Craig Ringer
On 10 August 2017 at 23:25, Robert Haas  wrote:

> On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer 
> wrote:
> > I think so - specifically, that it's a leftover from a revision where the
> > xid limit was advanced before clog truncation.
> >
> > I'll be finding time in the next couple of days to look more closely and
> > ensure that's all it is.
>
> A couple of days having gone by without hearing further on this, I'm
> going to assume for the moment that my analysis is correct and just
> push a commit to remove this assertion.  That's not intended to
> discourage you from spending further time on this - it would be bad to
> be wrong about this - but doing something we all agree is wrong can't
> be better than doing something that I, at least, think is correct.
>


Thanks.

Keeps getting bumped down by other things, but it's on the stack to not get
lost


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-10 Thread Robert Haas
On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer  wrote:
> I think so - specifically, that it's a leftover from a revision where the
> xid limit was advanced before clog truncation.
>
> I'll be finding time in the next couple of days to look more closely and
> ensure that's all it is.

A couple of days having gone by without hearing further on this, I'm
going to assume for the moment that my analysis is correct and just
push a commit to remove this assertion.  That's not intended to
discourage you from spending further time on this - it would be bad to
be wrong about this - but doing something we all agree is wrong can't
be better than doing something that I, at least, think is correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-07 Thread Noah Misch
On Mon, Aug 07, 2017 at 05:29:34PM +1200, Thomas Munro wrote:
> On Thu, Aug 3, 2017 at 3:03 AM, Robert Haas  wrote:
> > On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
> >  wrote:
> >> Thanks Neha.  It's be best to post the back trace and if possible
> >> print oldestXact and ShmemVariableCache->oldestXid from the stack
> >> frame for TruncateCLOG.
> >>
> >> The failing assertion in TruncateCLOG() has a comment that says
> >> "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> >> calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> >> *after* it calls TruncateCLOG().  What am I missing here?
> >
> > This problem was introduced by commit
> > ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
> > the PostgreSQL 10 open items list. That commit intended to introduce a
> > distinction between (1) the oldest XID that can be safely examined and
> > (2) the oldest XID that can't yet be safely reused.  These are the
> > same except when we're in the middle of truncating CLOG: (1) advances
> > before the truncation, and (2) advances afterwards. That's why
> > AdvanceOldestClogXid() happens before truncation proper and
> > SetTransactionIdLimit() happens afterwards, and changing the order
> > would, I think, be quite wrong.
> 
> Added to open items.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-07 Thread Craig Ringer
On 7 August 2017 at 14:04, Thomas Munro 
wrote:

> On Fri, Jul 21, 2017 at 7:17 PM, Kyotaro HORIGUCHI
>  wrote:
> > In vac_truncate_clog, TruncateCLOG is called before
> > SetTransactionIdLimit, which advances
> > ShmemVariableCache->oldestXid. Given that the assertion in
> > TruncateCLOG is valid, they should be called in reverse order. I
> > suppose that CLOG files can be safely truncated after advancing
> > XID limits.
>
> If we keep the assertion by changing the order of changes to match the
> comment like this, then don't we still have a problem if another
> backend moves it backwards because of the data race I mentioned?  That
> too could be fixed (perhaps by teaching SetTransactionIdLimit not to
> overwrite higher values), but it sounds like the assertion might be a
> mistake.
> 
>

I think so - specifically, that it's a leftover from a revision where the
xid limit was advanced before clog truncation.

I'll be finding time in the next couple of days to look more closely and
ensure that's all it is.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-07 Thread Thomas Munro
On Fri, Jul 21, 2017 at 7:17 PM, Kyotaro HORIGUCHI
 wrote:
> In vac_truncate_clog, TruncateCLOG is called before
> SetTransactionIdLimit, which advances
> ShmemVariableCache->oldestXid. Given that the assertion in
> TruncateCLOG is valid, they should be called in reverse order. I
> suppose that CLOG files can be safely truncated after advancing
> XID limits.

If we keep the assertion by changing the order of changes to match the
comment like this, then don't we still have a problem if another
backend moves it backwards because of the data race I mentioned?  That
too could be fixed (perhaps by teaching SetTransactionIdLimit not to
overwrite higher values), but it sounds like the assertion might be a
mistake.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-06 Thread Thomas Munro
On Thu, Aug 3, 2017 at 3:03 AM, Robert Haas  wrote:
> On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
>  wrote:
>> Thanks Neha.  It's be best to post the back trace and if possible
>> print oldestXact and ShmemVariableCache->oldestXid from the stack
>> frame for TruncateCLOG.
>>
>> The failing assertion in TruncateCLOG() has a comment that says
>> "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
>> calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
>> *after* it calls TruncateCLOG().  What am I missing here?
>
> This problem was introduced by commit
> ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
> the PostgreSQL 10 open items list. That commit intended to introduce a
> distinction between (1) the oldest XID that can be safely examined and
> (2) the oldest XID that can't yet be safely reused.  These are the
> same except when we're in the middle of truncating CLOG: (1) advances
> before the truncation, and (2) advances afterwards. That's why
> AdvanceOldestClogXid() happens before truncation proper and
> SetTransactionIdLimit() happens afterwards, and changing the order
> would, I think, be quite wrong.

Added to open items.

> AFAICS, that assertion is simply a holdover from an earlier version of
> the patch that escaped review.  There's just no reason to suppose that
> it's true.

Craig, are you planning to post a patch to remove the assertion, or
make it hold?

>> What actually prevents ShmemVariableCache->oldestXid from going
>> backwards anyway?  Suppose there are two or more autovacuum processes
>> that reach vac_truncate_clog() concurrently.  They do a scan of
>> pg_database whose tuples they access without locking through a
>> pointer-to-volatile because they expect concurrent in-place writers,
>> come up with a value for frozenXID, and then arrive at
>> SetTransactionIdLimit() in whatever order and clobber
>> ShmemVariableCache->oldestXid.  What am I missing here?
>
> Hmm, there could be a bug there, but I don't think it's *this* bug.

I'm not sure that it wrong per se, as long as no one asserts that the
number can't go backwards...

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-08-02 Thread Robert Haas
On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
 wrote:
> Thanks Neha.  It's be best to post the back trace and if possible
> print oldestXact and ShmemVariableCache->oldestXid from the stack
> frame for TruncateCLOG.
>
> The failing assertion in TruncateCLOG() has a comment that says
> "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> *after* it calls TruncateCLOG().  What am I missing here?

This problem was introduced by commit
ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
the PostgreSQL 10 open items list. That commit intended to introduce a
distinction between (1) the oldest XID that can be safely examined and
(2) the oldest XID that can't yet be safely reused.  These are the
same except when we're in the middle of truncating CLOG: (1) advances
before the truncation, and (2) advances afterwards. That's why
AdvanceOldestClogXid() happens before truncation proper and
SetTransactionIdLimit() happens afterwards, and changing the order
would, I think, be quite wrong.

AFAICS, that assertion is simply a holdover from an earlier version of
the patch that escaped review.  There's just no reason to suppose that
it's true.

> What actually prevents ShmemVariableCache->oldestXid from going
> backwards anyway?  Suppose there are two or more autovacuum processes
> that reach vac_truncate_clog() concurrently.  They do a scan of
> pg_database whose tuples they access without locking through a
> pointer-to-volatile because they expect concurrent in-place writers,
> come up with a value for frozenXID, and then arrive at
> SetTransactionIdLimit() in whatever order and clobber
> ShmemVariableCache->oldestXid.  What am I missing here?

Hmm, there could be a bug there, but I don't think it's *this* bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-21 Thread Kyotaro HORIGUCHI
At Fri, 21 Jul 2017 11:39:38 +0530, Neha Sharma  
wrote in 
> Here is the back trace from the core dump attached.
> 
> (gdb) bt
> #0  0x7f4a71424495 in raise () from /lib64/libc.so.6
> #1  0x7f4a71425c75 in abort () from /lib64/libc.so.6
> #2  0x009dc18a in ExceptionalCondition (conditionName=0xa905d0
> "!(TransactionIdPrecedesOrEquals(oldestXact,
> ShmemVariableCache->oldestXid))",
> errorType=0xa9044f "FailedAssertion", fileName=0xa90448 "clog.c",
> lineNumber=683) at assert.c:54
> #3  0x00524215 in TruncateCLOG (oldestXact=150036635,
> oldestxid_datoid=13164) at clog.c:682

In vac_truncate_clog, TruncateCLOG is called before
SetTransactionIdLimit, which advances
ShmemVariableCache->oldestXid. Given that the assertion in
TruncateCLOG is valid, they should be called in reverse order. I
suppose that CLOG files can be safely truncated after advancing
XID limits.

By the way, the attached patch is made by "git diff --patience".

filterdiff converts it into somewhat wrong shape. Specifically,
the result is missing the addition part of the difference, as the
second attached patch. I'm not sure which of git(2.9.2) or
filterdiff (0.3.3), (or me?) is doing wrong..


> #4  0x006a6be8 in vac_truncate_clog (frozenXID=150036635,
> minMulti=1, lastSaneFrozenXid=200562449, lastSaneMinMulti=1) at
> vacuum.c:1197
> #5  0x006a6948 in vac_update_datfrozenxid () at vacuum.c:1063
> #6  0x007ce0a2 in do_autovacuum () at autovacuum.c:2625
> #7  0x007cc987 in AutoVacWorkerMain (argc=0, argv=0x0) at
> autovacuum.c:1715
> #8  0x007cc562 in StartAutoVacWorker () at autovacuum.c:1512
> #9  0x007e2acd in StartAutovacuumWorker () at postmaster.c:5414
> #10 0x007e257e in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5111
> #11 
> #12 0x7f4a714d3603 in __select_nocancel () from /lib64/libc.so.6
> #13 0x007dde88 in ServerLoop () at postmaster.c:1717
> #14 0x007dd67d in PostmasterMain (argc=3, argv=0x2eb8b00) at
> postmaster.c:1361
> #15 0x0071a218 in main (argc=3, argv=0x2eb8b00) at main.c:228
> (gdb) print ShmemVariableCache->oldestXid
> $3 = 548
> 
> 
> Regards,
> Neha Sharma
> 
> On Fri, Jul 21, 2017 at 11:01 AM, Thomas Munro <
> thomas.mu...@enterprisedb.com> wrote:
> 
> > On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
> >  wrote:
> > >
> > > Attached is the core dump file received on PG 10beta2 version.
> >
> > Thanks Neha.  It's be best to post the back trace and if possible
> > print oldestXact and ShmemVariableCache->oldestXid from the stack
> > frame for TruncateCLOG.
> >
> > The failing assertion in TruncateCLOG() has a comment that says
> > "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> > calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> > *after* it calls TruncateCLOG().  What am I missing here?
> >
> > What actually prevents ShmemVariableCache->oldestXid from going
> > backwards anyway?  Suppose there are two or more autovacuum processes
> > that reach vac_truncate_clog() concurrently.  They do a scan of
> > pg_database whose tuples they access without locking through a
> > pointer-to-volatile because they expect concurrent in-place writers,
> > come up with a value for frozenXID, and then arrive at
> > SetTransactionIdLimit() in whatever order and clobber
> > ShmemVariableCache->oldestXid.  What am I missing here?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com
> >

-- 
堀口恭太郎

日本電信電話株式会社 NTTオープンソースソフトウェアセンタ
Phone: 03-5860-5115 / Fax: 03-5463-5490
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa1812..cd8be92 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1192,13 +1192,6 @@ vac_truncate_clog(TransactionId frozenXID,
 	AdvanceOldestCommitTsXid(frozenXID);
 
 	/*
-	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
-	 */
-	TruncateCLOG(frozenXID, oldestxid_datoid);
-	TruncateCommitTs(frozenXID);
-	TruncateMultiXact(minMulti, minmulti_datoid);
-
-	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
 	 * MultiXactIds.  Note: these functions will also signal the postmaster
 	 * for an(other) autovac cycle if needed.   XXX should we avoid possibly
@@ -1206,6 +1199,14 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	SetTransactionIdLimit(frozenXID, oldestxid_datoid);
 	SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+
+	/*
+	 * Truncate CLOG, multixact and CommitTs to the oldest computed value
+	 * after advancing xid limits.
+	 */
+	TruncateCLOG(frozenXID, oldestxid_datoid);
+	TruncateCommitTs(frozenXID);
+	TruncateMultiXact(minMulti, minmulti_datoid);
 }
 
 
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
***
*** 1192,1204  vac_truncate_clog(TransactionId 

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-21 Thread Neha Sharma
Here is the back trace from the core dump attached.

(gdb) bt
#0  0x7f4a71424495 in raise () from /lib64/libc.so.6
#1  0x7f4a71425c75 in abort () from /lib64/libc.so.6
#2  0x009dc18a in ExceptionalCondition (conditionName=0xa905d0
"!(TransactionIdPrecedesOrEquals(oldestXact,
ShmemVariableCache->oldestXid))",
errorType=0xa9044f "FailedAssertion", fileName=0xa90448 "clog.c",
lineNumber=683) at assert.c:54
#3  0x00524215 in TruncateCLOG (oldestXact=150036635,
oldestxid_datoid=13164) at clog.c:682
#4  0x006a6be8 in vac_truncate_clog (frozenXID=150036635,
minMulti=1, lastSaneFrozenXid=200562449, lastSaneMinMulti=1) at
vacuum.c:1197
#5  0x006a6948 in vac_update_datfrozenxid () at vacuum.c:1063
#6  0x007ce0a2 in do_autovacuum () at autovacuum.c:2625
#7  0x007cc987 in AutoVacWorkerMain (argc=0, argv=0x0) at
autovacuum.c:1715
#8  0x007cc562 in StartAutoVacWorker () at autovacuum.c:1512
#9  0x007e2acd in StartAutovacuumWorker () at postmaster.c:5414
#10 0x007e257e in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5111
#11 
#12 0x7f4a714d3603 in __select_nocancel () from /lib64/libc.so.6
#13 0x007dde88 in ServerLoop () at postmaster.c:1717
#14 0x007dd67d in PostmasterMain (argc=3, argv=0x2eb8b00) at
postmaster.c:1361
#15 0x0071a218 in main (argc=3, argv=0x2eb8b00) at main.c:228
(gdb) print ShmemVariableCache->oldestXid
$3 = 548


Regards,
Neha Sharma

On Fri, Jul 21, 2017 at 11:01 AM, Thomas Munro <
thomas.mu...@enterprisedb.com> wrote:

> On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
>  wrote:
> >
> > Attached is the core dump file received on PG 10beta2 version.
>
> Thanks Neha.  It's be best to post the back trace and if possible
> print oldestXact and ShmemVariableCache->oldestXid from the stack
> frame for TruncateCLOG.
>
> The failing assertion in TruncateCLOG() has a comment that says
> "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> *after* it calls TruncateCLOG().  What am I missing here?
>
> What actually prevents ShmemVariableCache->oldestXid from going
> backwards anyway?  Suppose there are two or more autovacuum processes
> that reach vac_truncate_clog() concurrently.  They do a scan of
> pg_database whose tuples they access without locking through a
> pointer-to-volatile because they expect concurrent in-place writers,
> come up with a value for frozenXID, and then arrive at
> SetTransactionIdLimit() in whatever order and clobber
> ShmemVariableCache->oldestXid.  What am I missing here?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Thomas Munro
On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
 wrote:
>
> Attached is the core dump file received on PG 10beta2 version.

Thanks Neha.  It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG().  What am I missing here?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway?  Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently.  They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid.  What am I missing here?

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Neha Sharma
Hi,

Attached is the core dump file received on PG 10beta2 version.
​
 myfile.tgz

​

Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 2:45 PM, Neha Sharma 
wrote:

>
>
> Regards,
> Neha Sharma
>
> On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer 
> wrote:
>
>> On 20 July 2017 at 15:00, Neha Sharma 
>> wrote:
>>
>>> Hi Craig,
>>>
>>> I had done a fresh initdb,the default parameter configuration was used.
>>> I was setting few set of parameters while startup by the below command.
>>>
>>> ./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
>>> min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
>>> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>>>
>>> Now I have modified the script a bit with Robert's suggestion as below.
>>> Instead of starting it with postgres binary i have set it in conf file and
>>> starting the server with pg_ctl. I am waiting for the results,once the core
>>> dump is generated will share the details.
>>>
>>
>> Thanks.
>>
>> To verify that you do get a coredump, you might want to consider sending
>> a kill -SEGV to a backend and make sure that it actually dumps core and you
>> can find the core.
>>
>> Ideally you'd actually set the coredumps to include shmem (see
>> coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html),
>> but with 8GB shared_buffers that may not be practical. It'd be very useful
>> if possible.
>>
>> If this is wraparound-related, as it appears to be, you might get faster
>> results by using a custom pgbench script for one or more workers that just
>> runs txid_current() a whole lot. Or jump the server's xid space forward.
>>
> Thanks. Will put together suggestions to get the result.
>
>>
>> I've got a few other things on right now but I'll keep an eye out and
>> hope for a core dump.
>>
>> --
>>  Craig Ringer   http://www.2ndQuadrant.com/
>>  PostgreSQL Development, 24x7 Support, Training & Services
>>
>
>


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Neha Sharma
Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer  wrote:

> On 20 July 2017 at 15:00, Neha Sharma 
> wrote:
>
>> Hi Craig,
>>
>> I had done a fresh initdb,the default parameter configuration was used. I
>> was setting few set of parameters while startup by the below command.
>>
>> ./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
>> min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
>> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>>
>> Now I have modified the script a bit with Robert's suggestion as below.
>> Instead of starting it with postgres binary i have set it in conf file and
>> starting the server with pg_ctl. I am waiting for the results,once the core
>> dump is generated will share the details.
>>
>
> Thanks.
>
> To verify that you do get a coredump, you might want to consider sending a
> kill -SEGV to a backend and make sure that it actually dumps core and you
> can find the core.
>
> Ideally you'd actually set the coredumps to include shmem (see
> coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but
> with 8GB shared_buffers that may not be practical. It'd be very useful if
> possible.
>
> If this is wraparound-related, as it appears to be, you might get faster
> results by using a custom pgbench script for one or more workers that just
> runs txid_current() a whole lot. Or jump the server's xid space forward.
>
Thanks. Will put together suggestions to get the result.

>
> I've got a few other things on right now but I'll keep an eye out and hope
> for a core dump.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Craig Ringer
On 20 July 2017 at 15:00, Neha Sharma  wrote:

> Hi Craig,
>
> I had done a fresh initdb,the default parameter configuration was used. I
> was setting few set of parameters while startup by the below command.
>
> ./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
> min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>
> Now I have modified the script a bit with Robert's suggestion as below.
> Instead of starting it with postgres binary i have set it in conf file and
> starting the server with pg_ctl. I am waiting for the results,once the core
> dump is generated will share the details.
>

Thanks.

To verify that you do get a coredump, you might want to consider sending a
kill -SEGV to a backend and make sure that it actually dumps core and you
can find the core.

Ideally you'd actually set the coredumps to include shmem (see
coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but
with 8GB shared_buffers that may not be practical. It'd be very useful if
possible.

If this is wraparound-related, as it appears to be, you might get faster
results by using a custom pgbench script for one or more workers that just
runs txid_current() a whole lot. Or jump the server's xid space forward.

I've got a few other things on right now but I'll keep an eye out and hope
for a core dump.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Neha Sharma
Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I
was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below.
Instead of starting it with postgres binary i have set it in conf file and
starting the server with pg_ctl. I am waiting for the results,once the core
dump is generated will share the details.

Test code Snippet :

#Pre condition:
#Create and initialize the database and set the export PGDATA
export PGDATA=/home/centos/PG_sources/postgresql/inst/bin/data
export PGPORT=5432
export
LD_LIBRARY_PATH=/home/centos/PG_sources/postgresql/inst/lib:$LD_LIBRARY_PATH
#for i in "scale_factor shared_buffers time_for_readings no_of_readings
orig_or_patch"
for i in "300 8GB 1800 3"
do
scale_factor=`echo $i | cut -d" " -f1`
shared_bufs=`echo $i | cut -d" " -f2`
time_for_reading=`echo $i | cut -d" " -f3`
no_of_readings=`echo $i | cut -d" " -f4`

# ---

echo "Start of script for $scale_factor $shared_bufs " >>
/home/centos/test_results.txt


echo "== $run_bin =" >>
/home/centos/test_results.txt

for threads in 1 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
#for threads in 8 16
do
#Start taking reading
for ((readcnt = 1 ; readcnt <= $no_of_readings ; readcnt++))
do
echo ""  >>
/home/centos/test_results.txt
echo $scale_factor, $shared_bufs, $threads, $threads,
$time_for_reading Reading - ${readcnt}  >> /home/centos/test_results.txt
#start server
./pg_ctl -D data -c -l logfile start
#./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
sleep 5
#drop and recreate database
./dropdb test
./createdb test
#initialize database
./pgbench -i -s $scale_factor test
sleep 5
# Run pgbench
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared
test  >> /home/centos/test_results.txt
sleep 10
./psql -d test -c "checkpoint" >> /home/centos/test_results.txt
./pg_ctl stop
done;
done;

sleep 1


mv /home/centos/test_results.txt
/home/centos/test_results_list_${scale_factor}_${shared_bufs}_rw.txt
done;



Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 11:23 AM, Craig Ringer 
wrote:

> On 19 July 2017 at 20:26, Neha Sharma 
>> wrote:
>>
>>> Hi,
>>>
>>> I am getting FailedAssertion while executing the attached
>>> script.However,I am not able to produce the core dump for the same,the
>>> script runs in background and takes around a day time to produce the
>>> mentioned error.
>>>
>>> "TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
>>> ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
>>> 2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was
>>> terminated by signal 6: Aborted
>>> 2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running:
>>> autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"
>>>
>>
>>
> What are the starting conditions of your postgres instance? Does your
> script assume a newly initdb'd instance with no custom configuration? If
> not, what setup steps/configuration precede your script run?
>
>
>
>
>
>> well short of the 2-million mark.
>>
>
> Er, billion.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-19 Thread Craig Ringer
>
> On 19 July 2017 at 20:26, Neha Sharma 
> wrote:
>
>> Hi,
>>
>> I am getting FailedAssertion while executing the attached
>> script.However,I am not able to produce the core dump for the same,the
>> script runs in background and takes around a day time to produce the
>> mentioned error.
>>
>> "TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
>> ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
>> 2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was
>> terminated by signal 6: Aborted
>> 2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running:
>> autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"
>>
>
>
What are the starting conditions of your postgres instance? Does your
script assume a newly initdb'd instance with no custom configuration? If
not, what setup steps/configuration precede your script run?





> well short of the 2-million mark.
>

Er, billion.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-19 Thread Craig Ringer
On 19 July 2017 at 20:26, Neha Sharma  wrote:

> Hi,
>
> I am getting FailedAssertion while executing the attached script.However,I
> am not able to produce the core dump for the same,the script runs in
> background and takes around a day time to produce the mentioned error.
>
> "TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
> ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
> 2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was
> terminated by signal 6: Aborted
> 2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running:
> autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"
>

It's not much of a stretch to assume this was likely introduced
by ea42cc18c:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ea42cc18c35381f639d45628d792e790ff39e271

so I'll have a look to see if I can spot how it could happen.

Please show your "SELECT version()" and if using a build from git, the "git
rev-parse --short HEAD" for the build you're running.

It's interesting and probably significant that your Pg was nearing
wraparound. Just a tip for that - if you want to create a situation where
you near wraparound quickly and easily, for testing, you can, on a test
instance you don't care about:

- select txid_current();
- stop pg
- pg_resetxlog -n $(( $the_xid + (1<<31) - 1001000 ))
- try to start pg, note the missing clog segment it complains about
- dd if=/dev/zero bs=1 count=262144 of=datadir/pg_clog/$MISSINGSEGNAME
- start Pg


That should put you about 1000 txn's from the 1 million xid limit, assuming
I got my maths right (don't assume that), and assuming your starting xid is
well short of the 2-million mark.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-19 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Jul 19, 2017 at 8:26 AM, Neha Sharma
>  wrote:
> > I am getting FailedAssertion while executing the attached script.However,I
> > am not able to produce the core dump for the same,the script runs in
> > background and takes around a day time to produce the mentioned error.
> 
> Maybe you should run it for another day with core dumps enabled (try
> the "-c" option to pg_ctl) and then pull a backtrace if you get one.
> This is a pretty generic script so it's going to be hard to guess
> where the problem is otherwise.

Also, please state server version used.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-19 Thread Robert Haas
On Wed, Jul 19, 2017 at 8:26 AM, Neha Sharma
 wrote:
> I am getting FailedAssertion while executing the attached script.However,I
> am not able to produce the core dump for the same,the script runs in
> background and takes around a day time to produce the mentioned error.

Maybe you should run it for another day with core dumps enabled (try
the "-c" option to pg_ctl) and then pull a backtrace if you get one.
This is a pretty generic script so it's going to be hard to guess
where the problem is otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers