Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-10 Thread Noah Misch
On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote:
> On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch  wrote:
> > postmaster algorithms rely on the PG_SETMASK() calls preventing that.  Without
> > such protection, duplicate bgworkers are an understandable result.  I caught
> > several other assertions; the PMChildFlags failure is another case of
> > duplicate postmaster children:
> >
> >   6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
> >   3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
> >  20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
> >  21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)
> >  Also, got a few "select() failed in postmaster: Bad address"
> >
> > I suspect a Cygwin signals bug.  I'll try to distill a self-contained test
> > case for the Cygwin hackers.  The lack of failures on buildfarm member brolga
> > argues that older Cygwin is not affected.
> 
> Nice detective work.

Thanks.  http://marc.info/?t=15018329641 has my upstream report.  The
Cygwin project lead reproduced this, but a fix remained elusive.

I guess we'll ignore weird postmaster-associated lorikeet failures for the
foreseeable future.




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-03 Thread Robert Haas
On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch  wrote:
> postmaster algorithms rely on the PG_SETMASK() calls preventing that.  Without
> such protection, duplicate bgworkers are an understandable result.  I caught
> several other assertions; the PMChildFlags failure is another case of
> duplicate postmaster children:
>
>   6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
>   3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
>  20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
>  21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)
>  Also, got a few "select() failed in postmaster: Bad address"
>
> I suspect a Cygwin signals bug.  I'll try to distill a self-contained test
> case for the Cygwin hackers.  The lack of failures on buildfarm member brolga
> argues that older Cygwin is not affected.

Nice detective work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-02 Thread Noah Misch
On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
> 
> This one is even more exciting than the last one, because it sure looks
> like the crashing bgworker took the postmaster down with it.  That is
> Not Supposed To Happen.
> 
> Wondering if we broke something here recently, I tried to reproduce it
> on a Linux machine by adding a randomized Assert failure in
> shm_mq_set_sender.  I don't see any such problem, even with EXEC_BACKEND;
> we recover from the crash as-expected.
> 
> So I'm starting to get a distinct feeling that there's something wrong
> with the cygwin port.  But I dunno what.

I think signal blocking broke on Cygwin.

On a system (gcc 5.4.0, CYGWIN_NT-10.0 2.7.0(0.306/5/3) 2017-02-12 13:18
x86_64) that reproduces lorikeet's symptoms, I instrumented the postmaster as
attached.  The patch's small_parallel.sql is a subset of select_parallel.sql
sufficient to reproduce the mq_sender Assert failure and the postmaster silent
exit.  (It occasionally needed hundreds of iterations to do so.)  The parallel
query normally starts four bgworkers; when the mq_sender Assert fired, the
test had started five workers in response to four registrations.

The postmaster.c instrumentation regularly detects sigusr1_handler() calls
while another sigusr1_handler() is already on the stack:

6328 2017-08-02 07:25:42.788 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 0
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 1
6328 2017-08-02 07:25:42.788 GMT DEBUG:  slot 1 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 1)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 2 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 2)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 3 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 3)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-3 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 4 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker "parallel worker for PID 4776" (slot 4)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  starting background worker process "parallel worker for PID 4776"
6328 2017-08-02 07:25:42.790 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT WARNING:  signals already forbidden @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT LOG:  permit signals @ sigusr1_handler

postmaster algorithms rely on the PG_SETMASK() calls preventing that.  Without
such protection, duplicate bgworkers are an understandable result.  I caught
several other assertions; the PMChildFlags failure is another case of
duplicate postmaster children:

  6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
  3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
 20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
 21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)
 Also, got a few "select() failed in postmaster: Bad address"

I suspect a Cygwin signals bug.  I'll try to distill a self-contained test
case for the Cygwin hackers.  The lack of failures on buildfarm member brolga
argues that older Cygwin is not affected.
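
For reference, the guarantee the postmaster is counting on can be shown with a
minimal standalone program (written for illustration here, not taken from
PostgreSQL): a handler installed via sigaction() has its own signal blocked
until it returns, so it cannot be re-entered.  On a platform with working
signal blocking this prints nothing; lorikeet's nested sigusr1_handler calls
correspond to the "re-entered" branch firing.

#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t depth = 0;
static volatile sig_atomic_t calls = 0;

static void
handler(int signo)
{
    depth++;
    if (depth > 1)              /* what lorikeet's logs show happening */
        (void) write(STDERR_FILENO, "re-entered!\n", 12);
    if (calls++ == 0)
        raise(signo);           /* blocked: stays pending until we return */
    depth--;
}

int
main(void)
{
    struct sigaction sa;

    sa.sa_handler = handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);   /* the invoking signal is auto-blocked anyway */
    sigaction(SIGUSR1, &sa, NULL);
    raise(SIGUSR1);
    return 0;
}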
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 17b1038..f42034d 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -985,6 +985,8 @@ ParallelWorkerMain(Datum main_arg)
 	error_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_ERROR_QUEUE, false);
 	mq = (shm_mq *) (error_queue_space +
 					 ParallelWorkerNumber * PARALLEL_ERROR_QUEUE_SIZE);
+	write_stderr("PID %d claiming queue %d (%p)\n",
+				 MyProc->pid, ParallelWorkerNumber, mq);
 	shm_mq_set_sender(mq, MyProc);
 	mqh = 

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Amit Kapila
On Mon, Jun 26, 2017 at 8:09 PM, Andrew Dunstan
 wrote:
>
>
> On 06/26/2017 10:36 AM, Amit Kapila wrote:
>> On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan
>>  wrote:
>>>
>>> On 06/22/2017 10:24 AM, Tom Lane wrote:
 Andrew Dunstan  writes:
> Please let me know if there are tests I can run. I  missed your earlier
> request in this thread, sorry about that.
 That earlier request is still valid.  Also, if you can reproduce the
 symptom that lorikeet just showed and get a stack trace from the
 (hypothetical) postmaster core dump, that would be hugely valuable.


>>>
>>> See attached log and stacktrace
>>>
>> Is this all the log contents or is there something else?  The attached
>> log looks strange to me in the sense that the first worker that gets
>> invoked is Parallel worker number 2 and it finds that somebody else
>> has already set the sender handle.
>>
>
>
>
> No, it's the end of the log file. I can rerun and get the whole log file
> if you like.
>

Okay, if possible, please share the same.  Another way to get better
information is to change the code of shm_mq_set_sender such that it
hangs when we hit the Assertion condition.  Once it hangs, we can attach
a debugger and try to get some information.  Basically, just change
the code of shm_mq_set_sender as below or something like that:

void
shm_mq_set_sender(shm_mq *mq, PGPROC *proc)
{
    volatile shm_mq *vmq = mq;
    PGPROC     *receiver;

    SpinLockAcquire(&mq->mq_mutex);
    if (vmq->mq_sender != NULL)
    {
        /* spin here instead of asserting, so a debugger can be attached */
        while (1)
        {
        }
    }

If we are able to hit the above code, then we can print the fields of
mq_sender, especially pid, and see if the pid is that of the current backend.
In theory, it should be of a different backend, as this is the first
time we are setting mq_sender.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan


On 06/26/2017 10:45 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 06/23/2017 07:47 AM, Andrew Dunstan wrote:
>>> Rerunning with some different settings to see if I can get separate cores.
>> Numerous attempts to get core dumps following methods suggested in
>> Google searches have failed. The latest one is just hanging.
> Well, if it's hung, maybe you could attach to the processes with gdb and
> get stack traces manually?
>
>   



In theory what I have set up is supposed to be doing that, but I'll try.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/23/2017 07:47 AM, Andrew Dunstan wrote:
>> Rerunning with some different settings to see if I can get separate cores.

> Numerous attempts to get core dumps following methods suggested in
> Google searches have failed. The latest one is just hanging.

Well, if it's hung, maybe you could attach to the processes with gdb and
get stack traces manually?

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan


On 06/26/2017 10:36 AM, Amit Kapila wrote:
> On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan
>  wrote:
>>
>> On 06/22/2017 10:24 AM, Tom Lane wrote:
>>> Andrew Dunstan  writes:
 Please let me know if there are tests I can run. I  missed your earlier
 request in this thread, sorry about that.
>>> That earlier request is still valid.  Also, if you can reproduce the
>>> symptom that lorikeet just showed and get a stack trace from the
>>> (hypothetical) postmaster core dump, that would be hugely valuable.
>>>
>>>
>>
>> See attached log and stacktrace
>>
> Is this all the log contents or is there something else?  The attached
> log looks strange to me in the sense that the first worker that gets
> invoked is Parallel worker number 2 and it finds that somebody else
> has already set the sender handle.
>



No, it's the end of the log file. I can rerun and get the whole log file
if you like.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Amit Kapila
On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan
 wrote:
>
>
> On 06/22/2017 10:24 AM, Tom Lane wrote:
>> Andrew Dunstan  writes:
>>> Please let me know if there are tests I can run. I  missed your earlier
>>> request in this thread, sorry about that.
>> That earlier request is still valid.  Also, if you can reproduce the
>> symptom that lorikeet just showed and get a stack trace from the
>> (hypothetical) postmaster core dump, that would be hugely valuable.
>>
>>
>
>
> See attached log and stacktrace
>

Is this all the log contents or is there something else?  The attached
log looks strange to me in the sense that the first worker that gets
invoked is Parallel worker number 2 and it finds that somebody else
has already set the sender handle.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan


On 06/23/2017 07:47 AM, Andrew Dunstan wrote:
>
> On 06/23/2017 12:11 AM, Tom Lane wrote:
>> Andrew Dunstan  writes:
>>> On 06/22/2017 10:24 AM, Tom Lane wrote:
 That earlier request is still valid.  Also, if you can reproduce the
 symptom that lorikeet just showed and get a stack trace from the
 (hypothetical) postmaster core dump, that would be hugely valuable.
>>> See attached log and stacktrace
>> The stacktrace seems to be from the parallel-query-leader backend.
>> Was there another core file?
>>
>> The lack of any indication in the postmaster log that the postmaster saw
>> the parallel worker's crash sure looks the same as lorikeet's failure.
>> But if the postmaster didn't dump core, then we're still at a loss as to
>> why it didn't respond.
>>
>>  
>
>
> Rerunning with some different settings to see if I can get separate cores.
>


Numerous attempts to get core dumps following methods suggested in
Google searches have failed. The latest one is just hanging.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-23 Thread Andrew Dunstan


On 06/23/2017 12:11 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 06/22/2017 10:24 AM, Tom Lane wrote:
>>> That earlier request is still valid.  Also, if you can reproduce the
>>> symptom that lorikeet just showed and get a stack trace from the
>>> (hypothetical) postmaster core dump, that would be hugely valuable.
>> See attached log and stacktrace
> The stacktrace seems to be from the parallel-query-leader backend.
> Was there another core file?
>
> The lack of any indication in the postmaster log that the postmaster saw
> the parallel worker's crash sure looks the same as lorikeet's failure.
> But if the postmaster didn't dump core, then we're still at a loss as to
> why it didn't respond.
>
>   



Rerunning with some different settings to see if I can get separate cores.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/22/2017 10:24 AM, Tom Lane wrote:
>> That earlier request is still valid.  Also, if you can reproduce the
>> symptom that lorikeet just showed and get a stack trace from the
>> (hypothetical) postmaster core dump, that would be hugely valuable.

> See attached log and stacktrace

The stacktrace seems to be from the parallel-query-leader backend.
Was there another core file?

The lack of any indication in the postmaster log that the postmaster saw
the parallel worker's crash sure looks the same as lorikeet's failure.
But if the postmaster didn't dump core, then we're still at a loss as to
why it didn't respond.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Andrew Dunstan


On 06/22/2017 10:24 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> Please let me know if there are tests I can run. I  missed your earlier
>> request in this thread, sorry about that.
> That earlier request is still valid.  Also, if you can reproduce the
> symptom that lorikeet just showed and get a stack trace from the
> (hypothetical) postmaster core dump, that would be hugely valuable.
>
>   


See attached log and stacktrace

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

??
??:0
??
??:0
??
??:0
??
??:0
??
??:0
??
??:0
errfinish
/home/andrew/bf64/root/HEAD/pgsql/src/backend/utils/error/elog.c:555
elog_finish
/home/andrew/bf64/root/HEAD/pgsql/src/backend/utils/error/elog.c:1378
s_lock_stuck
/home/andrew/bf64/root/HEAD/pgsql/src/backend/storage/lmgr/s_lock.c:83
s_lock
/home/andrew/bf64/root/HEAD/pgsql/src/backend/storage/lmgr/s_lock.c:98
shm_mq_detach
/home/andrew/bf64/root/HEAD/pgsql/src/backend/storage/ipc/shm_mq.c:783
slist_is_empty
/home/andrew/bf64/root/HEAD/pgsql/src/include/lib/ilist.h:567
DestroyParallelContext
/home/andrew/bf64/root/HEAD/pgsql/src/backend/access/transam/parallel.c:629
ExecParallelCleanup
/home/andrew/bf64/root/HEAD/pgsql/src/backend/executor/execParallel.c:672
ExecShutdownGather
/home/andrew/bf64/root/HEAD/pgsql/src/backend/executor/nodeGather.c:408


pglog
Description: Binary data



Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Amit Kapila
On Thu, Jun 22, 2017 at 7:54 PM, Tom Lane  wrote:
> Andrew Dunstan  writes:
>> Please let me know if there are tests I can run. I  missed your earlier
>> request in this thread, sorry about that.
>
> That earlier request is still valid.
>

Yeah, that makes sense, and also maybe we can try to print the dsm_segment
handle value both before launching the worker and in the parallel worker
(ParallelWorkerMain).  That might help us in identifying the reason for the
error "could not map dynamic shared memory segment".


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Tom Lane
Andrew Dunstan  writes:
> Please let me know if there are tests I can run. I  missed your earlier
> request in this thread, sorry about that.

That earlier request is still valid.  Also, if you can reproduce the
symptom that lorikeet just showed and get a stack trace from the
(hypothetical) postmaster core dump, that would be hugely valuable.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Andrew Dunstan


On 06/21/2017 06:44 PM, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
>
> This one is even more exciting than the last one, because it sure looks
> like the crashing bgworker took the postmaster down with it.  That is
> Not Supposed To Happen.
>
> Wondering if we broke something here recently, I tried to reproduce it
> on a Linux machine by adding a randomized Assert failure in
> shm_mq_set_sender.  I don't see any such problem, even with EXEC_BACKEND;
> we recover from the crash as-expected.
>
> So I'm starting to get a distinct feeling that there's something wrong
> with the cygwin port.  But I dunno what.
>
>   


:-(

Please let me know if there are tests I can run.  I missed your earlier
request in this thread, sorry about that.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-21 Thread Tom Lane
Today, lorikeet failed with a new variant on the bgworker start crash:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10

This one is even more exciting than the last one, because it sure looks
like the crashing bgworker took the postmaster down with it.  That is
Not Supposed To Happen.

Wondering if we broke something here recently, I tried to reproduce it
on a Linux machine by adding a randomized Assert failure in
shm_mq_set_sender.  I don't see any such problem, even with EXEC_BACKEND;
we recover from the crash as-expected.

So I'm starting to get a distinct feeling that there's something wrong
with the cygwin port.  But I dunno what.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 5:16 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane  wrote:
>>> ... nodeGather cannot deem the query done until it's seen EOF on
>>> each tuple queue, which it cannot see until each worker has attached
>>> to and then detached from the associated shm_mq.
>
>> Oh.  That's sad.  It definitely has to wait for any tuple queues that
>> have been attached to be detached, but it would be better if it didn't
>> have to wait for processes that haven't even attached yet.
>
> Hm.  We assume they attach before they start taking any of the query
> work?  Seems reasonable, and this would give us some chance of recovering
> from worker fork failure.

Yeah, something like that.  I'm not sure exactly how to implement it,
though.  I think I intended for it to work that way all along, but the
code's not there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane  wrote:
>> ... nodeGather cannot deem the query done until it's seen EOF on
>> each tuple queue, which it cannot see until each worker has attached
>> to and then detached from the associated shm_mq.

> Oh.  That's sad.  It definitely has to wait for any tuple queues that
> have been attached to be detached, but it would be better if it didn't
> have to wait for processes that haven't even attached yet.

Hm.  We assume they attach before they start taking any of the query
work?  Seems reasonable, and this would give us some chance of recovering
from worker fork failure.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane  wrote:
> I wrote:
>> Robert Haas  writes:
>>> I think you're right.  So here's a theory:
>
>>> 1. The ERROR mapping the DSM segment is just a case of the worker
>>> losing a race, and isn't a bug.
>
>> I concur that this is a possibility,
>
> Actually, no, it isn't.  I tried to reproduce the problem by inserting
> a sleep into ParallelWorkerMain, and could not.  After digging around
> in the code, I realize that the leader process *can not* exit the
> parallel query before the workers start, at least not without hitting
> an error first, which is not happening in these examples.  The reason
> is that nodeGather cannot deem the query done until it's seen EOF on
> each tuple queue, which it cannot see until each worker has attached
> to and then detached from the associated shm_mq.

Oh.  That's sad.  It definitely has to wait for any tuple queues that
have been attached to be detached, but it would be better if it didn't
have to wait for processes that haven't even attached yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> I think you're right.  So here's a theory:

>> 1. The ERROR mapping the DSM segment is just a case of the worker
>> losing a race, and isn't a bug.

> I concur that this is a possibility,

Actually, no, it isn't.  I tried to reproduce the problem by inserting
a sleep into ParallelWorkerMain, and could not.  After digging around
in the code, I realize that the leader process *can not* exit the
parallel query before the workers start, at least not without hitting
an error first, which is not happening in these examples.  The reason
is that nodeGather cannot deem the query done until it's seen EOF on
each tuple queue, which it cannot see until each worker has attached
to and then detached from the associated shm_mq.

(BTW, this also means that the leader is frozen solid if a worker
process fails to start, but we knew that already.)

So we still don't know why lorikeet is sometimes reporting "could not map
dynamic shared memory segment".  It's clear though that once that happens,
the current code has no prayer of recovering cleanly.  It looks from
lorikeet's logs like there is something that is forcing a timeout via
crash after ~150 seconds, but I do not know what that is.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas  writes:
> I think you're right.  So here's a theory:

> 1. The ERROR mapping the DSM segment is just a case of the worker
> losing a race, and isn't a bug.

I concur that this is a possibility, but if we expect this to happen,
seems like there should be other occurrences in the buildfarm logs.
I trolled the last three months worth of check/installcheck logs (all runs
not just failures), and could find exactly two cases of "could not map
dynamic shared memory segment":

 sysname  |    branch     |      snapshot       |     stage      |                                               l
----------+---------------+---------------------+----------------+------------------------------------------------
 lorikeet | REL9_6_STABLE | 2017-05-03 10:21:31 | Check          | 2017-05-03 06:27:32.626 EDT [5909b094.1e28:1] ERROR:  could not map dynamic shared memory segment
 lorikeet | HEAD          | 2017-06-13 20:28:33 | InstallCheck-C | 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
(2 rows)

Now maybe this can be explained away by saying that the worker never loses
the race unless it's subject to cygwin's unusually slow fork() emulation,
but somehow I doubt that.  For one thing, it's not clear why that path
would be slower than EXEC_BACKEND, which would also involve populating
a new process image from scratch.

BTW, that 9.6 failure is worth studying because it looks quite a bit
different from the one on HEAD.  It looks like the worker failed to
launch and then the leader got hung up waiting for the worker.
Eventually other stuff started failing because the select_parallel
test is holding an exclusive lock on tenk1 throughout its session.
(Does it really need to do that ALTER TABLE?)

> 2. But when that happens, parallel_terminate_count is getting bumped
> twice for some reason.
> 3. So then the leader process fails that assertion when it tries to
> launch the parallel workers for the next query.

It seems like this has to trace to some sort of logic error in the
postmaster that's allowing it to mess up parallel_terminate_count,
but I'm not managing to construct a plausible flow of control that
would cause that.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:21 AM, Amit Kapila  wrote:
> Yes, I think it is for the next query.  If you refer to the log below from lorikeet:
>
> 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG:  statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
> 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
> 2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
> 2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select stringu1::int2 from tenk1 where unique1 = 1;
> TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) was terminated by signal 6: Aborted
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL:  Failed process was running: select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG:  terminating any other active server processes
>
> Error "could not map dynamic shared memory segment" is due to query
> "EXPLAIN .. SELECT * FROM tenk1" and Assertion failure is due to
> another statement "select stringu1::int2 from tenk1 where unique1 =
> 1;".

I think you're right.  So here's a theory:

1. The ERROR mapping the DSM segment is just a case of the worker
losing a race, and isn't a bug.

2. But when that happens, parallel_terminate_count is getting bumped
twice for some reason.

3. So then the leader process fails that assertion when it tries to
launch the parallel workers for the next query.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane  wrote:
>> ... er, -ENOCAFFEINE.  Nonetheless, there are no checks of
>> EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think
>> a bet is being missed somewhere.

> ExecGather() is where workers get launched, and that ain't happening
> if EXEC_FLAG_EXPLAIN_ONLY is set, unless I am *very* confused about
> how this works.

Ah, you're right, it's done during first execution not during InitPlan.
Nevermind ...

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane  wrote:
>>> It's fairly hard to read this other than as telling us that the worker was
>>> launched for the EXPLAIN (although really? why aren't we skipping that if
>>> EXEC_FLAG_EXPLAIN_ONLY?), ...
>
>> Uh, because ANALYZE was used?
>
> ... er, -ENOCAFFEINE.  Nonetheless, there are no checks of
> EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think
> a bet is being missed somewhere.

ExecGather() is where workers get launched, and that ain't happening
if EXEC_FLAG_EXPLAIN_ONLY is set, unless I am *very* confused about
how this works.
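
As a toy analogue of that control flow (a sketch for illustration, not
PostgreSQL source): worker launch lives in the execution step, and plain
EXPLAIN stops after plan initialization, so the launch point is never reached.

#include <stdio.h>

#define EXEC_FLAG_EXPLAIN_ONLY 0x0001   /* same spirit as the real flag */

static void
exec_gather(void)
{
    /* analogue of ExecGather(): only reached when the plan actually runs */
    printf("launching parallel workers\n");
}

static void
run_query(int eflags)
{
    printf("plan initialized\n");       /* initialization happens either way */
    if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
        return;                         /* plain EXPLAIN: nodes never execute */
    exec_gather();
}

int
main(void)
{
    run_query(EXEC_FLAG_EXPLAIN_ONLY);  /* EXPLAIN: no workers */
    run_query(0);                       /* EXPLAIN ANALYZE: workers launch */
    return 0;
}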

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane  wrote:
>> It's fairly hard to read this other than as telling us that the worker was
>> launched for the EXPLAIN (although really? why aren't we skipping that if
>> EXEC_FLAG_EXPLAIN_ONLY?), ...

> Uh, because ANALYZE was used?

... er, -ENOCAFFEINE.  Nonetheless, there are no checks of
EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think
a bet is being missed somewhere.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane  wrote:
>>> But we know, from the subsequent failed assertion, that the leader was
>>> still trying to launch parallel workers.  So that particular theory
>>> doesn't hold water.
>
>> Is there any chance that it's already trying to launch parallel
>> workers for the *next* query?
>
> Oh!  Yeah, you might be right, because the trace includes a statement
> LOG entry from the leader in between:
>
> 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG:  statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
> 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
> 2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
> 2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select stringu1::int2 from tenk1 where unique1 = 1;
> TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) was terminated by signal 6: Aborted
>
> It's fairly hard to read this other than as telling us that the worker was
> launched for the EXPLAIN (although really? why aren't we skipping that if
> EXEC_FLAG_EXPLAIN_ONLY?), ...

Uh, because ANALYZE was used?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane  wrote:
>> But we know, from the subsequent failed assertion, that the leader was
>> still trying to launch parallel workers.  So that particular theory
>> doesn't hold water.

> Is there any chance that it's already trying to launch parallel
> workers for the *next* query?

Oh!  Yeah, you might be right, because the trace includes a statement
LOG entry from the leader in between:

2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG:  statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) was terminated by signal 6: Aborted

It's fairly hard to read this other than as telling us that the worker was
launched for the EXPLAIN (although really? why aren't we skipping that if
EXEC_FLAG_EXPLAIN_ONLY?), and then we see a new LOG entry for the next
statement before the leader hits its assertion failure.

> Could be -- but it could also be timing-related.  If we are in fact
> using cygwin's fork emulation, the documentation for it explains that
> it's slow: https://www.cygwin.com/faq.html#faq.api.fork
> Interestingly, it also mentions that making it work requires
> suspending the parent while the child is starting up, which probably
> does not happen on any other platform.  Of course it also makes my
> theory that the child doesn't reach dsm_attach() before the parent
> finishes the query pretty unlikely.

Well, if this was a worker launched during InitPlan() for an EXPLAIN,
the leader would have shut down the query almost immediately after
launching the worker.  So it does fit pretty well as long as you're
willing to believe that the leader got to run before the child.

But what this theory doesn't explain is: why haven't we seen this before?
It now seems like it ought to come up often, since there are several
EXPLAINs for parallel queries in that test.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Amit Kapila
On Thu, Jun 15, 2017 at 7:42 PM, Robert Haas  wrote:
> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane  wrote:
>>> Well, as Amit points out, there are entirely legitimate ways for that
>>> to happen.  If the leader finishes the whole query itself before the
>>> worker reaches the dsm_attach() call, it will call dsm_detach(),
>>> destroying the segment, and the worker will hit this ERROR.  That
>>> shouldn't happen very often in the real world, because we ought not to
>>> select a parallel plan in the first place unless the query is going to
>>> take a while to run, but the select_parallel test quite deliberately
>>> disarms all of the guards that would tend to discourage such plans.
>>
>> But we know, from the subsequent failed assertion, that the leader was
>> still trying to launch parallel workers.  So that particular theory
>> doesn't hold water.
>
> Is there any chance that it's already trying to launch parallel
> workers for the *next* query?
>

Yes, I think it is for the next query.  If you refer to the log below from lorikeet:

2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG:  statement: EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM tenk1;
2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL:  Failed process was running: select stringu1::int2 from tenk1 where unique1 = 1;
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG:  terminating any other active server processes

Error "could not map dynamic shared memory segment" is due to query
"EXPLAIN .. SELECT * FROM tenk1" and Assertion failure is due to
another statement "select stringu1::int2 from tenk1 where unique1 =
1;".


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane  wrote:
>> Well, as Amit points out, there are entirely legitimate ways for that
>> to happen.  If the leader finishes the whole query itself before the
>> worker reaches the dsm_attach() call, it will call dsm_detach(),
>> destroying the segment, and the worker will hit this ERROR.  That
>> shouldn't happen very often in the real world, because we ought not to
>> select a parallel plan in the first place unless the query is going to
>> take a while to run, but the select_parallel test quite deliberately
>> disarms all of the guards that would tend to discourage such plans.
>
> But we know, from the subsequent failed assertion, that the leader was
> still trying to launch parallel workers.  So that particular theory
> doesn't hold water.

Is there any chance that it's already trying to launch parallel
workers for the *next* query?

>> Of course, as Amit also points out, it could also be the result of
>> some bug, but I'm not sure we have any reason to think so.
>
> The fact that we've only seen this on cygwin leads the mind in the
> direction of platform-specific problems.  Both this case and lorikeet's
> earlier symptoms could be explained if the parameters passed from leader
> to workers somehow got corrupted occasionally; so that's what I've been
> thinking about, but I'm not seeing anything.

Could be -- but it could also be timing-related.  If we are in fact
using cygwin's fork emulation, the documentation for it explains that
it's slow: https://www.cygwin.com/faq.html#faq.api.fork

Interestingly, it also mentions that making it work requires
suspending the parent while the child is starting up, which probably
does not happen on any other platform.  Of course it also makes my
theory that the child doesn't reach dsm_attach() before the parent
finishes the query pretty unlikely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane  wrote:
>> The lack of any other message before the 'could not map' failure must,
>> then, mean that dsm_attach() couldn't find an entry in shared memory
>> that it wanted to attach to.  But how could that happen?

> Well, as Amit points out, there are entirely legitimate ways for that
> to happen.  If the leader finishes the whole query itself before the
> worker reaches the dsm_attach() call, it will call dsm_detach(),
> destroying the segment, and the worker will hit this ERROR.  That
> shouldn't happen very often in the real world, because we ought not to
> select a parallel plan in the first place unless the query is going to
> take a while to run, but the select_parallel test quite deliberately
> disarms all of the guards that would tend to discourage such plans.

But we know, from the subsequent failed assertion, that the leader was
still trying to launch parallel workers.  So that particular theory
doesn't hold water.

> Of course, as Amit also points out, it could also be the result of
> some bug, but I'm not sure we have any reason to think so.

The fact that we've only seen this on cygwin leads the mind in the
direction of platform-specific problems.  Both this case and lorikeet's
earlier symptoms could be explained if the parameters passed from leader
to workers somehow got corrupted occasionally; so that's what I've been
thinking about, but I'm not seeing anything.

Would someone confirm my recollection that the cygwin build does *not*
use EXEC_BACKEND, but relies on a cygwin-provided emulation of fork()?
How does that emulation work, anyway?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane  wrote:
> I wrote:
>> But surely the silent treatment should only apply to DSM_OP_CREATE?
>
> Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.
>
> The lack of any other message before the 'could not map' failure must,
> then, mean that dsm_attach() couldn't find an entry in shared memory
> that it wanted to attach to.  But how could that happen?

Well, as Amit points out, there are entirely legitimate ways for that
to happen.  If the leader finishes the whole query itself before the
worker reaches the dsm_attach() call, it will call dsm_detach(),
destroying the segment, and the worker will hit this ERROR.  That
shouldn't happen very often in the real world, because we ought not to
select a parallel plan in the first place unless the query is going to
take a while to run, but the select_parallel test quite deliberately
disarms all of the guards that would tend to discourage such plans.
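
A minimal POSIX analogue of that race (an illustrative sketch, not
PostgreSQL's dsm.c, which is considerably more elaborate; compile with -lrt
on older glibc): if the creator destroys a named segment before the latecomer
attaches, the attach simply finds nothing.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    /* "leader" creates the segment... */
    int fd = shm_open("/demo_seg", O_CREAT | O_RDWR, 0600);

    /* ...finishes the query and destroys it before the "worker" arrives */
    shm_unlink("/demo_seg");

    /* the late worker's attach-by-name now fails with ENOENT, the
     * analogue of "could not map dynamic shared memory segment" */
    if (shm_open("/demo_seg", O_RDWR, 0600) < 0)
        perror("could not map segment");

    if (fd >= 0)
        close(fd);
    return 0;
}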

Of course, as Amit also points out, it could also be the result of
some bug, but I'm not sure we have any reason to think so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Amit Kapila
On Thu, Jun 15, 2017 at 3:31 AM, Tom Lane  wrote:
> I wrote:
>> But surely the silent treatment should only apply to DSM_OP_CREATE?
>
> Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.
>
> The lack of any other message before the 'could not map' failure must,
> then, mean that dsm_attach() couldn't find an entry in shared memory
> that it wanted to attach to.
>

Yes, I also think so.

>  But how could that happen?
>

I can think of a few reasons why it can happen: (a) the value of the
segment handle passed by the master backend to the worker backend somehow
got messed up; (b) all other workers, along with the master backend,
exited before one of the workers tried to attach; (c) the master backend
never actually created a dsm segment (like when the maximum number of
segments has been reached) but still invoked some workers; (d) some
corner-case bug in the dsm code due to which it can't attach to a valid
segment handle.

Now, of these, I have checked that (c) can't happen, because we ensure
that if the segment is not created then we set the number of workers to
zero.  I think (b) shouldn't happen, because we wait for all workers to
exit before the query is finished.  Now, (a) and (d) are slightly related,
and I have looked around in the relevant code but didn't find any obvious
problem; however, it seems to me that it might be something that behaves
differently in the Cygwin environment.  For example, I think the value of
seg->handle can be different from what we expect in dsm_create.
Basically, random() returns long and we are storing it in dsm_handle
(uint32), so considering long is 8 bytes on Cygwin [1] and 4 bytes on
Windows, the value could wrap; but even if that happens, it is not clear
how that can cause what we are seeing in this case.


[1] - https://stackoverflow.com/questions/384502/what-is-the-bit-size-of-long-on-64-bit-windows
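
As a standalone sketch of that width mismatch (illustrative only; it assumes
an LP64 platform such as 64-bit Cygwin and uses a made-up wide value, since
whether random() ever produces one is exactly what's in question): storing a
long into a uint32 handle silently drops the high bits.

#include <inttypes.h>
#include <stdio.h>

typedef uint32_t dsm_handle;    /* 32 bits wide, as in PostgreSQL */

int
main(void)
{
    long        wide = 0x123456789aL;   /* hypothetical >32-bit value */
    dsm_handle  h = (dsm_handle) wide;  /* high bits silently dropped */

    printf("long value = %lx, stored handle = %" PRIx32 "\n", wide, h);
    return 0;
}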

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane
I wrote:
> But surely the silent treatment should only apply to DSM_OP_CREATE?

Oh ... scratch that, it *does* only apply to DSM_OP_CREATE.

The lack of any other message before the 'could not map' failure must,
then, mean that dsm_attach() couldn't find an entry in shared memory
that it wanted to attach to.  But how could that happen?

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane  wrote:
>> So the first problem here is the lack of supporting information for the
>> 'could not map' failure.

> Hmm.  I think I believed at the time I wrote dsm_attach() that
> somebody might want to try to soldier on after failing to map a DSM,
> but that doesn't seem very likely  any more.

Well, if they do, they shouldn't be passing elevel == ERROR.

>> AFAICS, this must mean either that dsm_attach()
>> returned without ever calling dsm_impl_op() at all, or that
>> dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
>> ERROR_ACCESS_DENIED.  It's far from clear why those cases should be
>> treated as a silent fail.

> There's a good reason for that, though.  See
> 419113dfdc4c729f6c763cc30a9b02ee68a7da94.

But surely the silent treatment should only apply to DSM_OP_CREATE?
We're not going to retry anything else.

>> It's even less clear why dsm_attach's early
>> exit cases don't produce any messages.  But since we're left not knowing
>> what happened, the messaging design here is clearly inadequate.

> I don't know what you mean by this.  The function only has one
> early-exit case, the comment for which I quoted above.

OK, s/cases/case/, but the problem remains.  We don't know what happened.
We cannot have more than one case in this code where nothing gets logged.

>> It's not very clear how that happened, but my
>> bet is that the postmaster incremented parallel_terminate_count more than
>> once while cleaning up after the crashed worker.  It looks to me like
>> there's nothing stopping BackgroundWorkerStateChange from incrementing it
>> and then the eventual ForgetBackgroundWorker call from incrementing it
>> again.  I haven't traced through things to identify why this might only
>> occur in a worker-failure scenario, but surely we want to make sure that
>> the counter increment happens once and only once per worker.

> Yeah -- if that can happen, it's definitely a bug.

My first thought about fixing it is that we should remove that code from
BackgroundWorkerStateChange altogether.  The parallel_terminate_count
increment should happen in, and only in, ForgetBackgroundWorker.  There
seems little reason to risk bugs by trying to do it a bit earlier.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Robert Haas
On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane  wrote:
> So the first problem here is the lack of supporting information for the
> 'could not map' failure.

Hmm.  I think I believed at the time I wrote dsm_attach() that
somebody might want to try to soldier on after failing to map a DSM,
but that doesn't seem very likely any more.  That theory is supported
by this comment:

/*
 * If we didn't find the handle we're looking for in the control segment,
 * it probably means that everyone else who had it mapped, including the
 * original creator, died before we got to this point. It's up to the
 * caller to decide what to do about that.
 */

But in practice, every current caller handles a failure here by bailing out.

> AFAICS, this must mean either that dsm_attach()
> returned without ever calling dsm_impl_op() at all, or that
> dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
> ERROR_ACCESS_DENIED.  It's far from clear why those cases should be
> treated as a silent fail.

There's a good reason for that, though.  See
419113dfdc4c729f6c763cc30a9b02ee68a7da94.

> It's even less clear why dsm_attach's early
> exit cases don't produce any messages.  But since we're left not knowing
> what happened, the messaging design here is clearly inadequate.

I don't know what you mean by this.  The function only has one
early-exit case, the comment for which I quoted above.

> The subsequent assertion crash came from here:
>
> /*
>  * If this is a parallel worker, check whether there are already too many
>  * parallel workers; if so, don't register another one.  Our view of
>  * parallel_terminate_count may be slightly stale, but that doesn't really
>  * matter: we would have gotten the same result if we'd arrived here
>  * slightly earlier anyway.  There's no help for it, either, since the
>  * postmaster must not take locks; a memory barrier wouldn't guarantee
>  * anything useful.
>  */
> if (parallel && (BackgroundWorkerData->parallel_register_count -
>                  BackgroundWorkerData->parallel_terminate_count) >=
>     max_parallel_workers)
> {
>     Assert(BackgroundWorkerData->parallel_register_count -
>            BackgroundWorkerData->parallel_terminate_count <=
>            MAX_PARALLEL_WORKER_LIMIT);
>     LWLockRelease(BackgroundWorkerLock);
>     return false;
> }
>
> What I suspect happened is that parallel_register_count was less than
> parallel_terminate_count; since those counters are declared as uint32,
> the negative difference would have been treated as a large unsigned value,
> triggering the assertion.

Right, and that's exactly the point of having that assertion.
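
A tiny standalone illustration of that arithmetic (written for this writeup,
not PostgreSQL code): once the terminate counter overtakes the register
counter, the unsigned difference wraps to a huge value and the limit check
trips, which is exactly the TRAP lorikeet reported.

#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
    uint32_t parallel_register_count = 3;
    uint32_t parallel_terminate_count = 4;  /* the double-increment scenario */

    /* 3 - 4 wraps to 4294967295 in uint32 arithmetic */
    uint32_t active = parallel_register_count - parallel_terminate_count;

    printf("apparent active workers: %" PRIu32 "\n", active);
    assert(active <= 1024);     /* fails, mirroring lorikeet's TRAP */
    return 0;
}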

> It's not very clear how that happened, but my
> bet is that the postmaster incremented parallel_terminate_count more than
> once while cleaning up after the crashed worker.  It looks to me like
> there's nothing stopping BackgroundWorkerStateChange from incrementing it
> and then the eventual ForgetBackgroundWorker call from incrementing it
> again.  I haven't traced through things to identify why this might only
> occur in a worker-failure scenario, but surely we want to make sure that
> the counter increment happens once and only once per worker.

Yeah -- if that can happen, it's definitely a bug.

> I'm also noting that ForgetBackgroundWorker is lacking a memory barrier
> between incrementing parallel_terminate_count and marking the slot free.
> Maybe it doesn't need one, but since there is one in
> BackgroundWorkerStateChange, it looks weird to not have it here.

I noticed that the other day and thought about mentioning it, but
didn't quite muster up the energy.  It seems unlikely to me to be the
cause of any of the problems we're seeing, but I think it can't hurt
to fix the omission.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane
Yesterday lorikeet failed the select_parallel test in a new way:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33

2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= 1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL:  Failed process was running: select stringu1::int2 from tenk1 where unique1 = 1;
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG:  terminating any other active server processes

So the first problem here is the lack of supporting information for the
'could not map' failure.  AFAICS, this must mean either that dsm_attach()
returned without ever calling dsm_impl_op() at all, or that
dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
ERROR_ACCESS_DENIED.  It's far from clear why those cases should be
treated as a silent fail.  It's even less clear why dsm_attach's early
exit cases don't produce any messages.  But since we're left not knowing
what happened, the messaging design here is clearly inadequate.

The subsequent assertion crash came from here:

/*
 * If this is a parallel worker, check whether there are already too many
 * parallel workers; if so, don't register another one.  Our view of
 * parallel_terminate_count may be slightly stale, but that doesn't really
 * matter: we would have gotten the same result if we'd arrived here
 * slightly earlier anyway.  There's no help for it, either, since the
 * postmaster must not take locks; a memory barrier wouldn't guarantee
 * anything useful.
 */
if (parallel && (BackgroundWorkerData->parallel_register_count -
                 BackgroundWorkerData->parallel_terminate_count) >=
    max_parallel_workers)
{
    Assert(BackgroundWorkerData->parallel_register_count -
           BackgroundWorkerData->parallel_terminate_count <=
           MAX_PARALLEL_WORKER_LIMIT);
    LWLockRelease(BackgroundWorkerLock);
    return false;
}

What I suspect happened is that parallel_register_count was less than
parallel_terminate_count; since those counters are declared as uint32,
the negative difference would have been treated as a large unsigned value,
triggering the assertion.  It's not very clear how that happened, but my
bet is that the postmaster incremented parallel_terminate_count more than
once while cleaning up after the crashed worker.  It looks to me like
there's nothing stopping BackgroundWorkerStateChange from incrementing it
and then the eventual ForgetBackgroundWorker call from incrementing it
again.  I haven't traced through things to identify why this might only
occur in a worker-failure scenario, but surely we want to make sure that
the counter increment happens once and only once per worker.
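
The arithmetic hazard is easy to demonstrate entirely outside PostgreSQL
(standalone C):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint32_t register_count = 2;    /* parallel_register_count */
    uint32_t terminate_count = 3;   /* bumped twice for one dead worker */

    /* 2 - 3 wraps around to 4294967295, far above
     * MAX_PARALLEL_WORKER_LIMIT (1024, per the assertion text above),
     * so the Assert fires */
    printf("%u\n", register_count - terminate_count);
    return 0;
}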

I'm also noting that ForgetBackgroundWorker is lacking a memory barrier
between incrementing parallel_terminate_count and marking the slot free.
Maybe it doesn't need one, but since there is one in
BackgroundWorkerStateChange, it looks weird to not have it here.
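
Concretely, the ordering I have in mind is something like this sketch
(assuming the slot's in_use flag is what marks it free):

BackgroundWorkerData->parallel_terminate_count++;

pg_write_barrier();     /* publish the increment before the slot can be
                         * observed as free and handed out again */
slot->in_use = false;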

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila  wrote:
>> I don't think so because this problem has been reported previously as
>> well [1][2] even before the commit in question.
>> 
>> [1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net
>> [2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us

> Oh, good point.  So this is a longstanding bug that has possibly just
> gotten easier to hit.

> I still think figuring out what's going on with the
> ParallelWorkerNumbers is a good plan.

Also, your suggestion in

https://www.postgresql.org/message-id/CA%2BTgmob29v0zASBNfgO1Mq9yJ7_TRoAjL%3DO%2B2rXS0gBZezv%2BrQ%40mail.gmail.com

for a quicker path to reproducing it might still be valid.  Although at
this point, it seems like something we've changed recently has made it
occur often enough in the buildfarm that repeating the standard regression
tests should be good enough.

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Robert Haas
On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila  wrote:
> I don't think so because this problem has been reported previously as
> well [1][2] even before the commit in question.
>
> [1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net
> [2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us

Oh, good point.  So this is a longstanding bug that has possibly just
gotten easier to hit.

I still think figuring out what's going on with the
ParallelWorkerNumbers is a good plan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Amit Kapila
On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas  wrote:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane  wrote:
>>> One thought is that the only places where shm_mq_set_sender() should
>>> be getting invoked during the main regression tests are
>>> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
>>> places use ParallelWorkerNumber to figure out what address to pass.
>>> So if ParallelWorkerNumber were getting set to the same value in two
>>> different parallel workers - e.g. because the postmaster went nuts and
>>> launched two processes instead of only one - or if
>>> ParallelWorkerNumber were not getting initialized at all or were
>>> getting initialized to some completely bogus value, it could cause
>>> this symptom.
>>
>> Hmm.  With some generous assumptions it'd be possible to think that
>> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
>> present in 20 successful lorikeet runs before the first of these failures,
>> which is a bit more than the MTBF after that, but not a huge amount more.
>>
>> That commit in itself looks innocent enough, but could it have exposed
>> some latent bug in bgworker launching?
>
> Hmm, that's a really interesting idea, but I can't quite put together
> a plausible theory around it.  I mean, it seems like that commit could
> make launching bgworkers faster, which could conceivably tickle some
> heretofore-latent timing-related bug.  But it wouldn't, IIUC, make the
> first worker start any faster than before - it would just make them
> more closely-spaced thereafter, and it's not very obvious how that
> would cause a problem.
>
> Another idea is that the commit in question is managing to corrupt
> BackgroundWorkerList somehow.
>

I don't think so because this problem has been reported previously as
well [1][2] even before the commit in question.


[1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net
[2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas
On Tue, Jun 6, 2017 at 4:25 PM, Tom Lane  wrote:
> (I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2
> or so.)

I don't mind adding it permanently, but I think that's too high.
Somebody running a lot of parallel queries could easily get enough
messages to drown out the stuff they actually want to see at that
level.  I wouldn't object to DEBUG3 though.
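
Concretely, that would just be the messages from your test patch with the
level knocked down, e.g. (a sketch):

ereport(DEBUG3,
        (errmsg("starting PID %d, %s, worker number %d",
                MyProcPid, MyBgworkerEntry->bgw_name,
                ParallelWorkerNumber)));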

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane  wrote:
>> Hmm.  With some generous assumptions it'd be possible to think that
>> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
>> present in 20 successful lorikeet runs before the first of these failures,
>> which is a bit more than the MTBF after that, but not a huge amount more.
>> That commit in itself looks innocent enough, but could it have exposed
>> some latent bug in bgworker launching?

> Hmm, that's a really interesting idea, but I can't quite put together
> a plausible theory around it.

Yeah, me either, but we're really theorizing in advance of the data here.
Andrew, could you apply the attached patch on lorikeet and run the
regression tests enough times to get a couple of failures?  Then grepping
the postmaster log for 'parallel worker' should give you results like

2017-06-06 16:20:12.393 EDT [31216] LOG:  starting PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.400 EDT [31216] LOG:  stopping PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.406 EDT [31217] LOG:  starting PID 31217, parallel worker for PID 31215, worker number 3
2017-06-06 16:20:12.406 EDT [31218] LOG:  starting PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.406 EDT [31219] LOG:  starting PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.406 EDT [31220] LOG:  starting PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31218] LOG:  stopping PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.412 EDT [31219] LOG:  stopping PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.412 EDT [31220] LOG:  stopping PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31217] LOG:  stopping PID 31217, parallel worker for PID 31215, worker number 3
... etc etc ...

If it looks different from that in a crash case, we'll have something
to go on.

(I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2
or so.)

regards, tom lane

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index cb22174..d3cb26c 100644
*** a/src/backend/access/transam/parallel.c
--- b/src/backend/access/transam/parallel.c
*************** ParallelWorkerMain(Datum main_arg)
*** 950,955 ****
--- 950,961 ----
  	Assert(ParallelWorkerNumber == -1);
  	memcpy(&ParallelWorkerNumber, MyBgworkerEntry->bgw_extra, sizeof(int));
  
+ 	/* Log parallel worker startup. */
+ 	ereport(LOG,
+ 			(errmsg("starting PID %d, %s, worker number %d",
+ 	MyProcPid, MyBgworkerEntry->bgw_name,
+ 	ParallelWorkerNumber)));
+ 
  	/* Set up a memory context and resource owner. */
  	Assert(CurrentResourceOwner == NULL);
  	CurrentResourceOwner = ResourceOwnerCreate(NULL, "parallel toplevel");
*************** ParallelWorkerMain(Datum main_arg)
*** 1112,1117 ****
--- 1118,1129 ----
  
  	/* Report success. */
  	pq_putmessage('X', NULL, 0);
+ 
+ 	/* Log parallel worker shutdown. */
+ 	ereport(LOG,
+ 			(errmsg("stopping PID %d, %s, worker number %d",
+ 	MyProcPid, MyBgworkerEntry->bgw_name,
+ 	ParallelWorkerNumber)));
  }
  
  /*



Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas
On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane  wrote:
>> One thought is that the only places where shm_mq_set_sender() should
>> be getting invoked during the main regression tests are
>> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
>> places use ParallelWorkerNumber to figure out what address to pass.
>> So if ParallelWorkerNumber were getting set to the same value in two
>> different parallel workers - e.g. because the postmaster went nuts and
>> launched two processes instead of only one - or if
>> ParallelWorkerNumber were not getting initialized at all or were
>> getting initialized to some completely bogus value, it could cause
>> this symptom.
>
> Hmm.  With some generous assumptions it'd be possible to think that
> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
> present in 20 successful lorikeet runs before the first of these failures,
> which is a bit more than the MTBF after that, but not a huge amount more.
>
> That commit in itself looks innocent enough, but could it have exposed
> some latent bug in bgworker launching?

Hmm, that's a really interesting idea, but I can't quite put together
a plausible theory around it.  I mean, it seems like that commit could
make launching bgworkers faster, which could conceivably tickle some
heretofore-latent timing-related bug.  But it wouldn't, IIUC, make the
first worker start any faster than before - it would just make them
more closely-spaced thereafter, and it's not very obvious how that
would cause a problem.

Another idea is that the commit in question is managing to corrupt
BackgroundWorkerList somehow.  maybe_start_bgworkers() is using
slist_foreach_modify(), but previously it always returned after
calling do_start_bgworker, and now it doesn't.  So if
do_start_bgworker() did something that could modify the list
structure, then perhaps maybe_start_bgworkers() would get confused.  I
don't really think that this theory has any legs, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan
>  wrote:
>> Buildfarm member lorikeet is failing occasionally with a failed
>> assertion during the select_parallel regression tests like this:

> I don't *think* we've made any relevant code changes lately.  The only
> thing that I can see as looking at all relevant is
> b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem
> like it can be to blame.

Yeah, I don't believe that either.  That could have introduced a hard
failure (if something were relying on initializing a field before where
I put the memsets) but it's hard to see how it could produce an
intermittent and platform-specific one.

> One thought is that the only places where shm_mq_set_sender() should
> be getting invoked during the main regression tests are
> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
> places use ParallelWorkerNumber to figure out what address to pass.
> So if ParallelWorkerNumber were getting set to the same value in two
> different parallel workers - e.g. because the postmaster went nuts and
> launched two processes instead of only one - or if
> ParallelWorkerNumber were not getting initialized at all or were
> getting initialized to some completely bogus value, it could cause
> this symptom.

Hmm.  With some generous assumptions it'd be possible to think that
aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this.  That commit was
present in 20 successful lorikeet runs before the first of these failures,
which is a bit more than the MTBF after that, but not a huge amount more.

That commit in itself looks innocent enough, but could it have exposed
some latent bug in bgworker launching?

regards, tom lane




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas
On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan
 wrote:
> Buildfarm member lorikeet is failing occasionally with a failed
> assertion during the select_parallel regression tests like this:
>
>
> 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG:  statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)
>
> I'll see if I can find out why, but if anyone has any ideas why this might be 
> happening (started about 3 weeks ago) that would be helpful.

I don't *think* we've made any relevant code changes lately.  The only
thing that I can see as looking at all relevant is
b6dd1271281ce856ab774fc0b491a92878e3b501, but that doesn't really seem
like it can be to blame.

One thought is that the only places where shm_mq_set_sender() should
be getting invoked during the main regression tests are
ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
places use ParallelWorkerNumber to figure out what address to pass.
So if ParallelWorkerNumber were getting set to the same value in two
different parallel workers - e.g. because the postmaster went nuts and
launched two processes instead of only one - or if
ParallelWorkerNumber were not getting initialized at all or were
getting initialized to some completely bogus value, it could cause
this symptom.
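
For reference, shm_mq_set_sender() is approximately this (simplified from
shm_mq.c, so treat it as a sketch):

void
shm_mq_set_sender(shm_mq *mq, PGPROC *proc)
{
    volatile shm_mq *vmq = mq;
    PGPROC     *receiver;

    SpinLockAcquire(&mq->mq_mutex);
    Assert(vmq->mq_sender == NULL);     /* the assertion lorikeet is hitting */
    vmq->mq_sender = proc;
    receiver = vmq->mq_receiver;
    SpinLockRelease(&mq->mq_mutex);

    if (receiver != NULL)
        SetLatch(&receiver->procLatch);
}

The sender is only ever supposed to be set once per queue, so two processes
arriving at the same queue address is about the only way to trip it.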

What ought to be happening, if there are N workers launched by a
parallel query, is that ParallelWorkerNumber should be different in
every worker, over the range 0 to N-1.  I think if I were you my first
step would be to verify that ParallelWorkerNumber is in that range in
the crashed backend, and if it is, my second step would be to add some
debugging code to ParallelWorkerMain() to print it out in every worker
that gets launched and make sure they're all in range and different.
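
The debugging code could be as simple as this, right after
ParallelWorkerNumber is copied out of bgw_extra (a sketch):

elog(LOG, "parallel worker PID %d has ParallelWorkerNumber %d",
     MyProcPid, ParallelWorkerNumber);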

All of the above might be going in the wrong direction entirely, but
it's the first thing that comes to mind for me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Tom Lane
Andrew Dunstan  writes:
> Buildfarm member lorikeet is failing occasionally with a failed
> assertion during the select_parallel regression tests like this:

> 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG:  statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)

> I'll see if I can find out why, but if anyone has any ideas why this might be 
> happening (started about 3 weeks ago) that would be helpful.

Well, this seems like it has to indicate an incorrect call of
shm_mq_set_sender.  I have no great insights as to what might be causing
that, but I sure find it to be pretty unacceptable coding practice that
the call sites are not checking for failure returns from shm_toc_lookup.
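
That is, rather than using the result blindly, every lookup ought to be
spelled more like this (a sketch, assuming the current shm_toc_lookup()
that returns NULL for a missing key; the key name is just illustrative):

error_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_ERROR_QUEUE);
if (error_queue_space == NULL)
    elog(ERROR, "could not find error queues in shared memory TOC");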

regards, tom lane




[HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Andrew Dunstan

Buildfarm member lorikeet is failing occasionally with a failed
assertion during the select_parallel regression tests like this:


2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG:  statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", Line: 221)

I'll see if I can find out why, but if anyone has any ideas why this might be 
happening (started about 3 weeks ago) that would be helpful.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


