Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-10 Thread Noah Misch
On Thu, Aug 03, 2017 at 10:45:50AM -0400, Robert Haas wrote: > On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch wrote: > > postmaster algorithms rely on the PG_SETMASK() calls preventing that. > > Without > > such protection, duplicate bgworkers are an understandable result. I

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-03 Thread Robert Haas
On Wed, Aug 2, 2017 at 11:47 PM, Noah Misch wrote: > postmaster algorithms rely on the PG_SETMASK() calls preventing that. Without > such protection, duplicate bgworkers are an understandable result. I caught > several other assertions; the PMChildFlags failure is another

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-08-02 Thread Noah Misch
On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote: > Today, lorikeet failed with a new variant on the bgworker start crash: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 > > This one is even more exciting than the last one, because it sure

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Amit Kapila
On Mon, Jun 26, 2017 at 8:09 PM, Andrew Dunstan wrote: > > > On 06/26/2017 10:36 AM, Amit Kapila wrote: >> On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan >> wrote: >>> >>> On 06/22/2017 10:24 AM, Tom Lane wrote: Andrew Dunstan

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan
On 06/26/2017 10:45 AM, Tom Lane wrote: > Andrew Dunstan writes: >> On 06/23/2017 07:47 AM, Andrew Dunstan wrote: >>> Rerunning with some different settings to see if I can get separate cores. >> Numerous attempts to get core dumps following methods suggested in

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Tom Lane
Andrew Dunstan writes: > On 06/23/2017 07:47 AM, Andrew Dunstan wrote: >> Rerunning with some different settings to see if I can get separate cores. > Numerous attempts to get core dumps following methods suggested in > Google searches have failed. The latest one

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan
On 06/26/2017 10:36 AM, Amit Kapila wrote: > On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan > wrote: >> >> On 06/22/2017 10:24 AM, Tom Lane wrote: >>> Andrew Dunstan writes: Please let me know if there are tests I can run. I

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Amit Kapila
On Fri, Jun 23, 2017 at 9:12 AM, Andrew Dunstan wrote: > > > On 06/22/2017 10:24 AM, Tom Lane wrote: >> Andrew Dunstan writes: >>> Please let me know if there are tests I can run. I missed your earlier >>> request in this thread,

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-26 Thread Andrew Dunstan
On 06/23/2017 07:47 AM, Andrew Dunstan wrote: > > On 06/23/2017 12:11 AM, Tom Lane wrote: >> Andrew Dunstan writes: >>> On 06/22/2017 10:24 AM, Tom Lane wrote: That earlier request is still valid. Also, if you can reproduce the symptom that lorikeet

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-23 Thread Andrew Dunstan
On 06/23/2017 12:11 AM, Tom Lane wrote: > Andrew Dunstan writes: >> On 06/22/2017 10:24 AM, Tom Lane wrote: >>> That earlier request is still valid. Also, if you can reproduce the >>> symptom that lorikeet just showed and get a stack trace from the >>>

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Tom Lane
Andrew Dunstan writes: > On 06/22/2017 10:24 AM, Tom Lane wrote: >> That earlier request is still valid. Also, if you can reproduce the >> symptom that lorikeet just showed and get a stack trace from the >> (hypothetical) postmaster core dump, that would be hugely

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Andrew Dunstan
On 06/22/2017 10:24 AM, Tom Lane wrote: > Andrew Dunstan writes: >> Please let me know if there are tests I can run. I missed your earlier >> request in this thread, sorry about that. > That earlier request is still valid. Also, if you can reproduce the >

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Amit Kapila
On Thu, Jun 22, 2017 at 7:54 PM, Tom Lane wrote: > Andrew Dunstan writes: >> Please let me know if there are tests I can run. I missed your earlier >> request in this thread, sorry about that. > > That earlier request is still valid. > Yeah,

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Tom Lane
Andrew Dunstan writes: > Please let me know if there are tests I can run. I missed your earlier > request in this thread, sorry about that. That earlier request is still valid. Also, if you can reproduce the symptom that lorikeet just showed and get a stack

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-22 Thread Andrew Dunstan
On 06/21/2017 06:44 PM, Tom Lane wrote: > Today, lorikeet failed with a new variant on the bgworker start crash: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 > > This one is even more exciting than the last one, because it sure looks > like the

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-21 Thread Tom Lane
Today, lorikeet failed with a new variant on the bgworker start crash: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10 This one is even more exciting than the last one, because it sure looks like the crashing bgworker took the postmaster down with it.

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 5:16 PM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote: >>> ... nodeGather cannot deem the query done until it's seen EOF on >>> each tuple queue, which it cannot

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas writes: > On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote: >> ... nodeGather cannot deem the query done until it's seen EOF on >> each tuple queue, which it cannot see until each worker has attached >> to and then detached from the

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 5:06 PM, Tom Lane wrote: > I wrote: >> Robert Haas writes: >>> I think you're right. So here's a theory: > >>> 1. The ERROR mapping the DSM segment is just a case of the worker >>> losing a race, and isn't a bug. > >> I

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
I wrote: > Robert Haas writes: >> I think you're right. So here's a theory: >> 1. The ERROR mapping the DSM segment is just a case of the worker >> losing a race, and isn't a bug. > I concur that this is a possibility, Actually, no, it isn't. I tried to reproduce

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas writes: > I think you're right. So here's a theory: > 1. The ERROR mapping the DSM segment is just a case of the worker > losing a race, and isn't a bug. I concur that this is a possibility, but if we expect this to happen, seems like there should be

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:21 AM, Amit Kapila wrote: > Yes, I think it is for next query. If you refer to the log below from lorikeet: > > 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG: statement: > EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM >

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas writes: > On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane wrote: >> ... er, -ENOCAFFEINE. Nonetheless, there are no checks of >> EXEC_FLAG_EXPLAIN_ONLY in any parallel-query code, so I think >> a bet is being missed somewhere. > ExecGather() is

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:38 AM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote: >>> It's fairly hard to read this other than as telling us that the worker was >>> launched for the EXPLAIN

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas writes: > On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote: >> It's fairly hard to read this other than as telling us that the worker was >> launched for the EXPLAIN (although really? why aren't we skipping that if >>

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:32 AM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >>> But we know, from the subsequent failed assertion, that the leader was >>> still trying to launch

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas writes: > On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >> But we know, from the subsequent failed assertion, that the leader was >> still trying to launch parallel workers. So that particular theory >> doesn't hold water. > Is there

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Amit Kapila
On Thu, Jun 15, 2017 at 7:42 PM, Robert Haas wrote: > On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >>> Well, as Amit points out, there are entirely legitimate ways for that >>> to happen. If the leader finishes the whole query itself before the

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Thu, Jun 15, 2017 at 10:05 AM, Tom Lane wrote: >> Well, as Amit points out, there are entirely legitimate ways for that >> to happen. If the leader finishes the whole query itself before the >> worker reaches the dsm_attach() call, it will call dsm_detach(), >> destroying
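The ordering described in the quoted text above (leader detaches before the worker attaches) can be sketched as a toy model. This is not PostgreSQL code: `SegmentRegistry` and `run` are hypothetical names, the reference-counted table is only an assumed stand-in for the shared bookkeeping that dsm_attach() and dsm_detach() consult, and the thread itself goes on to dispute whether this ordering can actually occur in the failing test.

```python
class SegmentRegistry:
    """Hypothetical stand-in for a shared table of live segments."""

    def __init__(self):
        self._refcounts = {}
        self._next_handle = 1

    def create(self):
        # The creator (the leader) holds the first reference.
        handle = self._next_handle
        self._next_handle += 1
        self._refcounts[handle] = 1
        return handle

    def attach(self, handle):
        # Attaching fails when no entry for the handle exists any more.
        if handle not in self._refcounts:
            return False
        self._refcounts[handle] += 1
        return True

    def detach(self, handle):
        # Dropping the last reference destroys the segment entry.
        self._refcounts[handle] -= 1
        if self._refcounts[handle] == 0:
            del self._refcounts[handle]


def run(worker_attaches_first):
    """Play out one of the two orderings and report what the worker sees."""
    registry = SegmentRegistry()
    handle = registry.create()
    if worker_attaches_first:
        attached = registry.attach(handle)  # worker gets there in time
        registry.detach(handle)             # leader finishes afterwards
    else:
        registry.detach(handle)             # leader finishes the query first
        attached = registry.attach(handle)  # worker arrives too late
    if attached:
        return "attached"
    return "could not map dynamic shared memory segment"
```

In this model the worker's outcome depends only on which side runs first, which is why the theory treats the ERROR as a lost race rather than a bug.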

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Tom Lane
Robert Haas writes: > On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane wrote: >> The lack of any other message before the 'could not map' failure must, >> then, mean that dsm_attach() couldn't find an entry in shared memory >> that it wanted to attach to. But

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Robert Haas
On Wed, Jun 14, 2017 at 6:01 PM, Tom Lane wrote: > I wrote: >> But surely the silent treatment should only apply to DSM_OP_CREATE? > > Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. > > The lack of any other message before the 'could not map' failure must, > then,

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-15 Thread Amit Kapila
On Thu, Jun 15, 2017 at 3:31 AM, Tom Lane wrote: > I wrote: >> But surely the silent treatment should only apply to DSM_OP_CREATE? > > Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. > > The lack of any other message before the 'could not map' failure must, > then,

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane
I wrote: > But surely the silent treatment should only apply to DSM_OP_CREATE? Oh ... scratch that, it *does* only apply to DSM_OP_CREATE. The lack of any other message before the 'could not map' failure must, then, mean that dsm_attach() couldn't find an entry in shared memory that it wanted to

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane
Robert Haas writes: > On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane wrote: >> So the first problem here is the lack of supporting information for the >> 'could not map' failure. > Hmm. I think I believed at the time I wrote dsm_attach() that > somebody

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Robert Haas
On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane wrote: > So the first problem here is the lack of supporting information for the > 'could not map' failure. Hmm. I think I believed at the time I wrote dsm_attach() that somebody might want to try to soldier on after failing to map a

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-14 Thread Tom Lane
Yesterday lorikeet failed the select_parallel test in a new way: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment 2017-06-13 16:44:57.248 EDT

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Tom Lane
Robert Haas writes: > On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila wrote: >> I don't think so because this problem has been reported previously as >> well [1][2] even before the commit in question. >> >> [1] - >>

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Robert Haas
On Wed, Jun 7, 2017 at 6:36 AM, Amit Kapila wrote: > I don't think so because this problem has been reported previously as > well [1][2] even before the commit in question. > > [1] - >

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-07 Thread Amit Kapila
On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas wrote: > On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote: >>> One thought is that the only places where shm_mq_set_sender() should >>> be getting invoked during the main regression tests are >>>

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas
On Tue, Jun 6, 2017 at 4:25 PM, Tom Lane wrote: > (I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2 > or so.) I don't mind adding it permanently, but I think that's too high. Somebody running a lot of parallel queries could easily get enough messages to

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Tom Lane
Robert Haas writes: > On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote: >> Hmm. With some generous assumptions it'd be possible to think that >> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was >> present in 20 successful

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas
On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane wrote: >> One thought is that the only places where shm_mq_set_sender() should >> be getting invoked during the main regression tests are >> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those >> places using

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Tom Lane
Robert Haas writes: > On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan > wrote: >> Buildfarm member lorikeet is failing occasionally with a failed >> assertion during the select_parallel regression tests like this: > I don't *think* we've

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-06 Thread Robert Haas
On Mon, Jun 5, 2017 at 10:40 AM, Andrew Dunstan wrote: > Buildfarm member lorikeet is failing occasionally with a failed > assertion during the select_parallel regression tests like this: > > > 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement:

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Tom Lane
Andrew Dunstan writes: > Buildfarm member lorikeet is failing occasionally with a failed > assertion during the select_parallel regression tests like this: > 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select > count(*) from tenk1, tenk2

[HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Andrew Dunstan
Buildfarm member lorikeet is failing occasionally with a failed assertion during the select_parallel regression tests like this: 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG: statement: select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0; TRAP: