Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-05 Thread Tomas Vondra
On 03/05/2018 09:37 PM, Thomas Munro wrote: > On Tue, Mar 6, 2018 at 9:17 AM, Robert Haas wrote: >> The optimistic approach seems a little bit less likely to slow this >> down on systems where barriers are expensive, so I committed that one. >> Thanks for debugging this; I hope this fixes it, bu

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-05 Thread Thomas Munro
On Tue, Mar 6, 2018 at 9:17 AM, Robert Haas wrote: > The optimistic approach seems a little bit less likely to slow this > down on systems where barriers are expensive, so I committed that one. > Thanks for debugging this; I hope this fixes it, but I guess we'll > see. Thanks. For the record, th

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-05 Thread Robert Haas
On Sun, Mar 4, 2018 at 4:46 PM, Thomas Munro wrote: > Thanks! Here are a couple of patches. I'm not sure which I prefer. > The "pessimistic" one looks simpler and is probably the way to go, but > the "optimistic" one avoids doing an extra read until it has actually > run out of data and seen mq_
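
For readers skimming the thread, here is a rough sketch of what an "optimistic" receiver check could look like. It is not the committed patch. It assumes the truncated identifier in the snippet above refers to the queue's detached flag, and it uses C11 atomics in place of PostgreSQL's barrier macros. All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared queue header; not PostgreSQL's shm_mq. */
typedef struct sketch_mq
{
    _Atomic size_t bytes_written;   /* advanced by the sender */
    atomic_bool    detached;        /* set when the sender goes away */
} sketch_mq;

typedef enum
{
    SKETCH_DATA,        /* unread bytes are available */
    SKETCH_WOULD_BLOCK, /* nothing to read yet, sender still attached */
    SKETCH_DETACHED     /* sender is gone and nothing is left to read */
} sketch_status;

/*
 * Optimistic check: the common path costs one plain read of the counter.
 * The fence and the second read happen only after the receiver has run
 * out of data AND observed the detached flag.
 */
static sketch_status
sketch_poll_optimistic(sketch_mq *mq, size_t consumed, size_t *available)
{
    size_t written;

    assert(mq != NULL && available != NULL);

    written = atomic_load_explicit(&mq->bytes_written, memory_order_relaxed);
    if (written > consumed)
    {
        *available = written - consumed;
        return SKETCH_DATA;
    }

    if (!atomic_load_explicit(&mq->detached, memory_order_relaxed))
        return SKETCH_WOULD_BLOCK;

    /*
     * The sender may have written a final message just before detaching.
     * Order the flag read before a second read of the counter, so that a
     * last-moment increase is not missed.
     */
    atomic_thread_fence(memory_order_acquire);
    written = atomic_load_explicit(&mq->bytes_written, memory_order_relaxed);
    if (written > consumed)
    {
        *available = written - consumed;
        return SKETCH_DATA;
    }

    return SKETCH_DETACHED;
}
```

The second read only helps if the sender orders its final counter update before setting the flag, which is the write-barrier question raised earlier in the thread.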

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-05 Thread Thomas Munro
On Tue, Mar 6, 2018 at 5:04 AM, Alvaro Herrera wrote: > Thomas Munro wrote: >> On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander wrote: >> > Um. Have you actually seen the "mail archive app" cut long threads off in >> > other cases? Because it's certainly not supposed to do that... >> >> Hi Magnus

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-05 Thread Alvaro Herrera
Thomas Munro wrote: > On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander wrote: > > Um. Have you actually seen the "mail archive app" cut long threads off in > > other cases? Because it's certainly not supposed to do that... > > Hi Magnus, > > I mean the "flat" thread view: > > https://www.postgr

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-04 Thread Thomas Munro
On Mon, Mar 5, 2018 at 4:05 AM, Tomas Vondra wrote: > On 03/04/2018 10:27 AM, Thomas Munro wrote: >> I can fix it with the following patch, which writes XXX out to the log >> where it would otherwise miss a final message sent just before >> detaching with sufficiently bad timing/memory ordering.

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-04 Thread Tomas Vondra
On 03/04/2018 10:27 AM, Thomas Munro wrote: > On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro > wrote: >> Could shm_mq_detach_internal() need a pg_write_barrier() before it >> writes mq_detached = true, to make sure that anyone who observes that >> can also see the most recent increase of mq_bytes_

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-04 Thread Thomas Munro
On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander wrote: > Um. Have you actually seen the "mail archive app" cut long threads off in > other cases? Because it's certainly not supposed to do that... Hi Magnus, I mean the "flat" thread view: https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQ

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-04 Thread Magnus Hagander
On Sun, Mar 4, 2018 at 3:51 AM, Thomas Munro wrote: > On Sun, Mar 4, 2018 at 3:48 PM, Thomas Munro > wrote: > > I've seen it several times on Travis CI. (So I would normally have > > been able to tell you about this problem before the was committed, > > except that the email thread was too long

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-04 Thread Thomas Munro
On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro wrote: > Could shm_mq_detach_internal() need a pg_write_barrier() before it > writes mq_detached = true, to make sure that anyone who observes that > can also see the most recent increase of mq_bytes_written? I can reproduce both failure modes (missing
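
The ordering question quoted above can be illustrated with a small sketch. It is not PostgreSQL code: it assumes a C11 release fence as a stand-in for `pg_write_barrier()`, and the `demo_*` names and struct layout are hypothetical. The point is only that the final increase of the byte counter has to be ordered before the detached flag, and that an observer of the flag needs a matching fence before reading the counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared queue header; not PostgreSQL's shm_mq. */
typedef struct demo_mq
{
    _Atomic size_t bytes_written;   /* advanced by the sender */
    atomic_bool    detached;        /* set once when the sender goes away */
} demo_mq;

/* Sender: account for a final message, then detach. */
static void
demo_send_and_detach(demo_mq *mq, size_t nbytes)
{
    size_t cur;

    assert(mq != NULL);

    cur = atomic_load_explicit(&mq->bytes_written, memory_order_relaxed);
    atomic_store_explicit(&mq->bytes_written, cur + nbytes,
                          memory_order_relaxed);

    /*
     * Stand-in for the write barrier asked about above: the counter
     * update must become visible no later than the detached flag.
     */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&mq->detached, true, memory_order_relaxed);
}

/*
 * Receiver: returns false while the sender is attached. Once the flag is
 * seen, stores the final value of the counter in *final_written.
 */
static bool
demo_observe_detach(demo_mq *mq, size_t *final_written)
{
    assert(mq != NULL && final_written != NULL);

    if (!atomic_load_explicit(&mq->detached, memory_order_relaxed))
        return false;

    /* Pairs with the sender's release fence. */
    atomic_thread_fence(memory_order_acquire);

    *final_written = atomic_load_explicit(&mq->bytes_written,
                                          memory_order_relaxed);
    return true;
}
```

Without both fences, a receiver on a weakly ordered machine could in principle see the flag set while still reading a stale counter, which would look like a lost final message.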

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Thomas Munro
On Sun, Mar 4, 2018 at 4:37 PM, Thomas Munro wrote: > Could it be that a concurrency bug causes tuples to be lost on the > tuple queue, and also sometimes causes X (terminate) messages to be > lost from the error queue, so that the worker appears to go away > unexpectedly? Could shm_mq_detach_int

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Thomas Munro
On Sun, Mar 4, 2018 at 4:17 PM, Tomas Vondra wrote: > On 03/04/2018 04:11 AM, Thomas Munro wrote: >> On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra >> wrote: >>> ! ERROR: lost connection to parallel worker >> >> That sounds like the new defences from >> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f. >

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Tomas Vondra
On 03/04/2018 04:11 AM, Thomas Munro wrote: > On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra > wrote: >> I've started "make check" with parallel_schedule tweaked to contain many >> select_parallel runs, and so far I've seen a couple of failures like >> this (about 10 failures out of 1500 runs): >>

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Thomas Munro
On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra wrote: > I've started "make check" with parallel_schedule tweaked to contain many > select_parallel runs, and so far I've seen a couple of failures like > this (about 10 failures out of 1500 runs): > > select count(*) from tenk1, tenk2 where tenk1.hun

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Tomas Vondra
On 03/04/2018 03:40 AM, Andres Freund wrote: > > > On March 3, 2018 6:36:51 PM PST, Tomas Vondra > wrote: >> On 03/04/2018 03:20 AM, Thomas Munro wrote: >>> Hi, >>> >>> I saw a one-off failure like this: >>> >>> QUERY PLAN >>> >> ---

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Thomas Munro
On Sun, Mar 4, 2018 at 3:48 PM, Thomas Munro wrote: > I've seen it several times on Travis CI. (So I would normally have > been able to tell you about this problem before the was committed, > except that the email thread was too long and the mail archive app > cuts long threads off!) (Correction

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Thomas Munro
On Sun, Mar 4, 2018 at 3:40 PM, Andres Freund wrote: > On March 3, 2018 6:36:51 PM PST, Tomas Vondra > wrote: >>On 03/04/2018 03:20 AM, Thomas Munro wrote: >>> Hi, >>> >>> I saw a one-off failure like this: >>> >>> QUERY PLAN >>> >>--

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Andres Freund
On March 3, 2018 6:36:51 PM PST, Tomas Vondra wrote: >On 03/04/2018 03:20 AM, Thomas Munro wrote: >> Hi, >> >> I saw a one-off failure like this: >> >> QUERY PLAN >> >-- >>Aggregate

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Tomas Vondra
On 03/04/2018 03:20 AM, Thomas Munro wrote: > Hi, > > I saw a one-off failure like this: > > QUERY PLAN > -- >Aggregate (actual rows=1 loops=1) > !-> Nested Loop (actual rows=98000

select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

2018-03-03 Thread Thomas Munro
Hi, I saw a one-off failure like this: QUERY PLAN -- Aggregate (actual rows=1 loops=1) !-> Nested Loop (actual rows=98000 loops=1) -> Seq Scan on tenk2 (actual rows=10 l