Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-06-06 Thread Robert Haas
On Fri, Jun 3, 2016 at 11:13 PM, Robert Haas  wrote:
>> I think the workers should stop processing tuples once the tuple queues
>> have been detached.  This will not only handle the above situation
>> gracefully, but will also speed up queries where a Limit clause sits on top
>> of the Gather node.  A patch for this is attached (it was part of the
>> original parallel seq scan patch, but was not applied; as far as I
>> remember, we thought such an optimization might not be required for the
>> initial version).
>
> This is very likely a good idea, but...
>
>> Another approach to fix this issue could be to reset mqh_partial_bytes and
>> mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED.  These
>> are currently reset only in case of success.
>
> ...I think we should do this too

Patches for both of these things are now committed.  Barring problems
with those commits, I believe this resolves this open item.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-06-06 Thread Robert Haas
On Sat, Jun 4, 2016 at 2:57 AM, Amit Kapila  wrote:
> In the above change, you are first adding bytes_written and then doing the
> SHM_MQ_DETACHED check, whereas in the other place the check is done first,
> which seems right.  I think it doesn't matter either way, but it is better
> to be consistent.  Also, isn't it better to set mqh_length_word_complete to
> false, since the next time this API is called it should first try to write
> the length into the buffer?

OK, committed after (I hope) fixing those issues.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-06-04 Thread Amit Kapila
On Sat, Jun 4, 2016 at 8:43 AM, Robert Haas  wrote:
>
> On Thu, May 26, 2016 at 5:57 AM, Amit Kapila  wrote:
> >
> > I am able to reproduce the assertion (it occurs once in two to three
> > times, but always at the same place) that you reported upthread with the
> > above query.  It seems to me the issue here is that while the workers are
> > writing tuples into the tuple queue, the master backend has detached from
> > the queues.  The master backend detaches from the tuple queues because of
> > the Limit clause: after processing the number of tuples specified by the
> > Limit clause, it shuts down the nodes in the below part of the code:
>
> I can't reproduce this assertion failure on master.  I tried running
> 'make installcheck' and then running this query repeatedly in the
> regression database with and without
> parallel_setup_cost=parallel_tuple_cost=0, and got nowhere.  Does that
> work for you, or do you have some other set of steps?
>

I have tried it by populating the pg_statistic table after running make
installcheck.  The way to populate pg_statistic is to create a lot of tables
and insert a few rows in each, as mentioned at the end of the mail upthread:
https://www.postgresql.org/message-id/CAA4eK1KOKGqmz9bGu%2BZ42qhRwMbm4R5rfnqsLCNqFs9j14jzEA%40mail.gmail.com

Today I again tried reproducing it using the same steps, but could not.
This is a timing issue and difficult to reproduce; last time, too, I spent
quite some time reproducing it.  I think we can fix the issue as per my
earlier analysis and then let Andreas run his tests to see whether he still
sees the issue.

> > I think the workers should stop processing tuples once the tuple queues
> > have been detached.  This will not only handle the above situation
> > gracefully, but will also speed up queries where a Limit clause sits on
> > top of the Gather node.  A patch for this is attached (it was part of the
> > original parallel seq scan patch, but was not applied; as far as I
> > remember, we thought such an optimization might not be required for the
> > initial version).
>
> This is very likely a good idea, but...
>
> > Another approach to fix this issue could be to reset mqh_partial_bytes
> > and mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED.
> > These are currently reset only in case of success.
>
> ...I think we should do this too, because it's intended that calling
> shm_mq_sendv again after it previously returned SHM_MQ_DETACHED should
> again return SHM_MQ_DETACHED, not fail an assertion.
>

  res = shm_mq_send_bytes(mqh, j, tmpbuf, nowait, &bytes_written);
  mqh->mqh_partial_bytes += bytes_written;
+ if (res == SHM_MQ_DETACHED)
+ {
+     mqh->mqh_partial_bytes = 0;
+     return res;
+ }

In the above change, you are first adding bytes_written and then doing the
SHM_MQ_DETACHED check, whereas in the other place the check is done first,
which seems right.  I think it doesn't matter either way, but it is better
to be consistent.  Also, isn't it better to set mqh_length_word_complete to
false, since the next time this API is called it should first try to write
the length into the buffer?
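
For illustration only (this is not the committed change, just the shape of
what is being suggested, reusing the variables from the snippet above): do
the SHM_MQ_DETACHED check before accounting the written bytes, and also
reset the length-word state so that a later call starts over by writing the
length word:

  res = shm_mq_send_bytes(mqh, j, tmpbuf, nowait, &bytes_written);
  if (res == SHM_MQ_DETACHED)
  {
      /* Reset partial-send state so a subsequent call starts afresh. */
      mqh->mqh_partial_bytes = 0;
      mqh->mqh_length_word_complete = false;
      return res;
  }
  mqh->mqh_partial_bytes += bytes_written;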

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-06-03 Thread Robert Haas
On Thu, May 26, 2016 at 5:57 AM, Amit Kapila  wrote:
> On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich  wrote:
>>
>>
>> Each of the sent plans was collected when a worker dumped core due to
>> the failed assertion.  More core dumps than plans were actually
>> observed, since with this failed assertion, multiple workers usually
>> trip on and dump core simultaneously.
>>
>> The following query corresponds to plan2:
>>
>> --8<---cut here---start->8---
>> select
>>   pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
>>   subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
>>   (select unique1 from public.bprime limit 1 offset 9) as c5,
>>   subq_0.c2 as c6
>> from
>> (select ref_0.tablename as c0, ref_0.inherited as c1,
>> ref_0.histogram_bounds as c2, 100 as c3
>>   from
>> pg_catalog.pg_stats as ref_0
>>   where 49 is not NULL limit 55) as subq_0
>> where true
>> limit 58;
>> --8<---cut here---end--->8---
>>
>
> I am able to reproduce the assertion (it occurs once in two to three times,
> but always at the same place) that you reported upthread with the above
> query.  It seems to me the issue here is that while the workers are writing
> tuples into the tuple queue, the master backend has detached from the
> queues.  The master backend detaches from the tuple queues because of the
> Limit clause: after processing the number of tuples specified by the Limit
> clause, it shuts down the nodes in the below part of the code:

I can't reproduce this assertion failure on master.  I tried running
'make installcheck' and then running this query repeatedly in the
regression database with and without
parallel_setup_cost=parallel_tuple_cost=0, and got nowhere.  Does that
work for you, or do you have some other set of steps?

> I think the workers should stop processing tuples once the tuple queues
> have been detached.  This will not only handle the above situation
> gracefully, but will also speed up queries where a Limit clause sits on top
> of the Gather node.  A patch for this is attached (it was part of the
> original parallel seq scan patch, but was not applied; as far as I
> remember, we thought such an optimization might not be required for the
> initial version).

This is very likely a good idea, but...

> Another approach to fix this issue could be to reset mqh_partial_bytes and
> mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED.  These
> are currently reset only in case of success.

...I think we should do this too, because it's intended that calling
shm_mq_sendv again after it previously returned SHM_MQ_DETACHED should
again return SHM_MQ_DETACHED, not fail an assertion.  Can you see
whether the attached patch fixes this for you?

(Status update for Noah: I will provide another update regarding this
issue no later than Monday COB, US time.  I assume that Amit will have
responded by then, and it should hopefully be clear what the next step
is at that point.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


dont-fail-mq-assert-v1.patch
Description: binary/octet-stream



Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-06-01 Thread Noah Misch
On Thu, May 26, 2016 at 03:27:40PM +0530, Amit Kapila wrote:
> I think the workers should stop processing tuples once the tuple queues
> have been detached.  This will not only handle the above situation
> gracefully, but will also speed up queries where a Limit clause sits on top
> of the Gather node.  A patch for this is attached (it was part of the
> original parallel seq scan patch, but was not applied; as far as I
> remember, we thought such an optimization might not be required for the
> initial version).
> 
> Another approach to fix this issue could be to reset mqh_partial_bytes and
> mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED.
> These are currently reset only in case of success.

[Action required within 72 hours.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
efforts toward speedy resolution.  Thanks.

[1] 
http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com




Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-26 Thread Amit Kapila
On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich  wrote:
>
>
> Each of the sent plans was collected when a worker dumped core due to
> the failed assertion.  More core dumps than plans were actually
> observed, since with this failed assertion, multiple workers usually
> trip on and dump core simultaneously.
>
> The following query corresponds to plan2:
>
> --8<---cut here---start->8---
> select
>   pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
>   subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
>   (select unique1 from public.bprime limit 1 offset 9) as c5,
>   subq_0.c2 as c6
> from
> (select ref_0.tablename as c0, ref_0.inherited as c1,
> ref_0.histogram_bounds as c2, 100 as c3
>   from
> pg_catalog.pg_stats as ref_0
>   where 49 is not NULL limit 55) as subq_0
> where true
> limit 58;
> --8<---cut here---end--->8---
>
>
I am able to reproduce the assertion (it occurs once in two to three times,
but always at the same place) that you reported upthread with the above
query.  It seems to me the issue here is that while the workers are writing
tuples into the tuple queue, the master backend has detached from the
queues.  The master backend detaches from the tuple queues because of the
Limit clause: after processing the number of tuples specified by the Limit
clause, it shuts down the nodes in the below part of the code:

ExecutePlan()
{
..
    if (TupIsNull(slot))
    {
        /* Allow nodes to release or shut down resources. */
        (void) ExecShutdownNode(planstate);
        break;
    }
..
}

Now, it is quite possible that the worker has written part of its data,
after which the queue got detached.  The callers of shm_mq
(tqueueReceiveSlot/shm_mq_send) don't handle SHM_MQ_DETACHED, due to which
they keep sending more data (the next tuple), which leads to the assertion
in the below code:

shm_mq_sendv()
{
..
    /* Write the actual data bytes into the buffer. */
    Assert(mqh->mqh_partial_bytes <= nbytes);
..
}

I think the workers should stop processing tuples once the tuple queues have
been detached.  This will not only handle the above situation gracefully, but
will also speed up queries where a Limit clause sits on top of the Gather
node.  A patch for this is attached (it was part of the original parallel seq
scan patch, but was not applied; as far as I remember, we thought such an
optimization might not be required for the initial version).

Another approach to fix this issue could be to reset mqh_partial_bytes and
mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED.  These
are currently reset only in case of success.
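
To illustrate the first approach, a rough sketch at the tuple-queue receiver
level could look like the below (simplified; the real tqueue.c code also
deals with remapping of transient record types, and the function name here
is only for illustration):

#include "postgres.h"
#include "access/htup_details.h"
#include "executor/tuptable.h"
#include "storage/shm_mq.h"

/*
 * Sketch only: if the leader has detached from our queue, report "stop"
 * to the caller instead of sending the next tuple and eventually tripping
 * the assertion in shm_mq_sendv().
 */
static bool
sketch_tqueue_receive_slot(TupleTableSlot *slot, shm_mq_handle *queue)
{
    HeapTuple       tuple = ExecMaterializeSlot(slot);
    shm_mq_result   result;

    result = shm_mq_send(queue, tuple->t_len, tuple->t_data, false);
    if (result == SHM_MQ_DETACHED)
        return false;           /* leader is gone; stop producing tuples */

    return true;                /* keep sending */
}

The caller in the worker would then stop fetching further tuples once the
receiver reports false.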

I have added this issue to the list of PostgreSQL 9.6 open items at
https://wiki.postgresql.org/wiki/PostgreSQL_9.6_Open_Items

The steps to reproduce it on the regression database are:
1. We need to create enough rows in pg_statistic so that a parallel plan can
be selected.
2. Use the below procedure to ensure sufficient rows get created in
pg_statistic:
CREATE OR REPLACE FUNCTION populate_pg_stat() RETURNS int AS
$BODY$
DECLARE
    count int;
BEGIN
    FOR count IN 1 .. 5
    LOOP
        Execute 'Create table t' || count || '(c1 int)';
        Execute 'insert into t' || count || ' values (generate_series(1,10))';
        Execute 'Analyze t' || count;
    End Loop;
    Return 1;
END
$BODY$
LANGUAGE plpgsql;
3. Set the below GUCs and execute the query:
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
Explain Analyze select
  pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
  subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
  (select unique1 from public.bprime limit 1 offset 9) as c5,
  subq_0.c2 as c6
from
(select ref_0.tablename as c0, ref_0.inherited as c1,
ref_0.histogram_bounds as c2, 100 as c3
  from
pg_catalog.pg_stats as ref_0
  where 49 is not NULL limit 55) as subq_0
where true
limit 58;


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


stop_processing_tuples_detached_queue_v1.patch
Description: Binary data



Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-24 Thread Andreas Seltenreich
Amit Kapila writes:

> On Mon, May 23, 2016 at 4:48 PM, Andreas Seltenreich  wrote:
>> plan6 corresponds to this query:
>>
> Are you sure that the core dumps you are seeing are due to plan6?

Each of the plans sent was harvested from a controlling process when the
above assertion failed in its workers.  I do not know whether the plans
themselves really are at fault, as most of the collected plans look ok
to me.  The backtrace in the controlling process always looks like the
one reported.  (Except when the core dumping took so long as to trigger a
statement_timeout in the still-running master; there are no plans/queries
available in this case, as the state is no longer available in an aborted
transaction.)

> I have tried to generate a parallel plan for the above query, and it seems
> to me that after applying the patches
> (avoid_restricted_clause_below_gather_v1.patch and
> prohibit_parallel_clause_below_rel_v1.patch), the plan it generates doesn't
> have a subplan below the Gather node [1].

> Without the patch avoid_restricted_clause_below_gather_v1.patch, it will
> allow pushing a subplan below the Gather node, so I think either there is
> some other plan (query) due to which you are seeing the core dumps, or the
> above two patches haven't been applied before testing.

According to my notes, the patches were applied in the instance that
crashed.  Given that I no longer see the other crash variants the patches
fix, and that the probability of this failed assertion per random query is
reduced by about a factor of 20 compared to testing without the patches
applied, I'm pretty certain this is not a bookkeeping error on my part.

> Is it possible that the core dump is due to plan2 or some other similar
> plan (I am not sure at this stage about the cause of the problem you are
> seeing, but if for some reason PARAM_EXEC params are pushed below the
> Gather, then such a plan might not work)?  If you think a plan other than
> plan6 can cause such a problem, can you share the query for plan2?

Each of the sent plans was collected when a worker dumped core due to
the failed assertion.  More core dumps than plans were actually
observed, since with this failed assertion, multiple workers usually
trip on and dump core simultaneously.

The following query corresponds to plan2:

--8<---cut here---start->8---
select
  pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
  subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
  (select unique1 from public.bprime limit 1 offset 9) as c5,
  subq_0.c2 as c6
from
(select ref_0.tablename as c0, ref_0.inherited as c1,
ref_0.histogram_bounds as c2, 100 as c3
  from
pg_catalog.pg_stats as ref_0
  where 49 is not NULL limit 55) as subq_0
where true
limit 58;
--8<---cut here---end--->8---

regards,
Andreas




Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-24 Thread Amit Kapila
On Mon, May 23, 2016 at 4:48 PM, Andreas Seltenreich  wrote:
>
> Amit Kapila writes:
>
> > Earlier problems were due to some unsafe/restricted expressions being
> > pushed below the Gather node as part of the target list, whereas in plan6
> > it seems some unsafe node is pushed below the Gather node.  It would be
> > helpful if you could share the offending query.
>
> plan6 corresponds to this query:
>
> select
> pg_catalog.anyarray_out(
> cast((select most_common_vals from pg_catalog.pg_stats limit 1 offset 41)
>  as anyarray)) as c0
>  from
> public.quad_point_tbl as ref_0 where ref_0.p ~= ref_0.p;
>

Are you sure that the core dumps you are seeing are due to plan6?  I have
tried to generate a parallel plan for the above query, and it seems to me
that after applying the patches
(avoid_restricted_clause_below_gather_v1.patch and
prohibit_parallel_clause_below_rel_v1.patch), the plan it generates doesn't
have a subplan below the Gather node [1].  Without the patch
avoid_restricted_clause_below_gather_v1.patch, it will allow pushing a
subplan below the Gather node, so I think either there is some other plan
(query) due to which you are seeing the core dumps, or the above two patches
haven't been applied before testing.  Is it possible that the core dump is
due to plan2 or some other similar plan (I am not sure at this stage about
the cause of the problem you are seeing, but if for some reason PARAM_EXEC
params are pushed below the Gather, then such a plan might not work)?  If
you think a plan other than plan6 can cause such a problem, can you share
the query for plan2?



[1]
 QUERY PLAN
--------------------------------------------------------------------------------
 Gather  (cost=543.59..5041.17 rows=1761 width=32)
   Output: anyarray_out($0)
   Workers Planned: 1
   InitPlan 1 (returns $0)
     ->  Limit  (cost=491.56..543.59 rows=1 width=32)
           Output: pg_stats.most_common_vals
           ->  Subquery Scan on pg_stats  (cost=127.30..491.56 rows=7 width=32)
                 Output: pg_stats.most_common_vals
                 ->  Hash Join  (cost=127.30..491.49 rows=7 width=401)
                       Output: NULL::name, NULL::name, NULL::name, NULL::boolean, NULL::real, NULL::integer, NULL::real, CASE WHEN (s.stakind1 = 1) THEN s.stavalues1 WHEN (s.stakind2 = 1) THEN s.stavalues2 WHEN (s.stakind3 = 1) THEN s.stavalues3 WHEN (s.stakind4 = 1) THEN s.stavalues4 WHEN (s.stakind5 = 1) THEN s.stavalues5 ELSE NULL::anyarray END, NULL::real[], NULL::anyarray, NULL::real, NULL::anyarray, NULL::real[], NULL::real[]
                       Hash Cond: ((a.attrelid = c.oid) AND (a.attnum = s.staattnum))
                       Join Filter: has_column_privilege(c.oid, a.attnum, 'select'::text)
                       ->  Seq Scan on pg_catalog.pg_attribute a  (cost=0.00..318.11 rows=6097 width=6)
                             Output: a.attrelid, a.attname, a.atttypid, a.attstattarget, a.attlen, a.attnum, a.attndims, a.attcacheoff, a.atttypmod, a.attbyval, a.attstorage, a.attalign, a.attnotnull, a.atthasdef, a.attisdropped, a.attislocal, a.attinhcount, a.attcollation, a.attacl, a.attoptions, a.attfdwoptions
                             Filter: (NOT a.attisdropped)
                       ->  Hash  (cost=117.59..117.59 rows=647 width=180)
                             Output: s.stakind1, s.stavalues1, s.stakind2, s.stavalues2, s.stakind3, s.stavalues3, s.stakind4, s.stavalues4, s.stakind5, s.stavalues5, s.starelid, s.staattnum, c.oid
                             ->  Hash Join  (cost=57.21..117.59 rows=647 width=180)
                                   Output: s.stakind1, s.stavalues1, s.stakind2, s.stavalues2, s.stakind3, s.stavalues3, s.stakind4, s.stavalues4, s.stakind5, s.stavalues5, s.starelid, s.staattnum, c.oid
                                   Hash Cond: (s.starelid = c.oid)
                                   ->  Seq Scan on pg_catalog.pg_statistic s  (cost=0.00..51.48 rows=648 width=176)
                                         Output: s.starelid, s.staattnum, s.stainherit, s.stanullfrac, s.stawidth, s.stadistinct, s.stakind1, s.stakind2, s.stakind3, s.stakind4, s.stakind5, s.staop1, s.staop2, s.staop3, s.staop4, s.staop5, s.stanumbers1, s.stanumbers2, s.stanumbers3, s.stanumbers4, s.stanumbers5, s.stavalues1, s.stavalues2, s.stavalues3, s.stavalues4, s.stavalues5
                                   ->  Hash  (cost=45.61..45.61 rows=928 width=8)
                                         Output: c.oid, c.relnamespace
                                         ->  Seq Scan on pg_catalog.pg_class 

Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-23 Thread Andreas Seltenreich
Amit Kapila writes:

> Earlier problems were due to some unsafe/restricted expressions being
> pushed below the Gather node as part of the target list, whereas in plan6
> it seems some unsafe node is pushed below the Gather node.  It would be
> helpful if you could share the offending query.

plan6 corresponds to this query:

select
pg_catalog.anyarray_out(
cast((select most_common_vals from pg_catalog.pg_stats limit 1 offset 41)
 as anyarray)) as c0
 from
public.quad_point_tbl as ref_0 where ref_0.p ~= ref_0.p;

regards,
Andreas




Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-23 Thread Amit Kapila
On Mon, May 23, 2016 at 2:19 PM, Andreas Seltenreich  wrote:
>
> I wrote:
> >> There's another class of parallel worker core dumps when testing master
> >> with sqlsmith.  In these cases, the following assertion fails for all
> >> workers simultaneously:
> >>
> >> TRAP: FailedAssertion("!(mqh->mqh_partial_bytes <= nbytes)", File:
> >> "shm_mq.c", Line: 386)
> >
> > I no longer observe these after applying these two patches by Amit
> > Kapila
>
> I spoke too soon: These still occur with the patches applied, but with
> much lower probability. (one core dump per 20e6 random queries instead
> of 1e6).
>
> Most of the collected plans look inconspicuous to me now, except for one
> that again had a subplan below a gather node (plan6).
>

This problem looks different from the previous problems you have reported.
Earlier problems were due to some unsafe/restricted expressions being
pushed below the Gather node as part of the target list, whereas in plan6
it seems some unsafe node is pushed below the Gather node.  It would be
helpful if you could share the offending query.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-23 Thread Andreas Seltenreich
I wrote:
>> There's another class of parallel worker core dumps when testing master
>> with sqlsmith.  In these cases, the following assertion fails for all
>> workers simultaneously:
>>
>> TRAP: FailedAssertion("!(mqh->mqh_partial_bytes <= nbytes)", File: 
>> "shm_mq.c", Line: 386)
>
> I no longer observe these after applying these two patches by Amit
> Kapila

I spoke too soon: These still occur with the patches applied, but with
much lower probability. (one core dump per 20e6 random queries instead
of 1e6).

Most of the collected plans look inconspicuous to me now, except for one
that again had a subplan below a gather node (plan6).  Tarball of all
collected plans attached.

regards,
Andreas



plans.tar.gz
Description: application/gzip



Re: [HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-22 Thread Andreas Seltenreich
I wrote:

> There's another class of parallel worker core dumps when testing master
> with sqlsmith.  In these cases, the following assertion fails for all
> workers simultaneously:
>
> TRAP: FailedAssertion("!(mqh->mqh_partial_bytes <= nbytes)", File: 
> "shm_mq.c", Line: 386)

I no longer observe these after applying these two patches by Amit
Kapila:

avoid_restricted_clause_below_gather_v1.patch
Message-ID: 

[HACKERS] [sqlsmith] Failed assertions on parallel worker shutdown

2016-05-22 Thread Andreas Seltenreich
There's another class of parallel worker core dumps when testing master
with sqlsmith.  In these cases, the following assertion fails for all
workers simultaneously:

TRAP: FailedAssertion("!(mqh->mqh_partial_bytes <= nbytes)", File: "shm_mq.c", Line: 386)

The backtrace of the controlling process is always in
ExecShutdownGatherWorkers.  The queries always work fine on re-running,
so I guess there is some race condition on worker shutdown?  Backtraces
below.

regards
andreas

Core was generated by `postgres: bgworker: parallel worker for PID 30525'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7f5a3df91067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x7f5a3df91067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f5a3df92448 in __GI_abort () at abort.c:89
#2  0x007eabe1 in ExceptionalCondition (conditionName=conditionName@entry=0x984e10 "!(mqh->mqh_partial_bytes <= nbytes)", errorType=errorType@entry=0x82a75d "FailedAssertion", fileName=fileName@entry=0x984b8c "shm_mq.c", lineNumber=lineNumber@entry=386) at assert.c:54
#3  0x006d8042 in shm_mq_sendv (mqh=0x25f17b8, iov=iov@entry=0x7ffc6352af00, iovcnt=iovcnt@entry=1, nowait=) at shm_mq.c:386
#4  0x006d807d in shm_mq_send (mqh=, nbytes=, data=, nowait=) at shm_mq.c:327
#5  0x005d96b9 in ExecutePlan (dest=0x25f1850, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x2612da8, estate=0x2612658) at execMain.c:1596
#6  standard_ExecutorRun (queryDesc=0x261a660, direction=, count=0) at execMain.c:338
#7  0x005dc7cf in ParallelQueryMain (seg=, toc=0x7f5a3ea6c000) at execParallel.c:735
#8  0x004e617b in ParallelWorkerMain (main_arg=) at parallel.c:1035
#9  0x00683862 in StartBackgroundWorker () at bgworker.c:726
#10 0x0068e9a2 in do_start_bgworker (rw=0x2590760) at postmaster.c:5531
#11 maybe_start_bgworker () at postmaster.c:5706
#12 0x0046cbba in ServerLoop () at postmaster.c:1762
#13 0x0069081e in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0x256d580) at postmaster.c:1298
#14 0x0046d80d in main (argc=4, argv=0x256d580) at main.c:228
(gdb) attach 30525
0x7f5a3e044e33 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x7f5a3e044e33 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x006d1b4e in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffc6352aec0, cur_timeout=-1, set=0x44251c0) at latch.c:981
#2  WaitEventSetWait (set=set@entry=0x44251c0, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffc6352aec0, nevents=nevents@entry=1) at latch.c:935
#3  0x006d1f96 in WaitLatchOrSocket (latch=0x7f5a3d898494, wakeEvents=wakeEvents@entry=1, sock=sock@entry=-1, timeout=timeout@entry=-1) at latch.c:347
#4  0x006d205d in WaitLatch (latch=, wakeEvents=wakeEvents@entry=1, timeout=timeout@entry=-1) at latch.c:302
#5  0x004e6d64 in WaitForParallelWorkersToFinish (pcxt=0x442d4e8) at parallel.c:537
#6  0x005dcf84 in ExecParallelFinish (pei=0x441cab8) at execParallel.c:541
#7  0x005eeead in ExecShutdownGatherWorkers (node=node@entry=0x3e3a070) at nodeGather.c:416
#8  0x005ef389 in ExecShutdownGather (node=0x3e3a070) at nodeGather.c:430
#9  0x005dd03d in ExecShutdownNode (node=0x3e3a070) at execProcnode.c:807
#10 0x0061ad73 in planstate_tree_walker (planstate=0x3e361a8, walker=0x5dd010 , context=0x0) at nodeFuncs.c:3442
#11 0x0061ad73 in planstate_tree_walker (planstate=0xf323c30, walker=0x5dd010 , context=0x0) at nodeFuncs.c:3442
#12 0x0061ad73 in planstate_tree_walker (planstate=0xf323960, walker=0x5dd010 , context=0x0) at nodeFuncs.c:3442
#13 0x005d96da in ExecutePlan (dest=0xb826868, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0xf323960, estate=0xf322b28) at execMain.c:1576
#14 standard_ExecutorRun (queryDesc=0xddca888, direction=, count=0) at execMain.c:338
#15 0x006f6e88 in PortalRunSelect (portal=portal@entry=0x258ccc8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0xb826868) at pquery.c:946
#16 0x006f83ae in PortalRun (portal=0x258ccc8, count=9223372036854775807, isTopLevel=, dest=0xb826868, altdest=0xb826868, completionTag=0x7ffc6352b3d0 "") at pquery.c:787
#17 0x006f5c63 in exec_simple_query (query_string=) at postgres.c:1094
#18 PostgresMain (argc=39374024, argv=0x25ed130, dbname=0x256e480 "regression", username=0x25ed308 "0\321^\002") at postgres.c:4059
#19 0x0046c8b2 in BackendRun (port=0x25935d0) at postmaster.c:4258
#20