Re: dsa_allocate() failure

2018-08-28 Thread Sand Stone
I attached a query (and its query plan) that caused the crash:
"dsa_allocate could not find 13 free pages" on one of the worker nodes. I
anonymised the query text a bit. Interestingly, this time only one of the
nodes (the same one each time) is crashing. Since this is a production
environment, I cannot get the stack trace. Once I turned off parallel
execution for this node, the whole query finished just fine. So the attached
parallel query plan is from one of the nodes that did not crash; hopefully
the same plan would have been executed on the crashed node. In theory, every
worker node has the same bits and very similar data.
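
The thread does not say exactly how parallel execution was switched off on
that node; a minimal sketch of one way to do it node-wide on the affected
worker, assuming superuser access there:

-- Run on the affected worker node only; takes effect without a restart.
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();
-- Revert later with: ALTER SYSTEM RESET max_parallel_workers_per_gather;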

===
psql (10.4)
\dx
                     List of installed extensions
  Name   | Version |   Schema   |            Description
---------+---------+------------+------------------------------------
 citus   | 7.4-3   | pg_catalog | Citus distributed database
 hll     | 2.10    | public     | type for storing hyperloglog data
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language


On Sat, Aug 25, 2018 at 7:46 AM Sand Stone  wrote:

> >Can you still see the problem with Citus 7.4?
> Hi, Thomas. I actually went back to the cluster with Citus 7.4 and
> PG 10.4 and modified the parallel parameters. So far, I haven't seen any
> server crash.
>
> The main difference between the runs that crashed and the runs that did
> not is the set of Linux TCP timeout parameters (tuned to release the
> ports faster). Unfortunately, I cannot "undo" the Linux params and run
> the stress tests anymore, as this is a multi-million-dollar cluster and
> people are doing more useful things on it. I will keep an eye out for
> any parallel execution issues.
>
>
> On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro
>  wrote:
> >
> > On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone 
> wrote:
> > > Just as a follow up. I tried the parallel execution again (in a stress
> > > test environment). Now the crash seems gone. I will keep an eye on
> > > this for the next few weeks.
> >
> > Thanks for the report.  That's great news, but it'd be good to
> > understand why it was happening.
> >
> > > My theory is that the Citus cluster created and shut down a lot of TCP
> > > connections between coordinator and workers. If running on untuned
> > > Linux machines, the TCP ports might run out.
> >
> > I'm not sure how that's relevant, unless perhaps it causes executor
> > nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
> > fix?  I wonder if there could be something different about the control
> > flow with custom scans, or something about the way Citus worker nodes
> > invoke plan fragments, or some error path that I failed to consider...
> > It's a clue that all of your worker nodes reliably crashed at the same
> > time on the same/similar queries (presumably distributed query
> > fragments for different shards), making it seem more like a
> > common-or-garden bug rather than some kind of timing-based heisenbug.
> > If you ever manage to reproduce it, an explain plan and a back trace
> > would be very useful.
> >
> > > Of course, I am using "newer" PG10 bits and Citus7.5 this time.
> >
> > Hmm.  There weren't any relevant commits to REL_10_STABLE that I can
> > think of.  And (with the proviso that I know next to nothing about
> > Citus) I just cloned https://github.com/citusdata/citus.git and
> > skimmed through "git diff origin/release-7.4..origin/release-7.5", and
> > nothing is jumping out at me.  Can you still see the problem with
> > Citus 7.4?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com
>
explain (analyze, verbose)
select
    atimecol as atimecol, count(1) as count, sum(avg) as sum
from
(
    select
        partitionid, atimecol as atimecol, avg(sum/count) as avg
    from atable
    where atimecol >= '7/31/2018'
      and atimecol <= '8/28/2018'
      and aText = 'Foo'
      and (jsoncol ->> 'X') = '-'
      and id = 12345
      and acol = 2
      and btimecol = '7/31/2018'
      and btimecol < '8/29/2018'
    group by partitionid, atimecol
) as subquery
group by atimecol;


                               QUERY PLAN
------------------------------------------------------------------------
 HashAggregate  (co

Re: dsa_allocate() failure

2018-08-15 Thread Sand Stone
Just as a follow up. I tried the parallel execution again (in a stress
test environment). Now the crash seems gone. I will keep an eye on
this for the next few weeks.

My theory is that the Citus cluster created and shut down a lot of TCP
connections between coordinator and workers. If running on untuned
Linux machines, the TCP ports might run out.

Of course, I am using "newer" PG10 bits and Citus7.5 this time.
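
One quick way to confirm which PostgreSQL build and Citus release a given
node is actually running (citus_version() comes with the Citus extension);
run it on each node you want to compare:

SELECT version();        -- PostgreSQL build string
SELECT citus_version();  -- Citus release on this node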
On Wed, May 23, 2018 at 7:06 AM Sand Stone  wrote:
>
> >> At which commit ID?
> 83fcc615020647268bb129cbf86f7661feee6412 (5/6)
>
> >>do you mean that these were separate PostgreSQL clusters, and they were all 
> >>running the same query and they all crashed like this?
> A few worker nodes; a table is hash partitioned on "aTable.did" by
> Citus, and further partitioned by PG10 into time ranges on field "ts". As
> far as I could tell, Citus just does a query rewrite and executes the
> same type of query on all nodes.
>
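A rough sketch of the layout described just above, with anonymised,
hypothetical names only (not the poster's actual DDL): PG10 range-partitions
the table on "ts", and Citus hash-distributes it on "did". The setup
described in this thread implies the Citus release in use can distribute
PG10 partitioned tables.

-- On the coordinator:
CREATE TABLE aTable (
    did    bigint NOT NULL,
    ts     timestamptz NOT NULL,
    v      int,
    count  bigint,
    sum    double precision
) PARTITION BY RANGE (ts);

CREATE TABLE aTable_2018_04 PARTITION OF aTable
    FOR VALUES FROM ('2018-04-01') TO ('2018-05-01');

-- Hash-distribute the partitioned table across the worker nodes.
SELECT create_distributed_table('aTable', 'did');
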
> >>so this happened at the same time or at different times?
> At the same time. The queries are simple count and sum queries, here
> is the relevant part from one of the worker nodes:
> 2018-05-23 01:24:01.492 UTC [130536] ERROR:  dsa_allocate could not
> find 7 free pages
> 2018-05-23 01:24:01.492 UTC [130536] CONTEXT:  parallel worker
> STATEMENT:  COPY (SELECT count(1) AS count, sum(worker_column_1) AS
> sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT
> aTable.did, avg((aTable.sum OPERATOR(pg_catalog./)
> (aTable.count)::double precision)) AS avg FROM public.aTable_102117
> aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25
> 00:00:00+00'::timestamp with time zone) AND (aTable.ts
> OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time
> zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY
> aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)
>
>
> >> a parallel worker process
> I think this is more of a PG10 parallel bg worker issue. I don't think
> Citus just lets each worker PG server do its own planning.
>
> I will try to do more experiments around this and see if there is any
> specific query that causes the parallel query execution to fail. As far
> as I can tell, the level of concurrency triggered this issue: when tens
> of queries like the one shown are executing on the worker nodes, the
> PG10 core may or may not spawn more bg workers, depending on the stats.
>
> Thanks for your time!
>
>
>
>
>
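Regarding the point above about the PG10 core spawning more background
workers depending on the stats: the relevant knobs can be checked per node,
and parallelism can also be capped per table. A minimal sketch, using the
anonymised table name from this thread (defaults shown are PG 10's):

-- Upper bounds on how many parallel workers a node will launch:
SHOW max_worker_processes;              -- total background workers (default 8)
SHOW max_parallel_workers;              -- of those, usable for parallel queries (default 8)
SHOW max_parallel_workers_per_gather;   -- per Gather node (default 2)

-- Optional per-table cap, overriding the planner's size-based choice:
ALTER TABLE aTable SET (parallel_workers = 1);
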
> On Tue, May 22, 2018 at 9:44 PM, Thomas Munro
>  wrote:
> > On Wed, May 23, 2018 at 4:10 PM, Sand Stone  wrote:
> >>>>dsa_allocate could not find 7 free pages
> >> I just saw this error message again on all of my worker nodes (I am using
> >> Citus 7.4 rel). The PG core is my own build of release_10_stable
> >> (10.4) out of GitHub on Ubuntu.
> >
> > At which commit ID?
> >
> > All of your worker nodes... so this happened at the same time or at
> > different times?  I don't know much about Citus -- do you mean that
> > these were separate PostgreSQL clusters, and they were all running the
> > same query and they all crashed like this?
> >
> >> What's the best way to debug this? I am running pre-production tests
> >> for the next few days, so I can gather info if necessary (I cannot
> >> pinpoint a query to repro this yet, as we have 10K queries running
> >> concurrently).
> >
> > Any chance of an EXPLAIN plan for the query that crashed like this?
> > Do you know if it's using multiple Gather[Merge] nodes and parallel
> > bitmap heap scans?  Was it a regular backend process or a parallel
> > worker process (or a Citus worker process, if that is a thing?) that
> > raised the error?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com



Re: dsa_allocate() failure

2018-05-22 Thread Sand Stone
>>dsa_allocate could not find 7 free pages
I just saw this error message again on all of my worker nodes (I am using
Citus 7.4 rel). The PG core is my own build of release_10_stable
(10.4) out of GitHub on Ubuntu.

What's the best way to debug this? I am running pre-production tests
for the next few days, so I can gather info if necessary (I cannot
pinpoint a query to repro this yet, as we have 10K queries running
concurrently).
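
Until a specific query is pinned down, a self-contained sketch (hypothetical
scratch table) for checking whether a worker node produces parallel plans at
all, and for comparing the same query with parallelism disabled:

-- Build and analyze a scratch table large enough to tempt the planner
-- into a parallel aggregate.
CREATE TABLE repro_scratch AS
    SELECT g AS id, g % 100 AS v, now() AS ts
    FROM generate_series(1, 1000000) g;
ANALYZE repro_scratch;

SET max_parallel_workers_per_gather = 2;
EXPLAIN (ANALYZE, VERBOSE)
    SELECT v, count(1) AS count, sum(id) AS sum
    FROM repro_scratch GROUP BY v;

SET max_parallel_workers_per_gather = 0;   -- same query, parallelism off
EXPLAIN (ANALYZE, VERBOSE)
    SELECT v, count(1) AS count, sum(id) AS sum
    FROM repro_scratch GROUP BY v;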




On Mon, Jan 29, 2018 at 1:35 PM, Rick Otten  wrote:
> If I do a "set max_parallel_workers_per_gather=0;" before I run the query in
> that session, it runs just fine.
> If I set it to 2, the query dies with the dsa_allocate error.
>
> I'll use that as a workaround until 10.2 comes out.  Thanks!  I have
> something that will help.
>
>
> On Mon, Jan 29, 2018 at 3:52 PM, Thomas Munro
>  wrote:
>>
>> On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane  wrote:
>> > Rick Otten  writes:
>> >> I'm wondering if there is anything I can tune in my PG 10.1 database to
>> >> avoid these errors:
>> >
>> >> $  psql -f failing_query.sql
>> >> psql:failing_query.sql:46: ERROR:  dsa_allocate could not find 7 free
>> >> pages
>> >> CONTEXT:  parallel worker
>> >
>> > Hmm.  There's only one place in the source code that emits that message
>> > text:
>> >
>> > /*
>> >  * Ask the free page manager for a run of pages.  This should always
>> >  * succeed, since both get_best_segment and make_new_segment should
>> >  * only return a non-NULL pointer if it actually contains enough
>> >  * contiguous freespace.  If it does fail, something in our backend
>> >  * private state is out of whack, so use FATAL to kill the process.
>> >  */
>> > if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
>> >     elog(FATAL,
>> >          "dsa_allocate could not find %zu free pages", npages);
>> >
>> > Now maybe that comment is being unreasonably optimistic, but it sure
>> > appears that this is supposed to be a can't-happen case, in which case
>> > you've found a bug.
>>
>> This is probably the bug fixed here:
>>
>>
>> https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org
>>
>> That was back patched, so 10.2 will contain the fix.  The bug was not
>> in dsa.c itself, but in the parallel query code that mixed up DSA
>> areas, corrupting them.  The problem comes up when the query plan has
>> multiple Gather nodes (and a particular execution pattern) -- is that
>> the case here, in the EXPLAIN output?  That seems plausible given the
>> description of a 50-branch UNION.  The only workaround until 10.2
>> would be to reduce max_parallel_workers_per_gather to 0 to prevent
>> parallelism completely for this query.
>>
>> --
>> Thomas Munro
>> http://www.enterprisedb.com
>
>
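
A sketch of the plan shape Thomas describes above (multiple Gather nodes in
one query, as with a many-branch UNION), using a hypothetical scratch table;
the thing to look for in the EXPLAIN output is more than one Gather or
Gather Merge node:

CREATE TABLE union_scratch AS
    SELECT g AS id, g % 50 AS branch
    FROM generate_series(1, 1000000) g;
ANALYZE union_scratch;

EXPLAIN
    SELECT count(1) FROM union_scratch WHERE branch = 0
    UNION ALL
    SELECT count(1) FROM union_scratch WHERE branch = 1
    UNION ALL
    SELECT count(1) FROM union_scratch WHERE branch = 2;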