Re: dsa_allocate() failure
I attached a query (and its query plan) that caused the crash: "dsa_allocate could not find 13 free pages" on one of the worker nodes. I anonymised the query text a bit. Interestingly, this time only one of the nodes (the same one each time) is crashing. Since this is a production environment, I cannot get the stack trace. Once I turned off parallel execution for this node, the whole query finished just fine. So the parallel query plan is from one of the nodes that did not crash; hopefully the same plan would have been executed on the crashed node. In theory, every worker node has the same bits and very similar data.

===
psql (10.4)
\dx
                 List of installed extensions
  Name   | Version |   Schema   |            Description
---------+---------+------------+-----------------------------------
 citus   | 7.4-3   | pg_catalog | Citus distributed database
 hll     | 2.10    | public     | type for storing hyperloglog data
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language

On Sat, Aug 25, 2018 at 7:46 AM Sand Stone wrote:
> >Can you still see the problem with Citus 7.4?
> Hi, Thomas. I actually went back to the cluster with Citus 7.4 and
> PG10.4, and modified the parallel param. So far, I haven't seen any
> server crash.
>
> The main difference between the crashes observed and no crash is the set
> of Linux TCP timeout parameters (to release the ports faster).
> Unfortunately, I cannot "undo" the Linux params and run the stress
> tests anymore, as this is a multi-million $ cluster and people are
> doing more useful things on it. I will keep an eye on any parallel
> execution issue.
>
> On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro wrote:
> >
> > On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone wrote:
> > > Just as a follow up. I tried the parallel execution again (in a stress
> > > test environment). Now the crash seems gone. I will keep an eye on
> > > this for the next few weeks.
> >
> > Thanks for the report. That's great news, but it'd be good to
> > understand why it was happening.
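For reference, the exact "parallel param" the poster changed isn't stated; a minimal sketch of the usual way to disable parallel execution on a single PG 10 node (an assumption on my part, using the standard planner knob):

```sql
-- Assumption: the "parallel param" is the standard PG 10 knob;
-- setting it to 0 stops the planner from generating parallel plans.
SET max_parallel_workers_per_gather = 0;   -- current session only

-- Or persist it for the whole node and reload the configuration:
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();
```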
> >
> > > My theory is that the Citus cluster created and shut down a lot of TCP
> > > connections between coordinator and workers. If running on untuned
> > > Linux machines, the TCP ports might run out.
> >
> > I'm not sure how that's relevant, unless perhaps it causes executor
> > nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
> > fix? I wonder if there could be something different about the control
> > flow with custom scans, or something about the way Citus worker nodes
> > invoke plan fragments, or some error path that I failed to consider...
> > It's a clue that all of your worker nodes reliably crashed at the same
> > time on the same/similar queries (presumably distributed query
> > fragments for different shards), making it seem more like a
> > common-or-garden bug rather than some kind of timing-based heisenbug.
> > If you ever manage to reproduce it, an explain plan and a back trace
> > would be very useful.
> >
> > > Of course, I am using "newer" PG10 bits and Citus7.5 this time.
> >
> > Hmm. There weren't any relevant commits to REL_10_STABLE that I can
> > think of. And (with the proviso that I know next to nothing about
> > Citus) I just cloned https://github.com/citusdata/citus.git and
> > skimmed through "git diff origin/release-7.4..origin/release-7.5", and
> > nothing is jumping out at me. Can you still see the problem with
> > Citus 7.4?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com

explain(analyze, verbose)
select atimecol as atimecol, count(1) as count, sum(avg) as sum
from (
    select partitionid, atimecol as atimecol, avg(sum/count) as avg
    from atable
    where atimecol >= '7/31/2018' and atimecol <= '8/28/2018'
      and aText = 'Foo'
      and (jsoncol ->> 'X') = '-'
      and id=12345 and acol = 2
      and btimecol='7/31/2018' and btimecol<'8/29/2018'
    group by partitionid, atimecol
) as subquery
group by atimecol;

                                      QUERY PLAN
------------------------------------------------------------------------------------------
HashAggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=535.553..535.553 rows=0 loops=1)
  Output: remote_scan.atimecol, COALESCE((pg_catalog.sum(remote_scan.count))::bigint, '0'::bigint), sum(remote_scan.sum)
  Group Key: remote_scan.atimecol
  ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=535.551..535.551 rows=0 loops=1)
Extremely slow when query uses GIST exclusion index
Hi. My databases make heavy use of timestamp ranges, and they rely on GIST exclusion constraints to ensure that the ranges are disjoint. I've noticed that queries that hit the GIST indexes are EXTREMELY slow, and the queries run much faster if I make trivial changes to avoid the GIST indexes. Here's the setup for a test case. (Timings were collected on PostgreSQL 9.5.4 on x86_64-pc-linux-gnu, Intel Xeon E5 in a VM with 3.5 GB RAM):

CREATE TABLE app (
    pk SERIAL PRIMARY KEY,
    group_id TEXT NOT NULL,
    app_time TIMESTAMPTZ NOT NULL
);

CREATE TABLE group_span (
    pk SERIAL PRIMARY KEY,
    group_id TEXT NOT NULL,
    valid_period TSTZRANGE NOT NULL,
    EXCLUDE USING GIST (group_id WITH =, valid_period WITH &&)
);

CREATE TABLE member_span (
    pk SERIAL PRIMARY KEY,
    member_id TEXT NOT NULL,
    group_id TEXT NOT NULL,
    valid_period TSTZRANGE NOT NULL,
    EXCLUDE USING GIST (member_id WITH =, group_id WITH =, valid_period WITH &&)
);

-- Fill tables with some random data
INSERT INTO app (group_id, app_time)
SELECT
    MD5(CONCAT(GENERATE_SERIES(1, 1), RANDOM())),
    DATE_TRUNC('month', TIMESTAMPTZ '2000-01-01' + INTERVAL '3 years' * RANDOM());

-- Give groups a 1-year span, and give some groups a 2nd-year span:
INSERT INTO group_span (group_id, valid_period)
(SELECT group_id, TSTZRANGE(app_time, app_time + INTERVAL '1 year')
 FROM app)
UNION ALL
(SELECT group_id, TSTZRANGE(app_time + INTERVAL '1 year', app_time + INTERVAL '2 year')
 FROM app
 LIMIT 2000);

-- Create members with a random span within their group_span:
INSERT INTO member_span (member_id, group_id, valid_period)
SELECT
    MD5(RANDOM()::TEXT),
    group_id,
    TSTZRANGE(
        LOWER(valid_period),
        UPPER(valid_period) - DATE_TRUNC(
            'days',
            (UPPER(valid_period) - LOWER(valid_period)) * RANDOM()
        )
    )
FROM group_span;

Given this setup, here's a query that hits the GIST exclusion index on the "member_span" table.
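As an aside on the schema above, the EXCLUDE USING GIST constraints are what keep the ranges disjoint per group. A sketch of the behavior, with made-up values purely for illustration:

```sql
-- Hypothetical rows, for illustration only.
INSERT INTO group_span (group_id, valid_period)
VALUES ('g1', '[2000-01-01,2001-01-01)');

-- This row overlaps the one above for the same group_id, so the
-- exclusion constraint rejects it with a conflict error (SQLSTATE 23P01):
INSERT INTO group_span (group_id, valid_period)
VALUES ('g1', '[2000-06-01,2000-09-01)');
```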
It takes 38 sec on my machine:

SELECT *
FROM app
    JOIN group_span ON
        app.group_id = group_span.group_id AND
        app.app_time <@ group_span.valid_period
    JOIN member_span ON
        group_span.group_id = member_span.group_id AND
        group_span.valid_period && member_span.valid_period;

Here's the query plan for that query:

Nested Loop  (cost=319.27..776.39 rows=1 width=196) (actual time=15.370..38406.466 rows=1 loops=1)
  Join Filter: (app.group_id = member_span.group_id)
  ->  Hash Join  (cost=319.00..771.00 rows=12 width=104) (actual time=5.790..130.613 rows=1 loops=1)
        Hash Cond: (group_span.group_id = app.group_id)
        Join Filter: (app.app_time <@ group_span.valid_period)
        Rows Removed by Join Filter: 2000
        ->  Seq Scan on group_span  (cost=0.00..257.00 rows=12000 width=59) (actual time=0.005..16.282 rows=12000 loops=1)
        ->  Hash  (cost=194.00..194.00 rows=1 width=45) (actual time=5.758..5.758 rows=1 loops=1)
              Buckets: 16384  Batches: 1  Memory Usage: 910kB
              ->  Seq Scan on app  (cost=0.00..194.00 rows=1 width=45) (actual time=0.002..2.426 rows=1 loops=1)
  ->  Index Scan using member_span_member_id_group_id_valid_period_excl on member_span  (cost=0.28..0.44 rows=1 width=92) (actual time=1.988..3.817 rows=1 loops=1)
        Index Cond: ((group_id = group_span.group_id) AND (group_span.valid_period && valid_period))
Planning time: 0.784 ms
Execution time: 38410.227 ms

We can make a small tweak to the query to make it complicated enough that the execution planner avoids the GIST index. In this particular case, we can replace "app.app_time <@ group_span.valid_period" with the equivalent "app.app_time >= LOWER(group_span.valid_period) AND app.app_time < UPPER(group_span.valid_period)".
This equivalent query is MUCH faster:

SELECT *
FROM app
    JOIN group_span ON
        app.group_id = group_span.group_id AND
        app.app_time >= LOWER(group_span.valid_period) AND
        app.app_time < UPPER(group_span.valid_period)
    JOIN member_span ON
        group_span.group_id = member_span.group_id AND
        group_span.valid_period && member_span.valid_period;

It only takes 86 ms, even though it's doing 3 seq scans instead of using the index:

Hash Join  (cost=953.71..1186.65 rows=8 width=196) (actual time=58.364..84.706 rows=1 loops=1)
  Hash Cond: (app.group_id = group_span.group_id)
  Join Filter: ((app.app_time >= lower(group_span.valid_period)) AND (app.app_time < upper(group_span.valid_period)))
  Rows Removed by Join Filter: 2000
  ->  Seq Scan on app  (cost=0.00..194.00 rows=1 width=45) (actual time=0.007..2.391 rows=1 loops=1)
  ->  Hash  (cost=952.81..952.81 rows=72 width=151) (actual time=58.343..58.343 rows=12000 loops=1)
        Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 2285kB
        ->  Hash Join  (cost=407.00..952.81 rows=72 width=151) (actual time=15.048..44.103 rows=12000 loops=1)
              Hash Cond: (member_span.group_id = group_span.group_id)
              Join Filter:
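One caveat worth noting on the rewrite: replacing "<@" with LOWER()/UPPER() comparisons using ">=" and "<" is only equivalent for ranges with the default '[)' bounds (inclusive lower, exclusive upper), which is what the two-argument TSTZRANGE(a, b) constructor produces for the data above. A quick sanity check of the equivalence (illustrative values):

```sql
-- Assumes default '[)' bounds, as produced by TSTZRANGE(a, b).
SELECT
    TIMESTAMPTZ '2000-06-01' <@ TSTZRANGE('2000-01-01', '2001-01-01')
        AS via_contains,
    TIMESTAMPTZ '2000-06-01' >= LOWER(TSTZRANGE('2000-01-01', '2001-01-01'))
    AND TIMESTAMPTZ '2000-06-01' < UPPER(TSTZRANGE('2000-01-01', '2001-01-01'))
        AS via_bounds;
-- Both columns agree for any bounded '[)' range; ranges with other
-- bound specifications would need the comparisons adjusted.
```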