Re: dsa_allocate() failure

2018-08-28 Thread Sand Stone
I attached a query (and its query plan) that caused the crash
"dsa_allocate could not find 13 free pages" on one of the worker nodes. I
anonymised the query text a bit. Interestingly, this time only one node
(always the same one) is crashing. Since this is a production environment,
I cannot get a stack trace. Once I turned off parallel execution for that
node, the whole query finished just fine. The attached parallel query plan
is therefore from one of the nodes that did not crash; hopefully the same
plan would have been executed on the crashed node. In theory, every worker
node runs the same bits and holds very similar data.
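
(For reference, a minimal sketch of how parallel execution can be turned
off on a single node. The post does not say which parameter was changed,
so the standard PostgreSQL 10 setting is assumed here rather than anything
Citus-specific:)

-- Session-level: affects only the current connection
SET max_parallel_workers_per_gather = 0;

-- Node-wide: persists in postgresql.auto.conf, then reload the config
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();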

===
psql (10.4)
\dx
                   List of installed extensions
  Name   | Version |   Schema   |            Description
---------+---------+------------+-----------------------------------
 citus   | 7.4-3   | pg_catalog | Citus distributed database
 hll     | 2.10    | public     | type for storing hyperloglog data
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language


On Sat, Aug 25, 2018 at 7:46 AM Sand Stone  wrote:

> >Can you still see the problem with Citus 7.4?
> Hi, Thomas. I actually went back to the cluster with Citus 7.4 and
> PG 10.4 and modified the parallel parameter. So far, I haven't seen any
> server crash.
>
> The main difference between the runs that crashed and the runs that did
> not is the set of Linux TCP timeout parameters (tuned to release ports
> faster). Unfortunately, I cannot "undo" the Linux parameters and run the
> stress tests anymore, as this is a multi-million dollar cluster and
> people are doing more useful things on it. I will keep an eye on any
> parallel execution issues.
>
>
> On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro
>  wrote:
> >
> > On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone 
> wrote:
> > > Just as a follow up. I tried the parallel execution again (in a stress
> > > test environment). Now the crash seems gone. I will keep an eye on
> > > this for the next few weeks.
> >
> > Thanks for the report.  That's great news, but it'd be good to
> > understand why it was happening.
> >
> > > My theory is that the Citus cluster created and shut down a lot of TCP
> > > connections between coordinator and workers. If running on untuned
> > > Linux machines, the TCP ports might run out.
> >
> > I'm not sure how that's relevant, unless perhaps it causes executor
> > nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
> > fix?  I wonder if there could be something different about the control
> > flow with custom scans, or something about the way Citus worker nodes
> > invoke plan fragments, or some error path that I failed to consider...
> > It's a clue that all of your worker nodes reliably crashed at the same
> > time on the same/similar queries (presumably distributed query
> > fragments for different shards), making it seem more like a
> > common-or-garden bug rather than some kind of timing-based heisenbug.
> > If you ever manage to reproduce it, an explain plan and a back trace
> > would be very useful.
> >
> > > Of course, I am using "newer" PG10 bits and Citus7.5 this time.
> >
> > Hmm.  There weren't any relevant commits to REL_10_STABLE that I can
> > think of.  And (with the proviso that I know next to nothing about
> > Citus) I just cloned https://github.com/citusdata/citus.git and
> > skimmed through "git diff origin/release-7.4..origin/release-7.5", and
> > nothing is jumping out at me.  Can you still see the problem with
> > Citus 7.4?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com
>
explain (analyze, verbose)
select
  atimecol as atimecol, count(1) as count, sum(avg) as sum
from
(
  select
    partitionid, atimecol as atimecol, avg(sum/count) as avg
  from atable
  where atimecol >= '7/31/2018' and
        atimecol <= '8/28/2018' and
        aText = 'Foo' and
        (jsoncol ->> 'X') = '-' and
        id = 12345 and
        acol = 2 and
        btimecol = '7/31/2018' and
        btimecol < '8/29/2018'
  group by partitionid, atimecol
) as subquery
group by atimecol;


                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=535.553..535.553 rows=0 loops=1)
   Output: remote_scan.atimecol, COALESCE((pg_catalog.sum(remote_scan.count))::bigint, '0'::bigint), sum(remote_scan.sum)
   Group Key: remote_scan.atimecol
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=535.551..535.551 rows=0 loops=1)

Extremely slow when query uses GIST exclusion index

2018-08-28 Thread David
Hi. My databases make heavy use of timestamp ranges, and they rely on GIST
exclusion constraints to ensure that the ranges are disjoint. I've noticed
that queries that hit the GIST indexes are EXTREMELY slow, and the queries
run much faster if I make trivial changes to avoid the GIST indexes.

Here's the setup for a test case. (Timings were collected on PostgreSQL
9.5.4 on x86_64-pc-linux-gnu, Intel Xeon E5 in a VM with 3.5 GB RAM):

CREATE TABLE app (
  pk SERIAL PRIMARY KEY,
  group_id TEXT NOT NULL,
  app_time TIMESTAMPTZ NOT NULL
);

CREATE TABLE group_span (
  pk SERIAL PRIMARY KEY,
  group_id TEXT NOT NULL,
  valid_period TSTZRANGE NOT NULL,
  EXCLUDE USING GIST (group_id WITH =, valid_period WITH &&)
);

CREATE TABLE member_span (
  pk SERIAL PRIMARY KEY,
  member_id TEXT NOT NULL,
  group_id TEXT NOT NULL,
  valid_period TSTZRANGE NOT NULL,
  EXCLUDE USING GIST
(member_id WITH =, group_id WITH =, valid_period WITH &&)
);
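
(One note on this setup: because group_id and member_id are plain text
columns, the = operators in these GIST exclusion constraints rely on the
btree_gist extension. It was presumably already installed in the original
environment; if not, it has to be created before the tables above:)

-- Provides GIST operator classes for scalar types such as text,
-- which the "WITH =" parts of the exclusion constraints above need
CREATE EXTENSION IF NOT EXISTS btree_gist;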

-- Fill tables with some random data

INSERT INTO app (group_id, app_time)
SELECT
  MD5(CONCAT(GENERATE_SERIES(1, 10000), RANDOM())),
  DATE_TRUNC('month', TIMESTAMPTZ '2000-01-01' +
             INTERVAL '3 years' * RANDOM());

-- Give groups a 1-year span, and give some groups a 2nd-year span:
INSERT INTO group_span (group_id, valid_period)
(SELECT
   group_id,
   TSTZRANGE(app_time, app_time + INTERVAL '1 year')
 FROM app)
UNION ALL
(SELECT
   group_id,
   TSTZRANGE(app_time + INTERVAL '1 year',
 app_time + INTERVAL '2 year')
 FROM app LIMIT 2000);

-- Create members with a random span within their group_span:
INSERT INTO member_span (member_id, group_id, valid_period)
SELECT
  MD5(RANDOM()::TEXT),
  group_id,
  TSTZRANGE(
LOWER(valid_period),
UPPER(valid_period) - DATE_TRUNC(
  'days',
  (UPPER(valid_period) - LOWER(valid_period)) * RANDOM()
)
  )
FROM group_span;
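
(The original post does not show a statistics step. When reproducing the
test case it may be worth analyzing the freshly populated tables first, so
that the plan comparison below is not skewed by missing statistics; this is
an addition, not part of the original setup:)

-- Refresh planner statistics for the tables populated above
ANALYZE app;
ANALYZE group_span;
ANALYZE member_span;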


Given this setup, here's a query that hits the GIST exclusion index on the
"member_span" table. It takes 38 sec on my machine:

SELECT *
FROM app
JOIN group_span ON
  app.group_id = group_span.group_id AND
  app.app_time <@ group_span.valid_period
JOIN member_span ON
  group_span.group_id = member_span.group_id AND
  group_span.valid_period && member_span.valid_period;

Here's the query plan for that query:

Nested Loop  (cost=319.27..776.39 rows=1 width=196) (actual time=15.370..38406.466 rows=1 loops=1)
  Join Filter: (app.group_id = member_span.group_id)
  ->  Hash Join  (cost=319.00..771.00 rows=12 width=104) (actual time=5.790..130.613 rows=1 loops=1)
        Hash Cond: (group_span.group_id = app.group_id)
        Join Filter: (app.app_time <@ group_span.valid_period)
        Rows Removed by Join Filter: 2000
        ->  Seq Scan on group_span  (cost=0.00..257.00 rows=12000 width=59) (actual time=0.005..16.282 rows=12000 loops=1)
        ->  Hash  (cost=194.00..194.00 rows=1 width=45) (actual time=5.758..5.758 rows=1 loops=1)
              Buckets: 16384  Batches: 1  Memory Usage: 910kB
              ->  Seq Scan on app  (cost=0.00..194.00 rows=1 width=45) (actual time=0.002..2.426 rows=1 loops=1)
  ->  Index Scan using member_span_member_id_group_id_valid_period_excl on member_span  (cost=0.28..0.44 rows=1 width=92) (actual time=1.988..3.817 rows=1 loops=1)
        Index Cond: ((group_id = group_span.group_id) AND (group_span.valid_period && valid_period))
Planning time: 0.784 ms
Execution time: 38410.227 ms

We can make a small tweak to the query to make it complicated enough that
the planner avoids the GIST index. In this particular case, we can replace
"app.app_time <@ group_span.valid_period" with the equivalent
"app.app_time >= LOWER(group_span.valid_period) AND
app.app_time < UPPER(group_span.valid_period)". This equivalent query is
MUCH faster:

SELECT *
FROM app
JOIN group_span ON
  app.group_id = group_span.group_id AND
  app.app_time >= LOWER(group_span.valid_period) AND
  app.app_time < UPPER(group_span.valid_period)
JOIN member_span ON
  group_span.group_id = member_span.group_id AND
  group_span.valid_period && member_span.valid_period;

It only takes 86 ms, even though it's doing 3 seq scans instead of using
the index:

Hash Join  (cost=953.71..1186.65 rows=8 width=196) (actual time=58.364..84.706 rows=1 loops=1)
  Hash Cond: (app.group_id = group_span.group_id)
  Join Filter: ((app.app_time >= lower(group_span.valid_period)) AND (app.app_time < upper(group_span.valid_period)))
  Rows Removed by Join Filter: 2000
  ->  Seq Scan on app  (cost=0.00..194.00 rows=1 width=45) (actual time=0.007..2.391 rows=1 loops=1)
  ->  Hash  (cost=952.81..952.81 rows=72 width=151) (actual time=58.343..58.343 rows=12000 loops=1)
        Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 2285kB
        ->  Hash Join  (cost=407.00..952.81 rows=72 width=151) (actual time=15.048..44.103 rows=12000 loops=1)
              Hash Cond: (member_span.group_id = group_span.group_id)
              Join Filter: