Re: [HACKERS] Accounting for toast in query planner. (gin/gist indexes).

2011-12-01 Thread jesper
 Jesper Krogh jes...@krogh.cc writes:
 I have hit a problem and dug into finding the cause: in particular,
 searching in GIN indexes seems in some situations to unfairly favor
 sequential scans.

 Googling a bit I found this page:
 http://postgis.refractions.net/docs/ch06.html#id2635817
 It describes the exact problem.

 It seems to have been discussed back in the pre-8.1 days and wasn't
 solved then; is there a chance someone might address it in 9.2?
 http://archives.postgresql.org/pgsql-performance/2005-02/msg00041.php

 Don't hold your breath.  There's a huge amount of work to do there,
 and nobody's even thinking about it.  The problem has actually gotten
 worse since 8.1, because the index access patterns are more complex
 and the planner has less not more info about them (because we pushed
 the determination of what's lossy to run time).  Accumulating stats
 about how toasty the heap tuples are would be only the first step
 towards producing better cost estimates here --- you'd also need some
 way of estimating how lossy an index is, for instance, so you could
 guess how many heap tuples will be visited.  And on top of that,
 for a lot of these complicated operators even our estimates of the final
 number of matches are pretty crummy.  I'm not sure if having an explicit
 model of what's happening inside the index would help.  Maybe it would,
 since the condition actually being tested by the index is probably
 simpler than the operator proper, but right now the planner has no clue
 that they're different.

I don't understand the details about lossiness, but even with the changes
to GIN index cost estimation that went into 9.1 (which made the situation
quite a lot better), we still favor sequential scans in situations where
that is a really bad idea. The cost of filtering on something that is huge
and lives in TOAST is not accounted for.

Taking the same dataset, the worst situation is a query like "this key
exists in essentially all of the dataset; how many rows are there?". The
alternative query plans are here:

2011-12-01 09:59:46.428 jk=# explain (analyze on, buffers on) select count(id) from ftstest where fts @@ to_tsquery('english','test1');
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=9591.79..9591.80 rows=1 width=4) (actual time=109.084..109.084 rows=1 loops=1)
   Buffers: shared hit=1 read=4595
   ->  Bitmap Heap Scan on ftstest  (cost=2149.59..9091.92 rows=199947 width=4) (actual time=38.613..87.784 rows=199939 loops=1)
         Recheck Cond: (fts @@ '''test1'''::tsquery)
         Buffers: shared hit=1 read=4595
         ->  Bitmap Index Scan on ftstest_fts_idx1  (cost=0.00..2099.60 rows=199947 width=0) (actual time=37.847..37.847 rows=199939 loops=1)
               Index Cond: (fts @@ '''test1'''::tsquery)
               Buffers: shared hit=1 read=152
 Total runtime: 109.129 ms
(9 rows)

Time: 109.842 ms
2011-12-01 10:02:19.047 jk=# set enable_seqscan to on;
SET
Time: 0.122 ms
2011-12-01 10:02:23.657 jk=# explain (analyze on, buffers on) select count(id) from ftstest where fts @@ to_tsquery('english','test1');
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7442.87..7442.88 rows=1 width=4) (actual time=25266.047..25266.048 rows=1 loops=1)
   Buffers: shared hit=651448 read=561776
   ->  Seq Scan on ftstest  (cost=0.00..6943.00 rows=199947 width=4) (actual time=0.015..25177.959 rows=199939 loops=1)
         Filter: (fts @@ '''test1'''::tsquery)
         Rows Removed by Filter: 61
         Buffers: shared hit=651448 read=561776
 Total runtime: 25266.080 ms
(7 rows)

Time: 25266.740 ms
2011-12-01 10:02:49.936 jk=#

For FTS it hurts twice, since the tsvector column typically lives in TOAST
(because it is large) and is rarely in memory (because it is rarely returned
to the user or used for anything other than building the index).

As can be seen, the latter (default) plan touches 651448 + 561776 buffers,
while the former touches only 4596. Would just naively accounting for the
needed TOAST accesses be enough to level that difference out?
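For reference, a rough way to see how much of ftstest actually lives out of
line, using the standard size functions (the remainder here is mostly the
TOAST data):

-- main heap vs. indexes vs. everything else (mostly the TOAST table)
select pg_size_pretty(pg_relation_size('ftstest'))        as main_heap,
       pg_size_pretty(pg_indexes_size('ftstest'))         as indexes,
       pg_size_pretty(pg_total_relation_size('ftstest')
                      - pg_relation_size('ftstest')
                      - pg_indexes_size('ftstest'))       as toast_and_other;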

Secondly, I could bump the default cost of ts_match_vq/ts_match_qv up a
bit, since that computation is probably not as cheap as an ordinary
boolean function.
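Something along these lines would presumably be the knob (the COST value of
100 is just an illustrative guess, not a tuned number):

-- raise the planner's per-call cost estimate for the text-search match functions
alter function ts_match_vq(tsvector, tsquery) cost 100;
alter function ts_match_qv(tsquery, tsvector) cost 100;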

I'm not trying to argue that the solution is simple, since my insight there
is limited, only that the problem forces me to drop "set enable_seqscan to
off" into random places in the application code (which I would definitely
prefer to do without).
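One way to at least confine that override to a single query, rather than
leaving it set for the whole session, is to wrap it in a transaction with
SET LOCAL; a sketch:

begin;
set local enable_seqscan = off;   -- reverts automatically at commit or rollback
select count(id) from ftstest where fts @@ to_tsquery('english','test1');
commit;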

-- 
Jesper




Re: [HACKERS] Accounting for toast in query planner. (gin/gist indexes).

2011-12-01 Thread Tom Lane
jes...@krogh.cc writes:
 Secondly, I could bump the default cost of ts_match_vq/ts_match_qv up a
 bit, since that computation is probably not as cheap as an ordinary
 boolean function.

Actually, you could try bumping their costs up by more than a bit.
It's a tad unfair to blame the toast access costs on the functions,
but it might get close enough to what you need.  If memory serves,
we don't charge those costs against index scans.
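For reference, the current per-call cost estimates the planner uses for
those functions can be read from pg_proc before bumping them, e.g.:

select proname, procost
  from pg_proc
 where proname in ('ts_match_vq', 'ts_match_qv');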

regards, tom lane



[HACKERS] Accounting for toast in query planner. (gin/gist indexes).

2011-11-30 Thread Jesper Krogh

Hi list.

I have hit a problem and dug into finding the cause: in particular,
searching in GIN indexes seems in some situations to unfairly favor
sequential scans.

Googling a bit I found this page:
http://postgis.refractions.net/docs/ch06.html#id2635817

It describes the exact problem.

It seems to have been discussed back in the pre-8.1 days and wasn't
solved then; is there a chance someone might address it in 9.2?

http://archives.postgresql.org/pgsql-performance/2005-02/msg00041.php

Would you call it a hard task, or could a fairly naive C coder with
a fair amount of PG experience approach it?

Test-dataset can be created with:

CREATE TABLE ftstest (id serial unique, fts tsvector);

DO $$
DECLARE r RECORD;
BEGIN
  FOR r IN SELECT generate_series(1,5000)
  LOOP
    INSERT INTO ftstest(fts)
      (SELECT strip(to_tsvector('english', string_agg(test, ' ')))
         FROM (SELECT 'test' || generate_series(1, (SELECT (random()*1)::int)) AS test) AS foo);
  END LOOP;
END;
$$;

CREATE INDEX ON ftstest USING gin(fts);
ANALYZE;



2011-11-30 21:13:30.302 jktest=# explain ( buffers on, analyze on ) select count(id) from ftstest where fts @@ to_tsquery('english','test500');
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=122.37..122.38 rows=1 width=4) (actual time=1114.096..1114.097 rows=1 loops=1)
   Buffers: shared hit=13384 read=24445 written=3002
   ->  Seq Scan on ftstest  (cost=0.00..110.50 rows=4748 width=4) (actual time=0.567..1112.447 rows=4748 loops=1)
         Filter: (fts @@ '''test500'''::tsquery)
         Rows Removed by Filter: 252
         Buffers: shared hit=13384 read=24445 written=3002
 Total runtime: 1114.134 ms
(7 rows)

Time: 1114.945 ms
2011-11-30 21:14:30.382 jktest=# set enable_seqscan to off;
SET
Time: 0.132 ms
2011-11-30 21:14:50.965 jktest=# explain ( buffers on, analyze on ) select count(id) from ftstest where fts @@ to_tsquery('english','test500');
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=184.02..184.03 rows=1 width=4) (actual time=2.502..2.502 rows=1 loops=1)
   Buffers: shared hit=1 read=56 written=3
   ->  Bitmap Heap Scan on ftstest  (cost=64.80..172.15 rows=4748 width=4) (actual time=1.160..1.989 rows=4748 loops=1)
         Recheck Cond: (fts @@ '''test500'''::tsquery)
         Buffers: shared hit=1 read=56 written=3
         ->  Bitmap Index Scan on ftstest_fts_idx  (cost=0.00..63.61 rows=4748 width=0) (actual time=1.137..1.137 rows=4748 loops=1)
               Index Cond: (fts @@ '''test500'''::tsquery)
               Buffers: shared hit=1 read=8
 Total runtime: 2.529 ms
(9 rows)

Time: 3.016 ms


--
Jesper



Re: [HACKERS] Accounting for toast in query planner. (gin/gist indexes).

2011-11-30 Thread Tom Lane
Jesper Krogh jes...@krogh.cc writes:
 I have hit a problem and dug into finding the cause: in particular,
 searching in GIN indexes seems in some situations to unfairly favor
 sequential scans.

 Googling a bit I found this page:
 http://postgis.refractions.net/docs/ch06.html#id2635817
 It describes the exact problem.

 It seems to have been discussed back in the pre-8.1 days and wasn't
 solved then; is there a chance someone might address it in 9.2?
 http://archives.postgresql.org/pgsql-performance/2005-02/msg00041.php

Don't hold your breath.  There's a huge amount of work to do there,
and nobody's even thinking about it.  The problem has actually gotten
worse since 8.1, because the index access patterns are more complex
and the planner has less not more info about them (because we pushed
the determination of what's lossy to run time).  Accumulating stats
about how toasty the heap tuples are would be only the first step
towards producing better cost estimates here --- you'd also need some
way of estimating how lossy an index is, for instance, so you could
guess how many heap tuples will be visited.  And on top of that,
for a lot of these complicated operators even our estimates of the final
number of matches are pretty crummy.  I'm not sure if having an explicit
model of what's happening inside the index would help.  Maybe it would,
since the condition actually being tested by the index is probably
simpler than the operator proper, but right now the planner has no clue
that they're different.

 Would you call it a hard task, or could a fairly naive C coder with
 a fair amount of PG experience approach it?

We're a long way from needing C coding here.  The first thing would be
to come up with a design for what we're going to model, what statistical
estimates have to be made along the way, and how we can compartmentalize
the knowledge needed given that we want operator classes and index
access methods to be pluggable.

regards, tom lane
