Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-29 Thread Kevin Grittner
Stefan Keller sfkel...@gmail.com wrote:

 Finally, setting random_page_cost to 1 also helps - but I don't
 like that setting either.

Well, you should learn to like whichever settings best model your
actual costs given your level of caching and your workload.  ;-)
FWIW, I have found page costs less volatile and easier to tune
with cpu_tuple_cost increased.  I just always start by bumping
that to 0.03.
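
For example, just as a starting point for experimenting (a sketch using
Stefan's query from this thread; substitute whatever statement you are tuning):

SET cpu_tuple_cost = 0.03;   -- session-local, so nothing global changes
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM FullTextSearch
WHERE content_tsv_gin @@ plainto_tsquery('english', 'good');
RESET cpu_tuple_cost;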

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-29 Thread Stefan Keller
Hi Kevin

Well, you're right :-) But my use cases are unspecific by design,
since I'm using FTS as a general-purpose function.

So I still propose enhancing the planner as well, as Tom Lane and your
colleague have suggested, based on repeated similar complaints [1].

Yours, Stefan

[1] 
http://www.postgresql.org/message-id/ca+tgmozgqbeu2kn305hwds+axw7yp0yn9vzwbsbwa8unst+...@mail.gmail.com


2013/7/29 Kevin Grittner kgri...@ymail.com:
 Stefan Keller sfkel...@gmail.com wrote:

 Finally, setting random_page_cost to 1 also helps - but I don't
 like that setting either.

 Well, you should learn to like whichever settings best model your
 actual costs given your level of caching and your workload.  ;-)
 FWIW, I have found page costs less volatile and easier to tune
 with cpu_tuple_cost increased.  I just always start by bumping
 that to 0.03.

 --
 Kevin Grittner
 EDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-20 Thread Marc Mamin



From: Stefan Keller [sfkel...@gmail.com]
Sent: Saturday, 20 July 2013, 01:55

Hi Marc

Thanks a lot for your hint!

You mean doing a SET track_counts = true; for the whole session?

No,
I mean

ALTER TABLE table ALTER ts_vector_column SET STATISTICS 0;

and then remove the existing statistics:

DELETE FROM pg_catalog.pg_statistic
WHERE starelid = 'table'::regclass
  AND staattnum = (SELECT attnum FROM pg_attribute
                   WHERE attrelid = 'table'::regclass
                     AND attname = 'ts_vector_column'::name);
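
One way to verify the statistics are gone afterwards (a sketch; 'table' and
'ts_vector_column' are placeholders as above):

SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'table'
  AND attname = 'ts_vector_column';
-- should return no rows once the pg_statistic entry is deleted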

But you should first try to find out what proportion of your ts queries are
faster with a table scan, as table scans will probably no longer be chosen
afterwards!
(Except if further columns on your table 'FullTextSearch' are considered by the
planner.)
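
For instance, a rough way to compare the two plans for a representative query
(a sketch, not necessarily how you would run the full comparison):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM FullTextSearch
WHERE content_tsv_gin @@ plainto_tsquery('english', 'good');

SET enable_seqscan = off;   -- force the index path, for comparison only
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM FullTextSearch
WHERE content_tsv_gin @@ plainto_tsquery('english', 'good');
RESET enable_seqscan;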




That would be OK if it were possible just for the GIN index.

It's obviously an issue of the planner's cost estimation.
The data I'm speaking about (movies) has a text attribute with
a length of more than 8 KB, so it obviously has to do with
detoasting.
But the thoughts about the @@ operator together with this GIN index seem
to be valid as well.

I hope this issue is being tracked in preparation for 9.3.

Regards, Stefan


2013/7/19 Marc Mamin m.ma...@intershop.de:

 SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
 plainto_tsquery('english', 'good');

 It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB
 zipped).
 The planner obviously always chooses a table scan.


 Hello,

 A probable reason for the time difference is the cost for decompressing 
 toasted content.
 At least in 8.3, the planner was not good at estimating it.

 I'm getting better overall performance since I stopped collecting statistics
 on tsvectors.
 An alternative would have been to disallow compression on them.

 I'm aware this is a drastic way and would not recommend it without testing. 
 The benefit may depend on the type of data you are indexing.
 In our use case these are error logs with many Java stack traces, hence with
 many poorly discriminative lexemes.

 see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us
 as a comment on
 http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net

 regards,

 Marc Mamin




Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-19 Thread Marc Mamin

 SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
 plainto_tsquery('english', 'good');

 It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB zipped).
 The planner obviously always chooses a table scan.


Hello,

A probable reason for the time difference is the cost for decompressing toasted 
content.
At least in 8.3, the planner was not good at estimating it.

I'm getting better overall performance since I stopped collecting statistics on
tsvectors.
An alternative would have been to disallow compression on them.
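
For reference, disallowing compression on such a column would look roughly like
this (a sketch using Stefan's column name, untested here):

ALTER TABLE FullTextSearch ALTER COLUMN content_tsv_gin SET STORAGE EXTERNAL;
-- EXTERNAL keeps large values out-of-line in TOAST but never compresses them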

I'm aware this is a drastic way and would not recommend it without testing. The 
benefit may depend on the type of data you are indexing.
In our use case these are error logs with many Java stack traces, hence with
many poorly discriminative lexemes.

see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us
as a comment on
http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net

regards,

Marc Mamin



Re: [PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-19 Thread Stefan Keller
Hi Marc

Thanks a lot for your hint!

You mean doing a SET track_counts = true; for the whole session?
That would be OK if it were possible just for the GIN index.

It's obviously an issue of the planner's cost estimation.
The data I'm speaking about (movies) has a text attribute with
a length of more than 8 KB, so it obviously has to do with
detoasting.
But the thoughts about the @@ operator together with this GIN index seem
to be valid as well.

I hope this issue is being tracked in preparation for 9.3.

Regards, Stefan


2013/7/19 Marc Mamin m.ma...@intershop.de:

 SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
 plainto_tsquery('english', 'good');

 It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB
 zipped).
 The planner obviously always chooses a table scan.


 Hello,

 A probable reason for the time difference is the cost for decompressing 
 toasted content.
 At least in 8.3, the planner was not good at estimating it.

 I'm getting better overall performance since I stopped collecting statistics
 on tsvectors.
 An alternative would have been to disallow compression on them.

 I'm aware this is a drastic way and would not recommend it without testing. 
 The benefit may depend on the type of data you are indexing.
 In our use case these are error logs with many Java stack traces, hence with
 many poorly discriminative lexemes.

 see: http://www.postgresql.org/message-id/27953.1329434...@sss.pgh.pa.us
 as a comment on
 http://www.postgresql.org/message-id/c4dac901169b624f933534a26ed7df310861b...@jenmail01.ad.intershop.net

 regards,

 Marc Mamin




[PERFORM] FTS performance issue - planner problem identified (but only partially resolved)

2013-07-18 Thread Stefan Keller
Hi

On 2013-02-08 I wrote:
 I have problems with the performance of FTS in a query like this:

 SELECT * FROM FullTextSearch WHERE content_tsv_gin @@
 plainto_tsquery('english', 'good');

 It's slow (> 30 sec.) for some GB (27886 html files, originally 73 MB zipped).
 The planner obviously always chooses a table scan.

Now I've identified (but only partially resolved) the issue. Here are
my comments:

That's the query in question (see the commented log below):

select id,title,left(content,100)
from fulltextsearch
where plainto_tsquery('pg_catalog.english','good') @@
to_tsvector('pg_catalog.english',content);

After having created the GIN index, the FTS query is unexpectedly fast
because the planner chooses a Bitmap Index Scan. After the index
statistics have been updated, the same query becomes slow - it only stays
fast when using the trick of moving the function call out of the WHERE
clause (see below). I think GiST doesn't change anything.

select id,title,left(content,100)
from fulltextsearch, plainto_tsquery('pg_catalog.english','good') query
where query @@ to_tsvector('pg_catalog.english',content);

= This hint should be mentioned in the docs!

Then, setting enable_seqscan to off makes the original query fast again.
But that's a setting I want to avoid in a multi-user database.
Finally, setting random_page_cost to 1 also helps - but I don't like
that setting either.
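
If one did want such a setting without making it global, it could at least be
scoped per role or per database rather than set in postgresql.conf - a sketch,
with a hypothetical role name:

ALTER ROLE fts_app SET random_page_cost = 1.0;      -- only for this role
ALTER DATABASE movies SET random_page_cost = 1.0;   -- or only for this database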

= To me, the planner should be updated to recognize the immutable
plainto_tsquery() function in the WHERE clause and choose a Bitmap
Index Scan in the first place.
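
(For what it's worth, how the function is marked can be checked in the system
catalog - a quick sketch:

SELECT proname, pg_get_function_identity_arguments(oid) AS args, provolatile
FROM pg_proc
WHERE proname IN ('plainto_tsquery', 'to_tsvector');
-- provolatile: 'i' = immutable, 's' = stable, 'v' = volatile
)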

What do you think?

Yours, Stefan



Let's look at the table fulltextsearch:

movies=# \d fulltextsearch
                           Table "public.fulltextsearch"
 Column  |  Type   |                          Modifiers
---------+---------+--------------------------------------------------------------
 id      | integer | not null default nextval('fulltextsearch_id_seq'::regclass)
 docid   | integer | default 0
 title   | text    |
 content | text    | not null

movies=# CREATE INDEX fulltextsearch_gincontent ON fulltextsearch
USING gin(to_tsvector('pg_catalog.english',content));

movies=# SELECT * FROM pg_class c WHERE relname LIKE 'fullt%';
  oid   |           name            | kind |   tuples    | pages | allvisible | toastrelid | hasindex
--------+---------------------------+------+-------------+-------+------------+------------+----------
 476289 | fulltextsearch            | r    |       27886 |   555 |          0 |     476293 | t
 503080 | fulltextsearch_gincontent | i    | 8.97135e+06 | 11133 |          0 |          0 | f
 476296 | fulltextsearch_id_seq     | S    |           1 |     1 |          0 |          0 | f
 503075 | fulltextsearch_pkey       | i    |       27886 |    79 |          0 |          0 | f
(4 rows)

= fulltextsearch_gincontent has an arbitrarily large number of tuples
(the statistics are wrong and not yet updated)

movies=#
explain (analyze,costs,timing,buffers)
select id,title,left(content,100)
from fulltextsearch
where plainto_tsquery('pg_catalog.english','good') @@
to_tsvector('pg_catalog.english',content);
= Unexpectedly, the query is fast!
See query plan http://explain.depesz.com/s/ewn

Let's update the statistics:

movies=# VACUUM ANALYZE VERBOSE fulltextsearch ;

SELECT * FROM pg_class c WHERE relname LIKE 'fullt%';
  oid   |           name            | kind | tuples | pages | allvisible | toastrelid | hasindex
--------+---------------------------+------+--------+-------+------------+------------+----------
 476289 | fulltextsearch            | r    |  27886 |   555 |        555 |     476293 | t
 503080 | fulltextsearch_gincontent | i    |  27886 | 11133 |          0 |          0 | f
 476296 | fulltextsearch_id_seq     | S    |      1 |     1 |          0 |          0 | f
 503075 | fulltextsearch_pkey       | i    |  27886 |    79 |          0 |          0 | f
(4 rows)

= Now, after having updated the statistics (see especially the tuples of
fulltextsearch_gincontent), the original query is slow!
See query plan http://explain.depesz.com/s/MQ60

Now, let's reformulate the original query and move the call to
plainto_tsquery to the FROM clause:

movies=# explain (analyze,costs,timing,buffers)
select id,title,left(content,100)
from fulltextsearch, plainto_tsquery('pg_catalog.english','good') query
where query @@ to_tsvector('pg_catalog.english',content);
= This special query is fast again!  See query plan
http://explain.depesz.com/s/FVT

Setting enable_seqscan to off also makes the query fast again: see query plan
http://explain.depesz.com/s/eOr

Finally, setting random_page_cost to 1 also helps (the default is 4):

movies=# set enable_seqscan to default;
movies=# set random_page_cost to 1.0;
= Query is fast. See query plan http://explain.depesz.com/s/M5Ke



