[PERFORM] Full text search with ORDER BY performance issue

2009-07-18 Thread Krade

Hello,

I'm having a bit of an issue with full text search (using tsvectors) on 
PostgreSQL 8.4. I have a rather large table (around 12M rows) and want 
to use full text search in it (just for one of the columns). Just doing 
a plainto_tsquery works reasonably fast (I have a GIN index on the 
column in question, "comment_tsv"), but it becomes unbearably slow when 
I want to make it order by another field ("timestamp").


Here's an example query:
SELECT * FROM a WHERE comment_tsv @@ plainto_tsquery('love') ORDER BY 
timestamp DESC LIMIT 24 OFFSET 0;


I asked in #postgresql and was told that there are two possible plans 
for this query; the first scans the BTREE timestamp index, gets the 
ordering and then filters out the rows using text search; the second 
finds all rows matching the text search using the GIN index and then 
sorts them according to that field -- this much I already knew, in fact, 
I had to drop the BTREE index on "timestamp" to prevent the planner from 
choosing the first, since the first plan is completely useless to me, 
considering the table is so huge (suggestions on how to prevent the 
planner from picking the "wrong" plan are also appreciated).


Obviously, this gets really slow when I try to search for common words 
and full text search returns a lot of rows to be ordered.


I tried to make a GIN index on ("timestamp", "comment_tsv"), (using 
btree_gin from contrib) but it doesn't really do anything -- I was told 
on IRC this is because GIN doesn't provide support for ORDER BY, only 
BTREE can do that.


Here's a couple of queries:

archive=> explain analyze select * from a where  comment_tsv @@ 
plainto_tsquery('love') order by timestamp desc limit 24 offset 0;


QUERY PLAN
--
 Limit  (cost=453248.73..453248.79 rows=24 width=281) (actual 
time=188441.047..188441.148 rows=24 loops=1)
   ->  Sort  (cost=453248.73..453882.82 rows=253635 width=281) (actual 
time=188441.043..188441.079 rows=24 loops=1)

 Sort Key: "timestamp"
 Sort Method:  top-N heapsort  Memory: 42kB
 ->  Bitmap Heap Scan on a  (cost=17782.16..446166.02 
rows=253635 width=281) (actual time=2198.930..187948.050 rows=256378 
loops=1)

   Recheck Cond: (comment_tsv @@ plainto_tsquery('love'::text))
   ->  Bitmap Index Scan on timestamp_comment_gin  
(cost=0.00..17718.75 rows=253635 width=0) (actual 
time=2113.664..2113.664 rows=259828 loops=1)
 Index Cond: (comment_tsv @@ 
plainto_tsquery('love'::text))

 Total runtime: 188442.617 ms
(9 rows)

archive=> explain analyze select * from a where  comment_tsv @@ 
plainto_tsquery('love') limit 24 offset 0;


QUERY PLAN
--
 Limit  (cost=0.00..66.34 rows=24 width=281) (actual 
time=14.632..53.647 rows=24 loops=1)
   ->  Seq Scan on a  (cost=0.00..701071.49 rows=253635 width=281) 
(actual time=14.629..53.588 rows=24 loops=1)

 Filter: (comment_tsv @@ plainto_tsquery('love'::text))
 Total runtime: 53.731 ms
(4 rows)

First one runs painfully slow.

Is there really no way to have efficient full text search results 
ordered by a separate field? I'm really open to all possibilities, at 
this point.


Thanks.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Full text search with ORDER BY performance issue

2009-07-20 Thread Krade

Hello, thanks for your replies.


On 7/20/2009 13:12, Oleg Bartunov wrote:

Hmm, everything is already written in explain :) In the first query 
253635 rows should be readed from disk and sorted, while in the
second query only 24 (random) rows readed from disk, so there is 4 
magnitudes
difference and in the worst case you should expected time for the 1st 
query

about 53*10^4 ms.
Yes, I do realize the first query is retrieving all the rows that match 
the full text search and sorting them, that's what I wanted to avoid. :) 
Since I only want 24 results at a time, I wanted to avoid having to get 
all the rows and sort them. I was wondering if there was any way to use, 
say, some index combination I'm not aware of, cluster the table 
according to an index or using a different query to get the same results.


Well, to take advantage of the gin index on (timestamp, comment_tsv), I 
suppose could do something like this:
archive=> explain analyze select * from a where comment_tsv @@ 
plainto_tsquery('love') and timestamp > cast(floor(extract(epoch from 
CURRENT_TIMESTAMP) - 864000) as integer) order by timestamp limit 24 
offset 0;


QUERY PLAN
--
Limit (cost=17326.69..17326.75 rows=24 width=281) (actual 
time=3249.192..3249.287 rows=24 loops=1)
-> Sort (cost=17326.69..17337.93 rows=4499 width=281) (actual 
time=3249.188..3249.221 rows=24 loops=1)

Sort Key: "timestamp"
Sort Method: top-N heapsort Memory: 39kB
-> Bitmap Heap Scan on a (cost=408.80..17201.05 rows=4499 width=281) 
(actual time=3223.890..3240.484 rows=5525 loops=1)
Recheck Cond: (("timestamp" > (floor((date_part('epoch'::text, now()) - 
864000::double precision)))::integer) AND (comment_tsv @@ 
plainto_tsquery('love'::text)))
-> Bitmap Index Scan on timestamp_comment_gin (cost=0.00..407.67 
rows=4499 width=0) (actual time=3222.769..3222.769 rows=11242 loops=1)
Index Cond: (("timestamp" > (floor((date_part('epoch'::text, now()) - 
864000::double precision)))::integer) AND (comment_tsv @@ 
plainto_tsquery('love'::text)))

Total runtime: 3249.957 ms
(9 rows)

Which only looks at the last 10 days and is considerably faster. Not 
perfect, but a lot better. But this requires a lot of application logic, 
for example, if I didn't get 24 results in the first query, I'd have to 
reissue the query with a larger time interval and it gets worse pretty 
fast. It strikes me as a really dumb thing to do.


I'm really hitting a brick wall here, I can't seem to be able to provide 
reasonably efficient full text search that is ordered by date rather 
than random results from the database.


On 7/20/2009 13:22, Marcin Stępnicki wrote:

What happens if you make it:


select * from (
select * from a where comment_tsv @@plainto_tsquery('love')
) xx

order by xx.timestamp desc
limit 24 offset 0;

?

Same query plan, I'm afraid.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Full text search with ORDER BY performance issue

2009-07-20 Thread Krade

Hello,

On 7/20/2009 22:42, Kevin Grittner wrote:

Have you considered keeping rows "narrow" until you've identified your
24 rows?  Something like:

SELECT * FROM a
   WHERE id in
 (
   SELECT id FROM a
 WHERE comment_tsv @@ plainto_tsquery('love')
 ORDER BY timestamp DESC
 LIMIT 24 OFFSET 0
 )
   ORDER BY timestamp DESC
;
   
Good idea, but it doesn't really seem to do much. The query times are 
roughly the same.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Full text search with ORDER BY performance issue

2009-07-20 Thread Krade

On 7/21/2009 2:13, Devin Ben-Hur wrote:
Have you tried make the full-text condition in a subselect with 
"offset 0" to stop the plan reordering?


eg:

select *
from (
  select * from a where comment_tsv @@ plainto_tsquery('love')
  offset 0
) xx
order by timestamp DESC
limit 24
offset 0;


See http://blog.endpoint.com/2009/04/offset-0-ftw.html
Yes, that does force the planner to always pick the full text index 
first rather than the timestamp index. I managed to force that by doing 
something a lot more drastic, I just dropped my timestamp index 
altogether, since I never used it for anything else. (I mentioned this 
in my original post)


Though, that comment did make me try to readd it. I was pretty 
surprised, the planner was only doing backward searches on the timestamp 
index for very common words (therefore turning multi-minute queries into 
very fast ones), as opposed to trying to use the timestamp index for all 
queries. I wonder if this is related to tweaks to the planner in 8.4 or 
if it was just my statistics that got balanced out.


I'm not entirely happy, because I still easily get minute long queries 
on common words, but where the planner choses to not use the timestamp 
index. The planner can't guess right all the time.


But I think I might just do:
select * from a where comment_tsv @@ plainto_tsquery('query') and 
timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000) 
as integer) order by timestamp desc limit 24 offset 0;


And if I get less than 24 rows, issue the regular query:

select * from a where comment_tsv @@ plainto_tsquery('query') order by 
timestamp desc limit 24 offset 0;


I pay the price of doing two queries when I could have done just one, 
and it does make almost all queries about 200 ms slower, but it does so 
at the expense of turning the few very slow queries into quick ones.


Thanks for all the help.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Full text search with ORDER BY performance issue

2009-07-21 Thread Krade

On 7/21/2009 11:32, valgog wrote:

Hi,

There is a problem with GIN and GIST indexes, that they cannot be used
by the ORDER BY. Maybe it will be a nice idea to ask Oleg to make it
possible to use the b-tree columns in GIST or GIN to make the sort
easier, but I have no idea how difficult it will be to implement it in
current GIN or GIST structures. I think Oleg or even Tom will be the
right people to ask it :) But even if it is possible it will not be
implemented at least until 8.5 that will need a year to come, so until
then...
   
Unfortunately, it's not even just the lack of ORDER BY support, 
btree_gin indexes seem to be broken under some circumstances. So I can't 
even use my idea to limit searches to the last 10 days.


See this:
http://pgsql.privatepaste.com/5219TutUMk

The first query gives bogus results. It's not using the index correctly.

timestamp_comment_gin is a GIN index on timestamp, comment_tsv. The 
timestamp column is an integer. The queries work right if I drop the 
index. Is this a bug in btree_gin?

It is possible to strip your table in several smaller ones putting
them on different machines and then splitting your query with DBLINK.
This will distribute the burden of sorting to several machines that
will have to sort smaller parts as well. After you have your 25 ids
from each of the machines, you can merge them, sort again and limit as
you wish. Doing large offsets will be still problematic but faster
anyway in most reasonable offset ranges. (Load balancing tools like
pg_pool can automate this task, but I do not have practical experience
using them for that purposes)

Yet another very interesting technology -- sphinx search (http://
www.sphinxsearch.com/). It can distribute data on several machines
automatically, but it will be probably too expensive to start using
(if your task is not your main one :)) as they do not have standard
automation scripts, it does not support live updates (so you will
always have some minutes delay), and this is a standalone service,
that needs to be maintained and configured and synchronized with our
main database separately (though you can use pg/python to access it
from postgres).

Good luck with your task :)
Yeah, I don't really have that sort of resources. This is a small hobby 
project (ie: no budget) that is growing a bit too large. I might just 
have to do text searches without time ordering.


On 7/21/2009 5:06, Scott Marlowe wrote:

Couldn't you do tge second query as a with query then run another
query to limit that result to everything greater than now()-xdays ?
   
I suppose I could, but I have no way to do a fast query that does both a 
full text match and a < or > in the same WHERE due to the issue I 
described above, so my original plan won't work. A separate BTREE 
timestamp index obviously does nothing.


And again, thank you for all the help.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance