[PERFORM] Full text search with ORDER BY performance issue
Hello,

I'm having a bit of an issue with full text search (using tsvectors) on PostgreSQL 8.4. I have a rather large table (around 12M rows) and want to use full text search on one of its columns. A plain text-search match is reasonably fast (I have a GIN index on the column in question, "comment_tsv"), but it becomes unbearably slow when I add an ORDER BY on another field ("timestamp"). Here's an example query:

    SELECT * FROM a
    WHERE comment_tsv @@ plainto_tsquery('love')
    ORDER BY timestamp DESC
    LIMIT 24 OFFSET 0;

I asked in #postgresql and was told there are two possible plans for this query: the first scans the BTREE index on "timestamp", gets the ordering, and then filters out rows using the text search; the second finds all rows matching the text search using the GIN index and then sorts them by "timestamp". This much I already knew; in fact, I had to drop the BTREE index on "timestamp" to prevent the planner from choosing the first plan, which is completely useless to me considering how huge the table is (suggestions on how to prevent the planner from picking the "wrong" plan are also appreciated).

Obviously, this gets really slow when I search for common words and the full text search returns a lot of rows to be sorted. I tried making a GIN index on ("timestamp", "comment_tsv") (using btree_gin from contrib), but it doesn't really do anything -- I was told on IRC this is because GIN doesn't provide support for ORDER BY; only BTREE can do that.
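For reference, the relevant schema looks roughly like this (the exact column list and most types are assumptions on my part; what matters is the tsvector column, the integer epoch timestamp, and the two indexes discussed above):

```sql
-- Sketch of the table and indexes described above. Column names other
-- than comment_tsv and "timestamp", and the text column feeding the
-- tsvector, are illustrative assumptions.
CREATE TABLE a (
    id          serial PRIMARY KEY,
    "timestamp" integer NOT NULL,   -- Unix epoch seconds
    comment     text,
    comment_tsv tsvector            -- to_tsvector() of comment
);

-- Plain GIN index that makes the text-search condition fast:
CREATE INDEX comment_tsv_gin ON a USING gin (comment_tsv);

-- Composite GIN index attempted via contrib/btree_gin
-- (does not help with ORDER BY, as noted above):
CREATE INDEX timestamp_comment_gin ON a USING gin ("timestamp", comment_tsv);
```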
Here's a couple of queries:

    archive=> explain analyze select * from a where comment_tsv @@ plainto_tsquery('love') order by timestamp desc limit 24 offset 0;

                                                                     QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=453248.73..453248.79 rows=24 width=281) (actual time=188441.047..188441.148 rows=24 loops=1)
       ->  Sort  (cost=453248.73..453882.82 rows=253635 width=281) (actual time=188441.043..188441.079 rows=24 loops=1)
             Sort Key: "timestamp"
             Sort Method:  top-N heapsort  Memory: 42kB
             ->  Bitmap Heap Scan on a  (cost=17782.16..446166.02 rows=253635 width=281) (actual time=2198.930..187948.050 rows=256378 loops=1)
                   Recheck Cond: (comment_tsv @@ plainto_tsquery('love'::text))
                   ->  Bitmap Index Scan on timestamp_comment_gin  (cost=0.00..17718.75 rows=253635 width=0) (actual time=2113.664..2113.664 rows=259828 loops=1)
                         Index Cond: (comment_tsv @@ plainto_tsquery('love'::text))
     Total runtime: 188442.617 ms
    (9 rows)

    archive=> explain analyze select * from a where comment_tsv @@ plainto_tsquery('love') limit 24 offset 0;

                                                                     QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=0.00..66.34 rows=24 width=281) (actual time=14.632..53.647 rows=24 loops=1)
       ->  Seq Scan on a  (cost=0.00..701071.49 rows=253635 width=281) (actual time=14.629..53.588 rows=24 loops=1)
             Filter: (comment_tsv @@ plainto_tsquery('love'::text))
     Total runtime: 53.731 ms
    (4 rows)

The first one runs painfully slow. Is there really no way to have efficient full text search results ordered by a separate field? I'm really open to all possibilities at this point.

Thanks.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Full text search with ORDER BY performance issue
Hello, thanks for your replies.

On 7/20/2009 13:12, Oleg Bartunov wrote:
> Hmm, everything is already written in explain :) In the first query,
> 253635 rows should be read from disk and sorted, while in the second
> query only 24 (random) rows are read from disk, so there is a
> difference of 4 orders of magnitude, and in the worst case you should
> expect a time for the 1st query of about 53*10^4 ms.

Yes, I do realize the first query retrieves all the rows that match the full text search and sorts them; that's exactly what I wanted to avoid. :) Since I only want 24 results at a time, I wanted to avoid having to fetch and sort every matching row. I was wondering if there is some index combination I'm not aware of, a way to cluster the table according to an index, or a different query that would get the same results.

Well, to take advantage of the GIN index on ("timestamp", "comment_tsv"), I suppose I could do something like this:

    archive=> explain analyze select * from a where comment_tsv @@ plainto_tsquery('love') and timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000) as integer) order by timestamp limit 24 offset 0;

                                                                     QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=17326.69..17326.75 rows=24 width=281) (actual time=3249.192..3249.287 rows=24 loops=1)
       ->  Sort  (cost=17326.69..17337.93 rows=4499 width=281) (actual time=3249.188..3249.221 rows=24 loops=1)
             Sort Key: "timestamp"
             Sort Method:  top-N heapsort  Memory: 39kB
             ->  Bitmap Heap Scan on a  (cost=408.80..17201.05 rows=4499 width=281) (actual time=3223.890..3240.484 rows=5525 loops=1)
                   Recheck Cond: (("timestamp" > (floor((date_part('epoch'::text, now()) - 864000::double precision)))::integer) AND (comment_tsv @@ plainto_tsquery('love'::text)))
                   ->  Bitmap Index Scan on timestamp_comment_gin  (cost=0.00..407.67 rows=4499 width=0) (actual time=3222.769..3222.769 rows=11242 loops=1)
                         Index Cond: (("timestamp" > (floor((date_part('epoch'::text, now()) - 864000::double precision)))::integer) AND (comment_tsv @@ plainto_tsquery('love'::text)))
     Total runtime: 3249.957 ms
    (9 rows)

This only looks at the last 10 days and is considerably faster. Not perfect, but a lot better. But it requires a lot of application logic: for example, if I don't get 24 results from the first query, I have to reissue it with a larger time interval, and it gets worse pretty fast. It strikes me as a really dumb thing to do.

I'm really hitting a brick wall here; I can't seem to provide reasonably efficient full text search that is ordered by date rather than returning effectively random matches from the database.

On 7/20/2009 13:22, Marcin Stępnicki wrote:
> What happens if you make it:
>
>     select * from (
>         select * from a where comment_tsv @@ plainto_tsquery('love')
>     ) xx
>     order by xx.timestamp desc
>     limit 24 offset 0;
>
> ?

Same query plan, I'm afraid.
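The "reissue with a larger time interval" application logic I'm describing would look roughly like this (a Python sketch; `run_query` is a hypothetical stand-in for executing the time-bounded search with the given epoch cutoff, or the unbounded slow query when the cutoff is None):

```python
# Sketch of the retry-with-wider-window fallback described above.
# run_query(cutoff) stands in for something like:
#   SELECT * FROM a
#   WHERE comment_tsv @@ plainto_tsquery(%s)
#     AND timestamp > %(cutoff)s        -- omitted when cutoff is None
#   ORDER BY timestamp DESC LIMIT 24;
import time

WINDOW = 864_000  # 10 days in seconds, same constant as in the query above
PAGE = 24

def search_recent_first(run_query, now=None):
    """Try increasingly wide time windows before falling back to the
    unbounded (slow) full-text query."""
    now = int(now if now is not None else time.time())
    for scale in (1, 10, 100):  # look back 10, then 100, then 1000 days
        rows = run_query(cutoff=now - WINDOW * scale)
        if len(rows) >= PAGE:
            return rows[:PAGE]
    # Rare words: no window had enough matches, pay for the full query.
    return run_query(cutoff=None)[:PAGE]
```

The downside, as noted, is that a search for a rare word issues several queries before finally running the slow unbounded one.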
Re: [PERFORM] Full text search with ORDER BY performance issue
Hello,

On 7/20/2009 22:42, Kevin Grittner wrote:
> Have you considered keeping rows "narrow" until you've identified your
> 24 rows?  Something like:
>
>     SELECT * FROM a
>     WHERE id IN (
>         SELECT id FROM a
>         WHERE comment_tsv @@ plainto_tsquery('love')
>         ORDER BY timestamp DESC
>         LIMIT 24 OFFSET 0
>     )
>     ORDER BY timestamp DESC;

Good idea, but it doesn't really seem to do much. The query times are roughly the same.
Re: [PERFORM] Full text search with ORDER BY performance issue
On 7/21/2009 2:13, Devin Ben-Hur wrote:
> Have you tried making the full-text condition a subselect with
> "offset 0" to stop the plan reordering? eg:
>
>     select * from (
>         select * from a where comment_tsv @@ plainto_tsquery('love')
>         offset 0
>     ) xx
>     order by timestamp DESC
>     limit 24 offset 0;
>
> See http://blog.endpoint.com/2009/04/offset-0-ftw.html

Yes, that does force the planner to always pick the full text index first rather than the timestamp index. I had managed to force that by doing something a lot more drastic: I just dropped my timestamp index altogether, since I never used it for anything else (I mentioned this in my original post).

That comment did make me try re-adding it, though. I was pretty surprised: the planner now only does backward scans on the timestamp index for very common words (turning multi-minute queries into very fast ones), as opposed to trying to use the timestamp index for all queries. I wonder whether this is due to planner tweaks in 8.4 or whether my statistics just got balanced out. I'm still not entirely happy, because I still easily get minute-long queries on common words where the planner chooses not to use the timestamp index. The planner can't guess right all the time.

But I think I might just do:

    select * from a
    where comment_tsv @@ plainto_tsquery('query')
      and timestamp > cast(floor(extract(epoch from CURRENT_TIMESTAMP) - 864000) as integer)
    order by timestamp desc
    limit 24 offset 0;

And if I get fewer than 24 rows, issue the regular query:

    select * from a
    where comment_tsv @@ plainto_tsquery('query')
    order by timestamp desc
    limit 24 offset 0;

I pay the price of doing two queries when I could have done just one, and it makes almost all queries about 200 ms slower, but it turns the few very slow queries into quick ones.

Thanks for all the help.
Re: [PERFORM] Full text search with ORDER BY performance issue
On 7/21/2009 11:32, valgog wrote:
> Hi,
>
> There is a problem with GIN and GiST indexes: they cannot be used by
> ORDER BY. It might be a nice idea to ask Oleg to make it possible to
> use b-tree columns in GiST or GIN to make the sort easier, but I have
> no idea how difficult that would be to implement in the current GIN or
> GiST structures. I think Oleg or even Tom would be the right people to
> ask :) But even if it is possible, it will not be implemented until at
> least 8.5, which will need a year to come, so until then...

Unfortunately, it's not even just the lack of ORDER BY support; btree_gin indexes seem to be broken under some circumstances, so I can't even use my idea of limiting searches to the last 10 days. See this:

    http://pgsql.privatepaste.com/5219TutUMk

The first query gives bogus results; it's not using the index correctly. timestamp_comment_gin is a GIN index on ("timestamp", "comment_tsv"), and the timestamp column is an integer. The queries work correctly if I drop the index. Is this a bug in btree_gin?

> It is possible to split your table into several smaller ones, put them
> on different machines, and then split your query with DBLINK. This
> will distribute the burden of sorting to several machines, each of
> which only has to sort a smaller part. After you have your 25 ids from
> each of the machines, you can merge them, sort again and limit as you
> wish. Doing large offsets will still be problematic, but faster anyway
> in most reasonable offset ranges. (Load balancing tools like pg_pool
> can automate this task, but I do not have practical experience using
> them for that purpose.)
>
> Yet another very interesting technology: Sphinx search
> (http://www.sphinxsearch.com/). It can distribute data over several
> machines automatically, but it will probably be too expensive to start
> using (if this task is not your main one :)), as they do not have
> standard automation scripts, it does not support live updates (so you
> will always have some minutes of delay), and it is a standalone
> service that needs to be maintained, configured and synchronized with
> your main database separately (though you can use pl/python to access
> it from postgres).
>
> Good luck with your task :)

Yeah, I don't really have that sort of resources. This is a small hobby project (ie: no budget) that is growing a bit too large. I might just have to do text searches without time ordering.

On 7/21/2009 5:06, Scott Marlowe wrote:
> Couldn't you do the second query as a WITH query, then run another
> query to limit that result to everything greater than now()-xdays?

I suppose I could, but due to the btree_gin issue I described above, I have no way to do a fast query that combines a full text match and a < or > comparison in the same WHERE clause, so my original plan won't work. A separate BTREE timestamp index obviously doesn't help here.

And again, thank you for all the help.
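The shard-and-merge idea valgog describes would amount to something like this on the application side (a sketch with hypothetical per-shard result lists; in practice each list would come from a dblink query that does its own ORDER BY timestamp DESC LIMIT 24):

```python
# Sketch of merging per-shard top-N results, as suggested above.
# Each shard returns its own top PAGE rows already ordered by
# timestamp descending; merging the pre-sorted lists and taking the
# first PAGE gives the global top PAGE without a full re-sort.
import heapq

PAGE = 24

def merge_shard_results(shard_results):
    """shard_results: list of lists of (timestamp, row_id) tuples,
    each already sorted by timestamp descending."""
    # heapq.merge yields ascending order, so merge on the negated key.
    merged = heapq.merge(*shard_results, key=lambda row: -row[0])
    return list(merged)[:PAGE]
```

Large OFFSETs remain a problem with this scheme, since each shard would have to return OFFSET + LIMIT rows for the merge to be correct.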