Re: [PERFORM] pg_trgm performance
On 2/24/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote: Thanks for your time. GiN version, short: - Bitmap Heap Scan on tags (cost=8.64..151.79 rows=41 width=0) (actual time=5.555..30.157 rows=7 loops=1) Filter: (title % 'foo'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..8.63 rows=41 width=0) (actual time=2.857..2.857 rows= loops=1) Index Cond: (title % 'foo'::text) This is currently the worst case in the gist - gin comparison because in the index scan, gin version doesn't have the length of the indexed string. So it returns a lot of rows which have every trigram of your search string but has in fact a low similarity due to the length of the indexed string ( rows - 7 rows). It cannot be fixed at the moment due to the way GIN indexes work. So, the GiN version seems to be a bit faster for long queries, but it's still too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms for these three queries, so for the longer queries, the gain is only about a factor two. (By the way, I would like to stress that this is not my personal music collection! :-P) The fact is that pg_trgm is designed to index words and not to index long sentences. I'm not that surprised it's slow in your case. It's also my case but following the instructions in README.pg_trgm I created a dictionary of words using tsearch2 (stat function) and I use pg_trgm on this dictionary to find similar words in my dictionary. For example, I rewrite the search: auberge cevenes as: (auberge | auberges | aubberge | auberg) (ceven | cene | cevenol | cevennes) using pg_trgm and my query can find Auberge des Cévennes (currently it's limited to the 4th most similar words but I can change it easily). -- Guillaume ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] pg_trgm performance
On Mon, 26 Feb 2007, Guillaume Smet wrote: On 2/24/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote: Thanks for your time. GiN version, short: - Bitmap Heap Scan on tags (cost=8.64..151.79 rows=41 width=0) (actual time=5.555..30.157 rows=7 loops=1) Filter: (title % 'foo'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..8.63 rows=41 width=0) (actual time=2.857..2.857 rows= loops=1) Index Cond: (title % 'foo'::text) This is currently the worst case in the gist - gin comparison because in the index scan, gin version doesn't have the length of the indexed string. So it returns a lot of rows which have every trigram of your search string but has in fact a low similarity due to the length of the indexed string ( rows - 7 rows). It cannot be fixed at the moment due to the way GIN indexes work. So, the GiN version seems to be a bit faster for long queries, but it's still too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms for these three queries, so for the longer queries, the gain is only about a factor two. (By the way, I would like to stress that this is not my personal music collection! :-P) The fact is that pg_trgm is designed to index words and not to index long sentences. I'm not that surprised it's slow in your case. It's also my case but following the instructions in README.pg_trgm I created a dictionary of words using tsearch2 (stat function) and I use pg_trgm on this dictionary to find similar words in my dictionary. For example, I rewrite the search: auberge cevenes as: (auberge | auberges | aubberge | auberg) (ceven | cene | cevenol | cevennes) using pg_trgm and my query can find Auberge des C?vennes (currently it's limited to the 4th most similar words but I can change it easily). Did you rewrite query manually or use rewrite feature of tsearch2 ? -- Guillaume ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] pg_trgm performance
On 2/26/07, Oleg Bartunov oleg@sai.msu.su wrote: Did you rewrite query manually or use rewrite feature of tsearch2 ? Currently, it's manual. I perform a pg_trgm query for each word of the search words (a few stop words excluded) and I generate the ts_query with the similar words instead of using the search words. Is there any benefit of using rewrite() in this case? -- Guillaume ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] pg_trgm performance
On Mon, 26 Feb 2007, Guillaume Smet wrote: On 2/26/07, Oleg Bartunov oleg@sai.msu.su wrote: Did you rewrite query manually or use rewrite feature of tsearch2 ? Currently, it's manual. I perform a pg_trgm query for each word of the search words (a few stop words excluded) and I generate the ts_query with the similar words instead of using the search words. Is there any benefit of using rewrite() in this case? not really, just a matter of possible interesting architectural design. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] pg_trgm performance
On Sat, Feb 24, 2007 at 02:04:36AM +0100, Guillaume Smet wrote: Could you post EXPLAIN ANALYZE for both queries (after 2 or 3 runs)? GIST version, short: amarok=# explain analyze select count(*) from tags where title % 'foo'; QUERY PLAN -- Aggregate (cost=147.84..147.85 rows=1 width=0) (actual time=16.873..16.875 rows=1 loops=1) - Bitmap Heap Scan on tags (cost=4.59..147.74 rows=41 width=0) (actual time=16.828..16.850 rows=7 loops=1) Recheck Cond: (title % 'foo'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..4.58 rows=41 width=0) (actual time=16.818..16.818 rows=7 loops=1) Index Cond: (title % 'foo'::text) Total runtime: 16.935 ms (6 rows) GiN version, short: amarok=# explain analyze select count(*) from tags where title % 'foo'; QUERY PLAN --- Aggregate (cost=151.89..151.90 rows=1 width=0) (actual time=30.197..30.199 rows=1 loops=1) - Bitmap Heap Scan on tags (cost=8.64..151.79 rows=41 width=0) (actual time=5.555..30.157 rows=7 loops=1) Filter: (title % 'foo'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..8.63 rows=41 width=0) (actual time=2.857..2.857 rows= loops=1) Index Cond: (title % 'foo'::text) Total runtime: 30.292 ms (6 rows) GIST version, medium: amarok=# explain analyze select count(*) from tags where title % 'chestnuts roasting on an 0pen fire'; QUERY PLAN Aggregate (cost=147.84..147.85 rows=1 width=0) (actual time=216.149..216.151 rows=1 loops=1) - Bitmap Heap Scan on tags (cost=4.59..147.74 rows=41 width=0) (actual time=216.135..216.137 rows=1 loops=1) Recheck Cond: (title % 'chestnuts roasting on an 0pen fire'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..4.58 rows=41 width=0) (actual time=216.124..216.124 rows=1 loops=1) Index Cond: (title % 'chestnuts roasting on an 0pen fire'::text) Total runtime: 216.214 ms (6 rows) amarok=# explain analyze select count(*) from tags where title % 'chestnuts roasting on an 0pen fire'; QUERY PLAN - Aggregate (cost=151.89..151.90 rows=1 width=0) (actual time=156.310..156.312 rows=1 loops=1) - Bitmap Heap Scan on tags (cost=8.64..151.79 rows=41 width=0) (actual time=156.205..156.299 rows=1 loops=1) Filter: (title % 'chestnuts roasting on an 0pen fire'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..8.63 rows=41 width=0) (actual time=155.748..155.748 rows=36 loops=1) Index Cond: (title % 'chestnuts roasting on an 0pen fire'::text) Total runtime: 156.376 ms (6 rows) GIST version, long: amarok=# explain analyze select count(*) from tags where title % 'Donaueschingen (Peter Kruders Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'; ; QUERY PLAN -- Aggregate (cost=147.84..147.85 rows=1 width=0) (actual time=597.115..597.117 rows=1 loops=1) - Bitmap Heap Scan on tags (cost=4.59..147.74 rows=41 width=0) (actual time=597.102..597.104 rows=1 loops=1) Recheck Cond: (title % 'Donaueschingen (Peter Kruders Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..4.58 rows=41 width=0) (actual time=597.093..597.093 rows=1 loops=1) Index Cond: (title % 'Donaueschingen (Peter Kruders Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text) Total runtime: 597.173 ms (6 rows) GiN version, long: amarok=# explain analyze select count(*) from tags where title % 'Donaueschingen (Peter Kruders Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'; ; QUERY PLAN
Re: [PERFORM] pg_trgm performance
Florian, Steinar, Could you try to see if the GIN implementation of pg_trgm is faster in your cases? Florian, instead of using WHERE similarity(...) 0.4, you should use set_limit (SELECT set_limit(0.4);). I posted it on -patches and it is available here: http://people.openwide.fr/~gsmet/postgresql/pg_trgm_gin3.diff . The patch is against HEAD but It applies on 8.2 without any problem. After applying this patch and installing pg_trgm.sql (you should uninstall pg_trgm before compiling using the old uninstall script), then you can create: CREATE INDEX idx_table_word ON table USING gin(word gin_trgm_ops); 17 characters is quite long so I'm not sure it will help you because it usually has to recheck a high number of rows due to the GIN implementation but I'd like to know if it's faster or slower in this case. If your data are not private and you don't have the time to test it, I can test it here without any problem. Thanks. -- Guillaume ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] pg_trgm performance
On Sat, Feb 24, 2007 at 12:09:41AM +0100, Guillaume Smet wrote: Could you try to see if the GIN implementation of pg_trgm is faster in your cases? I'm sorry, I can no longer remember where I needed pg_trgm. Simple testing of your patch seems to indicate that the GiN version is about 65% _slower_ (18ms vs. 30ms) for a test data set I found lying around, but I remember that on the data set I needed it, the GIST version was a lot slower than that (think 3-400ms). The 18 vs. 30ms test is a random Amarok database, on 8.2.3 (Debian). Sorry I couldn't be of more help. /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] pg_trgm performance
Hi Steinar, On 2/24/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote: I'm sorry, I can no longer remember where I needed pg_trgm. Simple testing of your patch seems to indicate that the GiN version is about 65% _slower_ (18ms vs. 30ms) for a test data set I found lying around, but I remember that on the data set I needed it, the GIST version was a lot slower than that (think 3-400ms). The 18 vs. 30ms test is a random Amarok database, on 8.2.3 (Debian). Could you post EXPLAIN ANALYZE for both queries (after 2 or 3 runs)? And if you can provide EXPLAIN ANALYZE for a couple of searches (short length, medium length and long) in both cases, it could be nice too. The GiN version is not selective enough currently compared to GiST. It generally finds the matching rows faster but it has a slower recheck cond so it's sometimes interesting (in my case) and sometimes not that interesting (it seems to be your case). Thanks. -- Guillaume ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[PERFORM] pg_trgm performance
I've got a table with a few million rows, consisting of a single text column. The average length is about 17 characters. For the sake of an experiment, I put a trigram index on that table. Unfortunately, % queries without smallish LIMITs are ridiculously slow (they take longer than an hour). A full table scan with a WHERE similarity(...) = 0.4 clause completes in just a couple of minutes. The queries only select a few hundred rows, so an index scan has got a real chance to be faster than a sequential scan. Am I missing something? Or are trigrams just a poor match for my data set? Are the individual strings too long, maybe? (This is with PostgreSQL 8.2.0, BTW.) -- Florian Weimer[EMAIL PROTECTED] BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] pg_trgm performance
On Mon, Jan 15, 2007 at 11:16:36AM +0100, Florian Weimer wrote: Am I missing something? Or are trigrams just a poor match for my data set? Are the individual strings too long, maybe? FWIW, I've seen the same results with 8.1.x. /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 6: explain analyze is your friend