Re: [PERFORM] pg_trgm performance

2007-02-26 Thread Guillaume Smet

On 2/24/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote:

Thanks for your time.


GiN version, short:
   -  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) (actual 
time=5.555..30.157 rows=7 loops=1)
 Filter: (title % 'foo'::text)
 -  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 width=0) 
(actual time=2.857..2.857 rows= loops=1)
   Index Cond: (title % 'foo'::text)


This is currently the worst case in the gist - gin comparison because
in the index scan, gin version doesn't have the length of the indexed
string. So it returns a lot of rows which have every trigram of your
search string but has in fact a low similarity due to the length of
the indexed string ( rows - 7 rows).
It cannot be fixed at the moment due to the way GIN indexes work.


So, the GiN version seems to be a bit faster for long queries, but it's still
too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms for these
three queries, so for the longer queries, the gain is only about a factor
two. (By the way, I would like to stress that this is not my personal music
collection! :-P)


The fact is that pg_trgm is designed to index words and not to index
long sentences. I'm not that surprised it's slow in your case.

It's also my case but following the instructions in README.pg_trgm I
created a dictionary of words using tsearch2 (stat function) and I use
pg_trgm on this dictionary to find similar words in my dictionary.

For example, I rewrite the search:
auberge cevenes
as:
(auberge | auberges | aubberge | auberg)  (ceven | cene | cevenol | cevennes)
using pg_trgm and my query can find Auberge des Cévennes (currently
it's limited to the 4th most similar words but I can change it
easily).

--
Guillaume

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] pg_trgm performance

2007-02-26 Thread Oleg Bartunov

On Mon, 26 Feb 2007, Guillaume Smet wrote:


On 2/24/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote:

Thanks for your time.


GiN version, short:
   -  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) 
(actual time=5.555..30.157 rows=7 loops=1)

 Filter: (title % 'foo'::text)
 -  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 
width=0) (actual time=2.857..2.857 rows= loops=1)

   Index Cond: (title % 'foo'::text)


This is currently the worst case in the gist - gin comparison because
in the index scan, gin version doesn't have the length of the indexed
string. So it returns a lot of rows which have every trigram of your
search string but has in fact a low similarity due to the length of
the indexed string ( rows - 7 rows).
It cannot be fixed at the moment due to the way GIN indexes work.

So, the GiN version seems to be a bit faster for long queries, but it's 
still
too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms for 
these

three queries, so for the longer queries, the gain is only about a factor
two. (By the way, I would like to stress that this is not my personal music
collection! :-P)


The fact is that pg_trgm is designed to index words and not to index
long sentences. I'm not that surprised it's slow in your case.

It's also my case but following the instructions in README.pg_trgm I
created a dictionary of words using tsearch2 (stat function) and I use
pg_trgm on this dictionary to find similar words in my dictionary.

For example, I rewrite the search:
auberge cevenes
as:
(auberge | auberges | aubberge | auberg)  (ceven | cene | cevenol | 
cevennes)

using pg_trgm and my query can find Auberge des C?vennes (currently
it's limited to the 4th most similar words but I can change it
easily).


Did you rewrite query manually or use rewrite feature of tsearch2 ?



--
Guillaume

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] pg_trgm performance

2007-02-26 Thread Guillaume Smet

On 2/26/07, Oleg Bartunov oleg@sai.msu.su wrote:

Did you rewrite query manually or use rewrite feature of tsearch2 ?


Currently, it's manual. I perform a pg_trgm query for each word of the
search words (a few stop words excluded) and I generate the ts_query
with the similar words instead of using the search words.
Is there any benefit of using rewrite() in this case?

--
Guillaume

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] pg_trgm performance

2007-02-26 Thread Oleg Bartunov

On Mon, 26 Feb 2007, Guillaume Smet wrote:


On 2/26/07, Oleg Bartunov oleg@sai.msu.su wrote:

Did you rewrite query manually or use rewrite feature of tsearch2 ?


Currently, it's manual. I perform a pg_trgm query for each word of the
search words (a few stop words excluded) and I generate the ts_query
with the similar words instead of using the search words.
Is there any benefit of using rewrite() in this case?


not really, just a matter of possible interesting architectural design.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] pg_trgm performance

2007-02-24 Thread Steinar H. Gunderson
On Sat, Feb 24, 2007 at 02:04:36AM +0100, Guillaume Smet wrote:
 Could you post EXPLAIN ANALYZE for both queries (after 2 or 3 runs)?

GIST version, short:

amarok=# explain analyze select count(*) from tags where title % 'foo';
QUERY PLAN  
  
--
 Aggregate  (cost=147.84..147.85 rows=1 width=0) (actual time=16.873..16.875 
rows=1 loops=1)
   -  Bitmap Heap Scan on tags  (cost=4.59..147.74 rows=41 width=0) (actual 
time=16.828..16.850 rows=7 loops=1)
 Recheck Cond: (title % 'foo'::text)
 -  Bitmap Index Scan on trgm_idx  (cost=0.00..4.58 rows=41 width=0) 
(actual time=16.818..16.818 rows=7 loops=1)
   Index Cond: (title % 'foo'::text)
 Total runtime: 16.935 ms
(6 rows)

GiN version, short:

amarok=# explain analyze select count(*) from tags where title % 'foo';
QUERY PLAN  
   
---
 Aggregate  (cost=151.89..151.90 rows=1 width=0) (actual time=30.197..30.199 
rows=1 loops=1)
   -  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) (actual 
time=5.555..30.157 rows=7 loops=1)
 Filter: (title % 'foo'::text)
 -  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 width=0) 
(actual time=2.857..2.857 rows= loops=1)
   Index Cond: (title % 'foo'::text)
 Total runtime: 30.292 ms
(6 rows)


GIST version, medium:

amarok=# explain analyze select count(*) from tags where title % 'chestnuts 
roasting on an 0pen fire';
 QUERY PLAN 


 Aggregate  (cost=147.84..147.85 rows=1 width=0) (actual time=216.149..216.151 
rows=1 loops=1)
   -  Bitmap Heap Scan on tags  (cost=4.59..147.74 rows=41 width=0) (actual 
time=216.135..216.137 rows=1 loops=1)
 Recheck Cond: (title % 'chestnuts roasting on an 0pen fire'::text)
 -  Bitmap Index Scan on trgm_idx  (cost=0.00..4.58 rows=41 width=0) 
(actual time=216.124..216.124 rows=1 loops=1)
   Index Cond: (title % 'chestnuts roasting on an 0pen fire'::text)
 Total runtime: 216.214 ms
(6 rows)


amarok=# explain analyze select count(*) from tags where title % 'chestnuts 
roasting on an 0pen fire';
 QUERY PLAN 
 
-
 Aggregate  (cost=151.89..151.90 rows=1 width=0) (actual time=156.310..156.312 
rows=1 loops=1)
   -  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0) (actual 
time=156.205..156.299 rows=1 loops=1)
 Filter: (title % 'chestnuts roasting on an 0pen fire'::text)
 -  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41 width=0) 
(actual time=155.748..155.748 rows=36 loops=1)
   Index Cond: (title % 'chestnuts roasting on an 0pen fire'::text)
 Total runtime: 156.376 ms
(6 rows)


GIST version, long:

amarok=# explain analyze select count(*) from tags where title % 
'Donaueschingen (Peter Kruders 
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'; 
;
  QUERY PLAN
  
--
 Aggregate  (cost=147.84..147.85 rows=1 width=0) (actual time=597.115..597.117 
rows=1 loops=1)
   -  Bitmap Heap Scan on tags  (cost=4.59..147.74 rows=41 width=0) (actual 
time=597.102..597.104 rows=1 loops=1)
 Recheck Cond: (title % 'Donaueschingen (Peter Kruders 
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text)
 -  Bitmap Index Scan on trgm_idx  (cost=0.00..4.58 rows=41 width=0) 
(actual time=597.093..597.093 rows=1 loops=1)
   Index Cond: (title % 'Donaueschingen (Peter Kruders 
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'::text)
 Total runtime: 597.173 ms
(6 rows)


GiN version, long:

amarok=# explain analyze select count(*) from tags where title % 
'Donaueschingen (Peter Kruders 
Donaudampfschifffahrtsgesellschaftskapitänskajütenremix)'; 
;
  QUERY PLAN
  

Re: [PERFORM] pg_trgm performance

2007-02-23 Thread Guillaume Smet

Florian, Steinar,

Could you try to see if the GIN implementation of pg_trgm is faster in
your cases?

Florian, instead of using WHERE similarity(...)  0.4, you should use
set_limit (SELECT set_limit(0.4);).

I posted it on -patches and it is available here:
http://people.openwide.fr/~gsmet/postgresql/pg_trgm_gin3.diff .

The patch is against HEAD but It applies on 8.2 without any problem.

After applying this patch and installing pg_trgm.sql (you should
uninstall pg_trgm before compiling using the old uninstall script),
then you can create:
CREATE INDEX idx_table_word ON table USING gin(word gin_trgm_ops);

17 characters is quite long so I'm not sure it will help you because
it usually has to recheck a high number of rows due to the GIN
implementation but I'd like to know if it's faster or slower in this
case.

If your data are not private and you don't have the time to test it, I
can test it here without any problem.

Thanks.

--
Guillaume

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] pg_trgm performance

2007-02-23 Thread Steinar H. Gunderson
On Sat, Feb 24, 2007 at 12:09:41AM +0100, Guillaume Smet wrote:
 Could you try to see if the GIN implementation of pg_trgm is faster in
 your cases?

I'm sorry, I can no longer remember where I needed pg_trgm. Simple testing of
your patch seems to indicate that the GiN version is about 65% _slower_ (18ms
vs. 30ms) for a test data set I found lying around, but I remember that on
the data set I needed it, the GIST version was a lot slower than that (think
3-400ms). The 18 vs. 30ms test is a random Amarok database, on 8.2.3
(Debian).

Sorry I couldn't be of more help.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] pg_trgm performance

2007-02-23 Thread Guillaume Smet

Hi Steinar,

On 2/24/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote:

I'm sorry, I can no longer remember where I needed pg_trgm. Simple testing of
your patch seems to indicate that the GiN version is about 65% _slower_ (18ms
vs. 30ms) for a test data set I found lying around, but I remember that on
the data set I needed it, the GIST version was a lot slower than that (think
3-400ms). The 18 vs. 30ms test is a random Amarok database, on 8.2.3
(Debian).


Could you post EXPLAIN ANALYZE for both queries (after 2 or 3 runs)?
And if you can provide  EXPLAIN ANALYZE for a couple of searches
(short length, medium length and long) in both cases, it could be nice
too.

The GiN version is not selective enough currently compared to GiST. It
generally finds the matching rows faster but it has a slower recheck
cond so it's sometimes interesting (in my case) and sometimes not that
interesting (it seems to be your case).

Thanks.

--
Guillaume

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[PERFORM] pg_trgm performance

2007-01-15 Thread Florian Weimer
I've got a table with a few million rows, consisting of a single text
column.  The average length is about 17 characters.  For the sake of
an experiment, I put a trigram index on that table.  Unfortunately, %
queries without smallish LIMITs are ridiculously slow (they take
longer than an hour).  A full table scan with a WHERE similarity(...)
= 0.4 clause completes in just a couple of minutes.  The queries
only select a few hundred rows, so an index scan has got a real chance
to be faster than a sequential scan.

Am I missing something?  Or are trigrams just a poor match for my data
set?  Are the individual strings too long, maybe?

(This is with PostgreSQL 8.2.0, BTW.)

-- 
Florian Weimer[EMAIL PROTECTED]
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] pg_trgm performance

2007-01-15 Thread Steinar H. Gunderson
On Mon, Jan 15, 2007 at 11:16:36AM +0100, Florian Weimer wrote:
 Am I missing something?  Or are trigrams just a poor match for my data
 set?  Are the individual strings too long, maybe?

FWIW, I've seen the same results with 8.1.x.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 6: explain analyze is your friend