Re: [PATCHES] First implementation of GIN for pg_trgm

2007-02-22 Thread Teodor Sigaev

 From a previous discussion with Teodor, it would be better to store an
int in the index instead of a text (it takes less space and is
faster). I couldn't find any example so if anyone has an advice to fix
that, it's welcome (mostly how to pack the trigram into an int instead
of a text).


Something like that:
trg = generate_trgm(VARDATA(text), VARSIZE(text) - VARHDRSZ);
nentries = ARRNELEM(trg);
if ( nentries  0 )
{
*entries = palloc(sizeof(Datum)*nentries);
for(i=0;inentries;i++)
{
int tmp=0;
trgm *ptr = GETARR(trg)+i;

CPTRGM(tmp, ptr);
tmp = 8;
entries[i] = Int32GetDatum(tmp);
}
}
Do not forget to change CREATE OPERATOR CLASS accordingly.



The last problem is that similarity calculated in the GIN index is
higher than the one with GIST so I have to set the trgm_limit quite
high to have decent results (a limit of 0.8 instead of 0.3 seems to be
quite good).
AFAICS, it comes from the fact that I couldn't find any way to get the
length of the indexed trigram which is taken into account with GIST so
we're not exactly filtering the results in the same way.
Does anyone have an idea on how to fix this point?


For calculating similarity, you should have three value: length of first word 
(let it be a indexed text) in trigrams, length of second word (query word) and
number of the same trigrams on both words. It's a pity, but during index scan we 
don't know length of indexed text. So, in index scan (consistent function) we 
could not compute exact value of similarity, but we can compute lower limit.
For example, if our query has 50 trigrams and only one of them is a common for 
indexed value and query we can conclude that indexed value can not be similar to 
our query. So, our consistent function should say FALSE when indexed value is 
not similar to query exactly and TRUE in opposite case. Let lquery is a length 
of query and ncommon is a number of common trigrams (look at cnt_sml() 
function), and consistent function should be:

#ifdef DIVUNION
/* original formula is: count/(lvalue+lquery-lcommon), so
   with any lvalue  0 resulting similarity is smaller than
   computed below */
 return ( count/(lquery-lcommon)  limit ) ? TRUE : FALSE;
#else
/* original formula is: count/max(lvalue,lquery) - the same discourse */
return ( count/lquery  limit ) ? TRUE : FALSE;
#endif

Now consistent function doesn't guarantee exact result, so we should mark '%' 
operator in CREATE OPERATOR CLASS with RECHECK option.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] First implementation of GIN for pg_trgm

2007-02-22 Thread Guillaume Smet

On 2/22/07, Teodor Sigaev [EMAIL PROTECTED] wrote:

  From a previous discussion with Teodor, it would be better to store an
 int in the index instead of a text (it takes less space and is
 faster). I couldn't find any example so if anyone has an advice to fix
 that, it's welcome (mostly how to pack the trigram into an int instead
 of a text).

Something like that: [snip]


I worked on that this morning before I got your email. You'll find
attached a second version of the patch using int4 to store the
trigrams (it's not exactly what you gave me but it should work I
think).

I didn't see any improvement in terms of size of the index (14 MB for
642 738 rows in the index in both cases) or speed.
Our dictionary table contains 78367 words and its size is 3 MB.

Did I miss something?

If there's no benefit to use an int4, I propose we keep the text
version because it will be better if we add UTF-8 support someday.


our query. So, our consistent function should say FALSE when indexed value is
not similar to query exactly and TRUE in opposite case. Let lquery is a length
of query and ncommon is a number of common trigrams (look at cnt_sml()
function), and consistent function should be:


Yes, that's what I did in my patch if I understand you correctly but
it's not very selective IMHO.


Now consistent function doesn't guarantee exact result, so we should mark '%'
operator in CREATE OPERATOR CLASS with RECHECK option.


The attached patch adds a RECHECK too. It seems to work correctly but
the RECHECK COND costs a lot of time :/.

At the moment, I have the following results:
GIST:
---
test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gist WHERE word % 'aub' ORDER BY sml DESC;
  QUERY PLAN

Sort  (cost=204.40..204.59 rows=78 width=11) (actual
time=98.028..98.132 rows=50 loops=1)
  Sort Key: similarity('aub'::text, word)
  -  Bitmap Heap Scan on lieu_mots_gist  (cost=4.88..201.95 rows=78
width=11) (actual time=97.575..97.861 rows=50 loops=1)
Recheck Cond: (word % 'aub'::text)
-  Bitmap Index Scan on idx_word_gist  (cost=0.00..4.86
rows=78 width=0) (actual time=97.539..97.539 rows=50 loops=1)
  Index Cond: (word % 'aub'::text)
Total runtime: 98.303 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gist WHERE word % 'auberge' ORDER BY sml DESC;
   QUERY PLAN
--
Sort  (cost=204.40..204.59 rows=78 width=11) (actual
time=135.128..135.196 rows=41 loops=1)
  Sort Key: similarity('auberge'::text, word)
  -  Bitmap Heap Scan on lieu_mots_gist  (cost=4.88..201.95 rows=78
width=11) (actual time=134.829..135.016 rows=41 loops=1)
Recheck Cond: (word % 'auberge'::text)
-  Bitmap Index Scan on idx_word_gist  (cost=0.00..4.86
rows=78 width=0) (actual time=134.797..134.797 rows=41 loops=1)
  Index Cond: (word % 'auberge'::text)
Total runtime: 135.335 ms
(7 rows)

With GIN:


test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gin WHERE word % 'aub' ORDER BY sml DESC;
  QUERY PLAN
-
Sort  (cost=208.45..208.64 rows=78 width=11) (actual
time=60.409..60.513 rows=50 loops=1)
  Sort Key: similarity('aub'::text, word)
  -  Bitmap Heap Scan on lieu_mots_gin  (cost=8.93..205.99 rows=78
width=11) (actual time=10.279..60.228 rows=50 loops=1)
Filter: (word % 'aub'::text)
-  Bitmap Index Scan on idx_word_gin  (cost=0.00..8.91
rows=78 width=0) (actual time=10.069..10.069 rows=6676 loops=1)
  Index Cond: (word % 'aub'::text)
Total runtime: 60.688 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gin WHERE word % 'auberge' ORDER BY sml DESC;
  QUERY PLAN

Sort  (cost=208.45..208.64 rows=78 width=11) (actual
time=50.293..50.381 rows=41 loops=1)
  Sort Key: similarity('auberge'::text, word)
  -  Bitmap Heap Scan on lieu_mots_gin  (cost=8.93..205.99 rows=78
width=11) (actual time=21.006..50.122 rows=41 loops=1)
Filter: (word % 'auberge'::text)
-  Bitmap Index Scan on idx_word_gin  (cost=0.00..8.91
rows=78 width=0) (actual time=20.839..20.839 rows=928 loops=1)
  Index Cond: (word % 'auberge'::text)
Total runtime: 50.534 ms
(7 rows)

--
Guillaume

Re: [PATCHES] First implementation of GIN for pg_trgm

2007-02-22 Thread Teodor Sigaev

I didn't see any improvement in terms of size of the index (14 MB for
642 738 rows in the index in both cases) or speed.
Our dictionary table contains 78367 words and its size is 3 MB.

Did I miss something?
Comparing integers is cheaper than strings. Although it hasn't significant 
matter for index scan.



The attached patch adds a RECHECK too. It seems to work correctly but
the RECHECK COND costs a lot of time :/.


:(

How long is average length of strings in table?
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] First implementation of GIN for pg_trgm

2007-02-22 Thread Guillaume Smet

On 2/22/07, Teodor Sigaev [EMAIL PROTECTED] wrote:

How long is average length of strings in table?


test=# SELECT MIN(length(word)), MAX(length(word)), AVG(length(word))
FROM lieu_mots_gin;
min | max |avg
-+-+
  1 |  38 | 7.4615463141373282
(1 row)

I don't see how to have a more precise similarity without having the
number of entries registered by the indexed value somewhere.

I think it can be interesting for other flavours of GIN usage. Is
there a way to add the number of entries of the considered indexed
item to the consistent prototype without adding too much overhead and
complexity?

--
Guillaume

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] First implementation of GIN for pg_trgm

2007-02-22 Thread Oleg Bartunov

On Thu, 22 Feb 2007, Guillaume Smet wrote:


On 2/22/07, Teodor Sigaev [EMAIL PROTECTED] wrote:

How long is average length of strings in table?


test=# SELECT MIN(length(word)), MAX(length(word)), AVG(length(word))
FROM lieu_mots_gin;
min | max |avg
-+-+
 1 |  38 | 7.4615463141373282
(1 row)

I don't see how to have a more precise similarity without having the
number of entries registered by the indexed value somewhere.

I think it can be interesting for other flavours of GIN usage. Is
there a way to add the number of entries of the considered indexed
item to the consistent prototype without adding too much overhead and
complexity?


You're right, it would be nice.
This is what we need for faster ranking in tsearch2, since currently we should
consult heap to get positional information, which slowdowns search.
We didn't investigate the possibility to keep additional information with
index, but keep in mind, that everything should works without index.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] First implementation of GIN for pg_trgm

2007-02-22 Thread Teodor Sigaev



I think it can be interesting for other flavours of GIN usage. Is
there a way to add the number of entries of the considered indexed
item to the consistent prototype without adding too much overhead and
complexity?

We are thinking about adding extra value, but it's still only thinking.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] First implementation of GIN for pg_trgm

2007-02-22 Thread Guillaume Smet

On 2/22/07, Oleg Bartunov oleg@sai.msu.su wrote:

You're right, it would be nice.
This is what we need for faster ranking in tsearch2, since currently we should
consult heap to get positional information, which slowdowns search.
We didn't investigate the possibility to keep additional information with
index, but keep in mind, that everything should works without index.


OK, thanks for your answer. If you do it one day or another, please
take into account the case of pg_trgm too as it will be far faster if
we can access to the number of entries of the indexed value in the
consistent function. As this information is available if you don't use
the index, it won't be a problem, I think.

Is there anything else I should fix/improve in this patch?

It could be nice to test it on other distribution of words and see if
it performs better than gist in other cases too. I'll try to test it
here on another table we need to index with tsearch2.

--
Guillaume

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[PATCHES] First implementation of GIN for pg_trgm

2007-02-21 Thread Guillaume Smet

Hi all,

Here is my preliminary work on porting pg_trgm to GIN. pg_trgm can be
a very good addition to tsearch2 to suggest spellings for mispelled
words as explained in the README.pg_trgm file and I'd like to use it
in this case. GIST implementation is a bit slow so I tried to port it
to use GIN.

The attached patch is the first working implementation. It's not final
but I would like some feedback on how to fix the remaining problems.


From a previous discussion with Teodor, it would be better to store an

int in the index instead of a text (it takes less space and is
faster). I couldn't find any example so if anyone has an advice to fix
that, it's welcome (mostly how to pack the trigram into an int instead
of a text).

The last problem is that similarity calculated in the GIN index is
higher than the one with GIST so I have to set the trgm_limit quite
high to have decent results (a limit of 0.8 instead of 0.3 seems to be
quite good).
AFAICS, it comes from the fact that I couldn't find any way to get the
length of the indexed trigram which is taken into account with GIST so
we're not exactly filtering the results in the same way.
Does anyone have an idea on how to fix this point?

Thanks for your attention.

--
Guillaume
Index: Makefile
===
RCS file: /projects/cvsroot/pgsql/contrib/pg_trgm/Makefile,v
retrieving revision 1.6
diff -c -r1.6 Makefile
*** Makefile9 Feb 2007 17:24:33 -   1.6
--- Makefile21 Feb 2007 23:49:37 -
***
*** 1,7 
  # $PostgreSQL: pgsql/contrib/pg_trgm/Makefile,v 1.6 2007/02/09 17:24:33 
petere Exp $
  
  MODULE_big = pg_trgm
! OBJS = trgm_op.o trgm_gist.o 
  
  DATA_built = pg_trgm.sql
  DATA = uninstall_pg_trgm.sql
--- 1,7 
  # $PostgreSQL: pgsql/contrib/pg_trgm/Makefile,v 1.6 2007/02/09 17:24:33 
petere Exp $
  
  MODULE_big = pg_trgm
! OBJS = trgm_op.o trgm_gist.o trgm_gin.o
  
  DATA_built = pg_trgm.sql
  DATA = uninstall_pg_trgm.sql
Index: uninstall_pg_trgm.sql
===
RCS file: /projects/cvsroot/pgsql/contrib/pg_trgm/uninstall_pg_trgm.sql,v
retrieving revision 1.2
diff -c -r1.2 uninstall_pg_trgm.sql
*** uninstall_pg_trgm.sql   13 Mar 2006 18:04:58 -  1.2
--- uninstall_pg_trgm.sql   21 Feb 2007 23:49:37 -
***
*** 20,25 
--- 20,33 
  
  DROP TYPE gtrgm CASCADE;
  
+ DROP OPERATOR CLASS gin_trgm_ops USING gin;
+ 
+ DROP FUNCTION gin_extract_trgm(text, internal);
+ 
+ DROP FUNCTION gin_extract_trgm(text, internal, internal);
+ 
+ DROP FUNCTION gin_trgm_consistent(internal, internal, text);
+ 
  DROP OPERATOR % (text, text);
  
  DROP FUNCTION similarity_op(text,text);
Index: pg_trgm.sql.in
===
RCS file: /projects/cvsroot/pgsql/contrib/pg_trgm/pg_trgm.sql.in,v
retrieving revision 1.2
diff -c -r1.2 pg_trgm.sql.in
*** pg_trgm.sql.in  27 Feb 2006 16:09:48 -  1.2
--- pg_trgm.sql.in  21 Feb 2007 23:49:37 -
***
*** 36,42 
  JOIN = contjoinsel
  );
  
! --gist key
  CREATE FUNCTION gtrgm_in(cstring)
  RETURNS gtrgm
  AS 'MODULE_PATHNAME'
--- 36,42 
  JOIN = contjoinsel
  );
  
! -- gist key
  CREATE FUNCTION gtrgm_in(cstring)
  RETURNS gtrgm
  AS 'MODULE_PATHNAME'
***
*** 53,59 
  OUTPUT = gtrgm_out
  );
  
! -- support functions
  CREATE FUNCTION gtrgm_consistent(gtrgm,internal,int4)
  RETURNS bool
  AS 'MODULE_PATHNAME'
--- 53,59 
  OUTPUT = gtrgm_out
  );
  
! -- support functions for gist
  CREATE FUNCTION gtrgm_consistent(gtrgm,internal,int4)
  RETURNS bool
  AS 'MODULE_PATHNAME'
***
*** 89,95 
  AS 'MODULE_PATHNAME'
  LANGUAGE C;
  
! -- create the operator class
  CREATE OPERATOR CLASS gist_trgm_ops
  FOR TYPE text USING gist
  AS
--- 89,95 
  AS 'MODULE_PATHNAME'
  LANGUAGE C;
  
! -- create the operator class for gist
  CREATE OPERATOR CLASS gist_trgm_ops
  FOR TYPE text USING gist
  AS
***
*** 103,107 
--- 103,133 
  FUNCTION7   gtrgm_same (gtrgm, gtrgm, internal),
  STORAGE gtrgm;
  
+ -- support functions for gin
+ CREATE FUNCTION gin_extract_trgm(text, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C;
+ 
+ CREATE FUNCTION gin_extract_trgm(text, internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C;
+ 
+ CREATE FUNCTION gin_trgm_consistent(internal, internal, text)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C;
+ 
+ -- create the operator class for gin
+ CREATE OPERATOR CLASS gin_trgm_ops
+ FOR TYPE text USING gin
+ AS
+ OPERATOR1   % (text, text),
+ FUNCTION1   bttextcmp (text, text),
+ FUNCTION2   gin_extract_trgm (text, internal),
+ FUNCTION3   gin_extract_trgm (text, internal, internal),
+ FUNCTION4