Re: [HACKERS] wildcard search support for pg_trgm

2011-02-01 Thread Alexander Korotkov
On Tue, Feb 1, 2011 at 5:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 AFAICT that would break on-disk compatibility of pg_trgm GIST indexes.
 I don't believe we have adequate evidence to justify doing that, and
 in any case it ought to be a separate patch rather than buried inside a
 mostly unrelated feature patch.

Ok. Actually, I don't think just increasement of SIGLENINT as a solution. I
beleive that we need to have it as index parameter. I'll try to provide more
of tests in order to motivate this.


With best regards,
Alexander Korotkov.


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-31 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= wulc...@wulczer.org writes:
 OK, now it works flawlessly as far as I can tell. Will mark it as Ready
 for Committer.

Applied with mostly-stylistic corrections, plus addition of
documentation and a minimal regression test.

I did *not* apply this bit:

 2) I found gist index not very useful with default SIGLENINT = 3. I've
 changed this value to 15 and I found gist index performs very good on
 dictionary. But on longer strings greater values of SIGLENINT may be
 required (probably even SIGLENINT  122 will give benefit in some cases in
 spite of TOAST).

AFAICT that would break on-disk compatibility of pg_trgm GIST indexes.
I don't believe we have adequate evidence to justify doing that, and
in any case it ought to be a separate patch rather than buried inside a
mostly unrelated feature patch.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-30 Thread Jan Urbański
On 29/01/11 13:07, Alexander Korotkov wrote:
 Hello!

Hi!

 
 New version of patch is in the attachment. Some comments was added in
 this version. Likely these comments need significant correction because
 of my english.

Ooh, ok, the comments now helped me understand what's exactly going on
in there.

I played with it a bit more and the idea of using trigrams to do LIKE
searches is quite clever.

Unfortunately, I think there's a problem with case insensitive queries:

create table test(t text);
insert into test values ('abcdef');
create index trgm_idx_gin on test using gin (t gin_trgm_ops);
set enable_seqscan to off; -- force index usage
select * from test where t ilike '%BCD%';
-- no results!
set enable_seqscan to on; -- do not use the index
select * from test where t ilike '%BCD%';
-- the row is returned

I saw that the code tries to handle ILIKE searches, but apparently it's
failing somewhere.

I'm sorry but I'm leaving on vacation for the next week and won't be
able to continue reviewing your patch, I'll unset myself as its
reviewer, and in the meantime I hope someone else will pick it up, as
the functionality seems very interesting.

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-30 Thread Alexander Korotkov
Hi!

On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org wrote:

 I saw that the code tries to handle ILIKE searches, but apparently it's
 failing somewhere.

It was just a typo. Corrected version attached.


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  OPERATOR1   % (text, text),
  OPERATOR2   - (text, text) FOR ORDER BY pg_catalog.float_ops,
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   gtrgm_consistent (internal, text, int, oid, internal),
  FUNCTION2   gtrgm_union (bytea, internal),
  FUNCTION3   gtrgm_compress (internal),
***
*** 129,140  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
--- 131,142 
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
***
*** 144,151  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
  AS
  OPERATOR1   % (text, text),
  FUNCTION1   btint4cmp (int4, int4),
  FUNCTION2   gin_extract_trgm (text, internal),
! FUNCTION3   gin_extract_trgm (text, internal, int2, internal, internal),
! FUNCTION4   gin_trgm_consistent (internal, int2, text, int4, internal, internal),
  STORAGE int4;
--- 146,155 
  FOR TYPE text USING gin
  AS
  OPERATOR1   % (text, text),
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   btint4cmp (int4, int4),
  FUNCTION2   gin_extract_trgm (text, internal),
! FUNCTION3   gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
! FUNCTION4   gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
  STORAGE int4;
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 13,24 
--- 13,32 
  #define LPADDING		2
  #define RPADDING		1
  #define KEEPONLYALNUM
+ /*
+  * IGNORECASE macro means that trigrams is case-insensetive. If this macro is
+  * disabled, then ~~* operator should be excluded from operator class, because
+  * we can't handle case-insensetive wildcard search with case-sensetive
+  * trigrams.
+  */
  #define IGNORECASE
  #define DIVUNION
  
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 61,67 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 97,107 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml(TRGM *trg1, TRGM *trg2);
+ booltrgm_contain(TRGM *trg1, TRGM *trg2);
+ 
+ #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */
+ #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%')  /* Wildcard meta-character */
  
  #endif /* __TRGM_H__ */
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 6,11 
--- 6,12 
  #include trgm.h
  
  #include access/gin.h
+ #include access/skey.h
  #include access/itup.h
  #include access/tuptoaster.h
  #include storage/bufpage.h
***
*** 16,21 
--- 17,25 
  PG_FUNCTION_INFO_V1(gin_extract_trgm);
  Datum		gin_extract_trgm(PG_FUNCTION_ARGS);
  
+ PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
+ Datum		

Re: [HACKERS] wildcard search support for pg_trgm

2011-01-30 Thread Jan Urbański
On 30/01/11 23:02, Alexander Korotkov wrote:
 Hi!
 
 On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org
 mailto:wulc...@wulczer.org wrote:
 
 I saw that the code tries to handle ILIKE searches, but apparently it's
 failing somewhere.
 
 It was just a typo. Corrected version attached.

I hoped as much :)

Will test again now.

Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-30 Thread Jan Urbański
On 30/01/11 23:04, Jan Urbański wrote:
 On 30/01/11 23:02, Alexander Korotkov wrote:
 Hi!

 On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org
 mailto:wulc...@wulczer.org wrote:

 I saw that the code tries to handle ILIKE searches, but apparently it's
 failing somewhere.

 It was just a typo. Corrected version attached.
 
 I hoped as much :)
 
 Will test again now.

OK, now it works flawlessly as far as I can tell. Will mark it as Ready
for Committer.

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-29 Thread Alexander Korotkov
Hello!

New version of patch is in the attachment. Some comments was added in this
version. Likely these comments need significant correction because of my
english.
Some notes abount gin interface functions. Extract query and extract value
functions was separated, because with wildcard search these funtions no
longer do the same. New arguments was added to sql description of gin
interface functions in order to make it confom to new gin interface. See
docs of development version:
http://developer.postgresql.org/pgdocs/postgres/gin-extensibility.html.


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  OPERATOR1   % (text, text),
  OPERATOR2   - (text, text) FOR ORDER BY pg_catalog.float_ops,
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   gtrgm_consistent (internal, text, int, oid, internal),
  FUNCTION2   gtrgm_union (bytea, internal),
  FUNCTION3   gtrgm_compress (internal),
***
*** 129,140  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
--- 131,142 
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
***
*** 144,151  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
  AS
  OPERATOR1   % (text, text),
  FUNCTION1   btint4cmp (int4, int4),
  FUNCTION2   gin_extract_trgm (text, internal),
! FUNCTION3   gin_extract_trgm (text, internal, int2, internal, internal),
! FUNCTION4   gin_trgm_consistent (internal, int2, text, int4, internal, internal),
  STORAGE int4;
--- 146,155 
  FOR TYPE text USING gin
  AS
  OPERATOR1   % (text, text),
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   btint4cmp (int4, int4),
  FUNCTION2   gin_extract_trgm (text, internal),
! FUNCTION3   gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
! FUNCTION4   gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
  STORAGE int4;
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 13,24 
--- 13,32 
  #define LPADDING		2
  #define RPADDING		1
  #define KEEPONLYALNUM
+ /*
+  * IGNORECASE macro means that trigrams is case-insensetive. If this macro is
+  * disabled, then ~~* operator should be excluded from operator class, because
+  * we can't handle case-insensetive wildcard search with case-sensetive
+  * trigrams.
+  */
  #define IGNORECASE
  #define DIVUNION
  
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 61,67 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 97,107 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml(TRGM *trg1, TRGM *trg2);
+ booltrgm_contain(TRGM *trg1, TRGM *trg2);
+ 
+ #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */
+ #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%')  /* Wildcard meta-character */
  
  #endif /* __TRGM_H__ */
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 6,11 
--- 6,12 
  #include trgm.h
  
  

Re: [HACKERS] wildcard search support for pg_trgm

2011-01-24 Thread Alexander Korotkov
Hi!

On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański wulc...@wulczer.org wrote:

 I see two issues with this patch. First of them is the resulting index
 size. I created a table with 5 copies of
 /usr/share/dict/american-english in it and a gin index on it, using
 gin_trgm_ops. The results were:

  * relation size: 18MB
  * index size: 109 MB

 while without the patch the GIN index was 43 MB. I'm not really sure
 *why* this happens, as it's not obvious from reading the patch what
 exactly is this extra data that gets stored in the index, making it more
 than double its size.

Do you sure that you did comparison correctly? The sequence of index
building and data insertion does matter. I tried to build gin index on  5
copies of /usr/share/dict/american-english with patch and got 43 MB index
size.


 That leads me to the second issue. The pg_trgm code is already woefully
 uncommented, and after spending quite some time reading it back and
 forth I have to admit that I don't really understand what the code does
 in the first place, and so I don't understand what does that patch
 change. I read all the changes in detail and I could't find any obvious
 mistakes like reading over array boundaries or dereferencing
 uninitialized pointers, but I can't tell if the patch is correct
 semantically. All test cases I threw at it work, though.

I'll try to write sufficient comment and send new revision of patch.


 This patch changes the names and signatures of some support functions
 for GIN, and I'm not sure how that affects binary compatibility and
 pg_upgrade. I tried to create an index with the vanilla source, and then
 recompile pg_trgm and reindex the table, but it still was not using the
 index. I think it's because it's missing entries in the catalogs about
 the index supporting the like strategy. How should this be handled?

This patch don't alters structure of index. It only adds strategies for
index scan. In order update this index one should recreate operator class
(it will require to drop index). It can be done by
sequential uninstall_pg_trgm.sql and pg_trgm.sql. After that new index can
be created and it will support like strategy. Although actually there is no
need of index recreation, I don't see easier way to do this.


With best regards,
Alexander Korotkov.


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-24 Thread Jesper Krogh

On 2011-01-24 16:34, Alexander Korotkov wrote:

Hi!

On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbańskiwulc...@wulczer.org  wrote:


I see two issues with this patch. First of them is the resulting index
size. I created a table with 5 copies of
/usr/share/dict/american-english in it and a gin index on it, using
gin_trgm_ops. The results were:

  * relation size: 18MB
  * index size: 109 MB

while without the patch the GIN index was 43 MB. I'm not really sure
*why* this happens, as it's not obvious from reading the patch what
exactly is this extra data that gets stored in the index, making it more
than double its size.


Do you sure that you did comparison correctly? The sequence of index
building and data insertion does matter. I tried to build gin index on  5
copies of /usr/share/dict/american-english with patch and got 43 MB index
size.



That leads me to the second issue. The pg_trgm code is already woefully
uncommented, and after spending quite some time reading it back and
forth I have to admit that I don't really understand what the code does
in the first place, and so I don't understand what does that patch
change. I read all the changes in detail and I could't find any obvious
mistakes like reading over array boundaries or dereferencing
uninitialized pointers, but I can't tell if the patch is correct
semantically. All test cases I threw at it work, though.


I'll try to write sufficient comment and send new revision of patch.


Would it be hard to make it support n-grams (e.g. making the length
configurable) instead of trigrams? I actually had the feeling that
penta-grams (pen-tuples or whatever they would be called) would
be better for my usecase (large substring-search in large documents ..
eg. 500 within 3.000.

Larger sizes.. lesser sensitivity = Faster lookup .. perhaps my logic 
is wrong?


Hm.. or will the knngist stuff help me here by selecting the best using
pentuples from the beginning?

The above comment is actually general to pg_trgm and not to the wildcard 
search

patch above.

Jesper
--
Jesper


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-24 Thread Tom Lane
Jesper Krogh jes...@krogh.cc writes:
 Would it be hard to make it support n-grams (e.g. making the length
 configurable) instead of trigrams?

That would be a complete rewrite with an incompatible on-disk index
representation, which seems a bit beyond the scope of this patch.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] wildcard search support for pg_trgm

2011-01-23 Thread Jan Urbański
Hi,

I tested the patch from
http://archives.postgresql.org/message-id/aanlktikvxx6_ajzb52ona7mbzijcpdqszomcd-3u1...@mail.gmail.com
which adds GIN and GIST index support for wildcard LIKE queries using
pg_trgm.

The patch is a context diff that applies cleanly. Regression test work
after applying it, but they only exercise the similarity() function, so
the new functionality is not covered by them.

The patch seems to work as advised, I tried a few searches and it does
indeed use the gin or gist index to implement '%foo%' searches. I tried
to do some tricky queries and it worked for all of them..

I see two issues with this patch. First of them is the resulting index
size. I created a table with 5 copies of
/usr/share/dict/american-english in it and a gin index on it, using
gin_trgm_ops. The results were:

 * relation size: 18MB
 * index size: 109 MB

while without the patch the GIN index was 43 MB. I'm not really sure
*why* this happens, as it's not obvious from reading the patch what
exactly is this extra data that gets stored in the index, making it more
than double its size.

That leads me to the second issue. The pg_trgm code is already woefully
uncommented, and after spending quite some time reading it back and
forth I have to admit that I don't really understand what the code does
in the first place, and so I don't understand what does that patch
change. I read all the changes in detail and I could't find any obvious
mistakes like reading over array boundaries or dereferencing
uninitialized pointers, but I can't tell if the patch is correct
semantically. All test cases I threw at it work, though.

I'm not sure if the committer with better knowledge of pg_trgm would be
able to do a better job than me. After a few days digging in that code I
simply give up.

This patch changes the names and signatures of some support functions
for GIN, and I'm not sure how that affects binary compatibility and
pg_upgrade. I tried to create an index with the vanilla source, and then
recompile pg_trgm and reindex the table, but it still was not using the
index. I think it's because it's missing entries in the catalogs about
the index supporting the like strategy. How should this be handled?

I'm going to mark the patch as Waiting on Author, because of the index
size issue (though it might be OK and expected that the index size will
grow so much, I just don't know). As for the comments, or lack of them,
I declary myself incompetent to thoroughly verify that the patch works.
I think it should have at least the added parts commented enough to
match the project's standard.

Sorry for taking so long to review this,

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wildcard search support for pg_trgm

2011-01-17 Thread Alexander Korotkov
Hi,

Here is updated version of this patch.


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  OPERATOR1   % (text, text),
  OPERATOR2   - (text, text) FOR ORDER BY pg_catalog.float_ops,
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   gtrgm_consistent (internal, text, int, oid, internal),
  FUNCTION2   gtrgm_union (bytea, internal),
  FUNCTION3   gtrgm_compress (internal),
***
*** 129,140  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
--- 131,142 
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
***
*** 144,151  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
  AS
  OPERATOR1   % (text, text),
  FUNCTION1   btint4cmp (int4, int4),
  FUNCTION2   gin_extract_trgm (text, internal),
! FUNCTION3   gin_extract_trgm (text, internal, int2, internal, internal),
! FUNCTION4   gin_trgm_consistent (internal, int2, text, int4, internal, internal),
  STORAGE int4;
--- 146,155 
  FOR TYPE text USING gin
  AS
  OPERATOR1   % (text, text),
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   btint4cmp (int4, int4),
  FUNCTION2   gin_extract_trgm (text, internal),
! FUNCTION3   gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
! FUNCTION4   gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
  STORAGE int4;
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 19,24 
--- 19,26 
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 55,61 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 91,101 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml(TRGM *trg1, TRGM *trg2);
+ booltrgm_contain(TRGM *trg1, TRGM *trg2);
+ 
+ #define ISESCAPECHAR(x) (*(x) == '\\')
+ #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%')
  
  #endif /* __TRGM_H__ */
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 6,11 
--- 6,12 
  #include trgm.h
  
  #include access/gin.h
+ #include access/skey.h
  #include access/itup.h
  #include access/tuptoaster.h
  #include storage/bufpage.h
***
*** 16,21 
--- 17,25 
  PG_FUNCTION_INFO_V1(gin_extract_trgm);
  Datum		gin_extract_trgm(PG_FUNCTION_ARGS);
  
+ PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
+ Datum		gin_extract_query_trgm(PG_FUNCTION_ARGS);
+ 
  PG_FUNCTION_INFO_V1(gin_trgm_consistent);
  Datum		gin_trgm_consistent(PG_FUNCTION_ARGS);
  
***
*** 58,90  gin_extract_trgm(PG_FUNCTION_ARGS)
  }
  
  Datum
  gin_trgm_consistent(PG_FUNCTION_ARGS)
  {
  	bool	   *check = (bool *) PG_GETARG_POINTER(0);
! 	/* StrategyNumber strategy = PG_GETARG_UINT16(1); */
  	/* text*query = PG_GETARG_TEXT_P(2); */
! 	int32		nkeys = PG_GETARG_INT32(3);
! 	/* Pointer*extra_data = (Pointer *) PG_GETARG_POINTER(4); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(5);
  	bool		res = FALSE;
  	int32		i,
! 

Re: [HACKERS] Wildcard search support for pg_trgm

2011-01-14 Thread Jan Urbański
On 08/01/11 23:37, Alexander Korotkov wrote:
 I updated my patch to make it use full index scan in GIN index which is
 possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can
 be extracted from, invokes full index scan, which is slow but correct.

Hi,

unfortunately, this change made the patch not apply:
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=be0c3ea2d30ba225f0249ae88d6b0bdf3b753162

I'm getting rejects in trgm_gin.c. Could you update the patch please?

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wildcard search support for pg_trgm

2011-01-08 Thread Alexander Korotkov
I updated my patch to make it use full index scan in GIN index which is
possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can
be extracted from, invokes full index scan, which is slow but correct.

test=# explain (analyze, buffers) select * from words where word ilike
'%in%';
 QUERY PLAN



 Seq Scan on words  (cost=0.00..1703.11 rows=15930 width=9) (actual
time=0.333..225.817 rows=16558 loops=1)
   Filter: (word ~~* '%in%'::text)
   Buffers: shared read=471
 Total runtime: 248.207 ms
(4 rows)

test=# set enable_seqscan = off;
SET
test=# explain (analyze, buffers) select * from words where word ilike
'%in%';
   QUERY PLAN



 Bitmap Heap Scan on words  (cost=2287.46..2957.59 rows=15930 width=9)
(actual time=122.239..331.993
 rows=16558 loops=1)
   Recheck Cond: (word ~~* '%in%'::text)
   Buffers: shared hit=472 read=1185
   -  Bitmap Index Scan on trgm_idx  (cost=0.00..2283.48 rows=15930
width=0) (actual time=122.022..122.022 rows=98569 loops=1)
 Index Cond: (word ~~* '%in%'::text)
 Buffers: shared hit=1 read=1185
 Total runtime: 354.409 ms
(7 rows)

As an alternative solution I can propose to extract null item from every
string and ivoke scan on that item instead of full index scan. It requires
to store additional item per each string but it makes full scan fast.

Also I found a segfault when execute the query above and
switch enable_seqscan few times on line *searchMode =
GIN_SEARCH_MODE_ALL;. Is it a bug in GIN or I'm missing something?

Here goes backtrace from gdb:
#0  0xb4ead070 in gin_extract_query_trgm (fcinfo=0xbfcd8da8) at
trgm_gin.c:112
#1  0x08323a84 in OidFunctionCall5 (functionId=32802, arg1=161269768,
arg2=3217920208, arg3=4,
arg4=3217920204, arg5=3217920200) at fmgr.c:1687
#2  0x082c5654 in gincostestimate (fcinfo=0xbfcd9148) at selfuncs.c:6466
#3  0x083235d8 in OidFunctionCall9 (functionId=2741, arg1=161270176,
arg2=161271296,
arg3=161824624, arg4=0, arg5=0, arg6=3217921064, arg7=3217921056,
arg8=3217921048,
arg9=3217921040) at fmgr.c:1840
#4  0x081f3397 in cost_index (path=0x9a55050, root=0x99cc9a0,
index=0x99cce00,
indexQuals=0x9a53f70, indexOrderBys=0x0, outer_rel=0x0) at
costsize.c:268
#5  0x08216b66 in create_index_path (root=0x99cc9a0, index=0x99cce00,
clause_groups=0x9a53f88,
indexorderbys=0x0, pathkeys=0x0, indexscandir=NoMovementScanDirection,
outer_rel=0x0)
at pathnode.c:511
#6  0x081f7ef5 in find_usable_indexes (root=value optimized out,
rel=value optimized out,
clauses=value optimized out, outer_clauses=0x0, istoplevel=1 '\001',
outer_rel=0x0,
saop_control=SAOP_FORBID, scantype=ST_ANYSCAN) at indxpath.c:422
#7  0x081f8e38 in create_index_paths (root=0x99cc9a0, rel=0x99ccc30) at
indxpath.c:176
#8  0x081eec22 in set_plain_rel_pathlist (root=value optimized out,
rel=value optimized out,
rti=value optimized out, rte=0x99cc650) at allpaths.c:262
#9  set_rel_pathlist (root=value optimized out, rel=value optimized
out,
rti=value optimized out, rte=0x99cc650) at allpaths.c:202
#10 0x081efa55 in set_base_rel_pathlists (root=0x99cc9a0,
joinlist=0x99ccde8) at allpaths.c:158
#11 make_one_rel (root=0x99cc9a0, joinlist=0x99ccde8) at allpaths.c:94
#12 0x08203ef7 in query_planner (root=0x99cc9a0, tlist=0x99ccb00,
tuple_fraction=0,
limit_tuples=-1, cheapest_path=0xbfcd98cc, sorted_path=0xbfcd98c8,
num_groups=0xbfcd98c0)
at planmain.c:271
#13 0x08205b86 in grouping_planner (root=0x99cc9a0, tuple_fraction=0) at
planner.c:1182
#14 0x08207609 in subquery_planner (glob=0x99cc910, parse=0x99cc5a0,
parent_root=0x0,
hasRecursion=0 '\000', tuple_fraction=0, subroot=0xbfcd9a7c) at
planner.c:536
#15 0x08207ca6 in standard_planner (parse=0x99cc5a0, cursorOptions=0,
boundParams=0x0)
at planner.c:201
#16 0x0825db11 in pg_plan_query (querytree=0x99cc5a0, cursorOptions=0,
boundParams=0x0)
at postgres.c:764
#17 0x0815a824 in ExplainOneQuery (stmt=0x9a258e0,
queryString=0x9a24c60 explain (analyze, buffers) select * from words
where word ilike '%in%';,---Type return to continue, or q return to
quit---
 params=0x0, dest=0x9a32330) at explain.c:300
#18 ExplainQuery (stmt=0x9a258e0,
queryString=0x9a24c60 explain (analyze, buffers) select * from words
where word ilike '%in%';, params=0x0, dest=0x9a32330) at explain.c:209
#19 0x08261266 in PortalRunUtility (portal=0x9a4d6a8,
utilityStmt=0x9a258e0,
isTopLevel=value optimized out, dest=0x9a32330,
completionTag=0xbfcd9bcc ) at pquery.c:1191
#20 0x082622a4 in FillPortalStore (portal=0x9a4d6a8, isTopLevel=32 ' ') at
pquery.c:1065
#21 0x0826281a in PortalRun (portal=0x9a4d6a8, count=2147483647,
isTopLevel=-56 '\310',

Re: [HACKERS] Wildcard search support for pg_trgm

2010-12-12 Thread Alexander Korotkov
I found another problem. GIN index suffers from GIN indexes do not support
whole-index scans when no trigram can be extracted from pattern.


With best regards,
Alexander Korotkov.


Re: [HACKERS] Wildcard search support for pg_trgm

2010-12-12 Thread Dimitri Fontaine
Alexander Korotkov aekorot...@gmail.com writes:
 Here is first version of patch, which enable index support of wildcard
 search in pg_trgm contrib module.

How different (and better) is it from wildspeed?

  http://www.sai.msu.su/~megera/wiki/wildspeed

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wildcard search support for pg_trgm

2010-12-12 Thread Alexander Korotkov
On Mon, Dec 13, 2010 at 12:14 AM, Dimitri Fontaine
dimi...@2ndquadrant.frwrote:

 How different (and better) is it from wildspeed?


The general advantage is possibility of usage wildcard search and trigram
similarity search using the same index.
I expect that GIN trigram index is slightly less space demanding, but
slightly slower on search than wildspeed. Also, I expect GiST trigram index
to be slower on search, but faster on updates. While I didn't check these
assumptions in details.
I've lack of test datasets for sufficient testing, and I would like to ask
community to help me with it. Because testing on dictionaries is good,
but obviously
not enough.


With best regards,
Alexander Korotkov.


[HACKERS] Wildcard search support for pg_trgm

2010-12-11 Thread Alexander Korotkov
Hackers,

Here is first version of patch, which enable index support of wildcard
search in pg_trgm contrib module. The idea of the patch is to extract from
wildcard trigrams which should occurs in wildcard matching string. For
example, for '%sector%' wildcard such trigrams would be: 'sec', 'ect',
'tor'.

create table words (word text);
copy words from '/usr/share/dict/american-english';

test=# explain analyze select * from words where word ilike '%independ%';
  QUERY PLAN

--
 Seq Scan on words  (cost=0.00..1703.11 rows=10 width=9) (actual
time=18.818..174.146 rows=7 loops=1)
   Filter: (word ~~* '%independ%'::text)
 Total runtime: 174.200 ms
(3 rows)

CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
QUERY PLAN


--
 Bitmap Heap Scan on words  (cost=4.36..40.11 rows=10 width=9) (actual
time=2.445..2.529 rows=7 loops=1)
   Recheck Cond: (word ~~* '%independ%'::text)
   -  Bitmap Index Scan on trgm_idx  (cost=0.00..4.35 rows=10 width=0)
(actual time=2.406..2.406 rows=7 loops=1)
 Index Cond: (word ~~* '%independ%'::text)
 Total runtime: 2.612 ms
(5 rows)

CREATE INDEX trgm_idx ON words USING gin (word gin_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
QUERY PLAN


---
 Bitmap Heap Scan on words  (cost=76.08..111.83 rows=10 width=9) (actual
time=2.675..2.755 rows=7 loops=1)
   Recheck Cond: (word ~~* '%independ%'::text)
   -  Bitmap Index Scan on trgm_idx  (cost=0.00..76.07 rows=10 width=0)
(actual time=2.642..2.642 rows=7 loops=1)
 Index Cond: (word ~~* '%independ%'::text)
 Total runtime: 2.839 ms
(5 rows)

I've encountered with following problems:
1) Indexing support for ilike is possible only with case-insensetive
wildcards, e.g. when IGNORECASE macro is enabled. But I can't use this macro
in pg_trgm.sql.in, where list of operators is defined. Probably, is it
enuogh to put comment near IGNORECASE, which tells that if one disable this
macro he should also remove oparators from pg_trgm.sql.in?
2) I found gist index not very useful with default SIGLENINT = 3. I've
changed this value to 15 and I found gist index performs very good on
dictionary. But on longer strings greater values of SIGLENINT may be
required (probably even SIGLENINT  122 will give benefit in some cases in
spite of TOAST).


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  OPERATOR1   % (text, text),
  OPERATOR2   - (text, text) FOR ORDER BY pg_catalog.float_ops,
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   gtrgm_consistent (internal, text, int, oid, internal),
  FUNCTION2   gtrgm_union (bytea, internal),
  FUNCTION3   gtrgm_compress (internal),
***
*** 144,149  CREATE OPERATOR CLASS gin_trgm_ops
--- 146,153 
  FOR TYPE text USING gin
  AS
  OPERATOR1   % (text, text),
+ OPERATOR3   ~~ (text, text),
+ OPERATOR4   ~~* (text, text),
  FUNCTION1   btint4cmp (int4, int4),
  FUNCTION2   gin_extract_trgm (text, internal),
  FUNCTION3   gin_extract_trgm (text, internal, int2, internal, internal),
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 19,24 
--- 19,26 
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 55,61 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* 122 = key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 91,101 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4