Re: [HACKERS] wildcard search support for pg_trgm
On Tue, Feb 1, 2011 at 5:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: AFAICT that would break on-disk compatibility of pg_trgm GIST indexes. I don't believe we have adequate evidence to justify doing that, and in any case it ought to be a separate patch rather than buried inside a mostly unrelated feature patch. Ok. Actually, I don't think just increasement of SIGLENINT as a solution. I beleive that we need to have it as index parameter. I'll try to provide more of tests in order to motivate this. With best regards, Alexander Korotkov.
Re: [HACKERS] wildcard search support for pg_trgm
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= wulc...@wulczer.org writes: OK, now it works flawlessly as far as I can tell. Will mark it as Ready for Committer. Applied with mostly-stylistic corrections, plus addition of documentation and a minimal regression test. I did *not* apply this bit: 2) I found gist index not very useful with default SIGLENINT = 3. I've changed this value to 15 and I found gist index performs very good on dictionary. But on longer strings greater values of SIGLENINT may be required (probably even SIGLENINT 122 will give benefit in some cases in spite of TOAST). AFAICT that would break on-disk compatibility of pg_trgm GIST indexes. I don't believe we have adequate evidence to justify doing that, and in any case it ought to be a separate patch rather than buried inside a mostly unrelated feature patch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wildcard search support for pg_trgm
On 29/01/11 13:07, Alexander Korotkov wrote: Hello! Hi! New version of patch is in the attachment. Some comments was added in this version. Likely these comments need significant correction because of my english. Ooh, ok, the comments now helped me understand what's exactly going on in there. I played with it a bit more and the idea of using trigrams to do LIKE searches is quite clever. Unfortunately, I think there's a problem with case insensitive queries: create table test(t text); insert into test values ('abcdef'); create index trgm_idx_gin on test using gin (t gin_trgm_ops); set enable_seqscan to off; -- force index usage select * from test where t ilike '%BCD%'; -- no results! set enable_seqscan to on; -- do not use the index select * from test where t ilike '%BCD%'; -- the row is returned I saw that the code tries to handle ILIKE searches, but apparently it's failing somewhere. I'm sorry but I'm leaving on vacation for the next week and won't be able to continue reviewing your patch, I'll unset myself as its reviewer, and in the meantime I hope someone else will pick it up, as the functionality seems very interesting. Cheers, Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wildcard search support for pg_trgm
Hi! On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org wrote: I saw that the code tries to handle ILIKE searches, but apparently it's failing somewhere. It was just a typo. Corrected version attached. With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR1 % (text, text), OPERATOR2 - (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION2 gtrgm_union (bytea, internal), FUNCTION3 gtrgm_compress (internal), *** *** 129,140 RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; --- 131,142 AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; *** *** 144,151 CREATE OPERATOR CLASS gin_trgm_ops FOR TYPE text USING gin AS OPERATOR1 % (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal), STORAGE int4; --- 146,155 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal), STORAGE int4; *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 13,24 --- 13,32 #define LPADDING 2 #define RPADDING 1 #define KEEPONLYALNUM + /* + * IGNORECASE macro means that trigrams is case-insensetive. If this macro is + * disabled, then ~~* operator should be excluded from operator class, because + * we can't handle case-insensetive wildcard search with case-sensetive + * trigrams. + */ #define IGNORECASE #define DIVUNION /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 61,67 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 97,107 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml(TRGM *trg1, TRGM *trg2); + booltrgm_contain(TRGM *trg1, TRGM *trg2); + + #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */ + #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') /* Wildcard meta-character */ #endif /* __TRGM_H__ */ *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 6,11 --- 6,12 #include trgm.h #include access/gin.h + #include access/skey.h #include access/itup.h #include access/tuptoaster.h #include storage/bufpage.h *** *** 16,21 --- 17,25 PG_FUNCTION_INFO_V1(gin_extract_trgm); Datum gin_extract_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_extract_query_trgm); + Datum
Re: [HACKERS] wildcard search support for pg_trgm
On 30/01/11 23:02, Alexander Korotkov wrote: Hi! On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org mailto:wulc...@wulczer.org wrote: I saw that the code tries to handle ILIKE searches, but apparently it's failing somewhere. It was just a typo. Corrected version attached. I hoped as much :) Will test again now. Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wildcard search support for pg_trgm
On 30/01/11 23:04, Jan Urbański wrote: On 30/01/11 23:02, Alexander Korotkov wrote: Hi! On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org mailto:wulc...@wulczer.org wrote: I saw that the code tries to handle ILIKE searches, but apparently it's failing somewhere. It was just a typo. Corrected version attached. I hoped as much :) Will test again now. OK, now it works flawlessly as far as I can tell. Will mark it as Ready for Committer. Cheers, Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wildcard search support for pg_trgm
Hello! New version of patch is in the attachment. Some comments was added in this version. Likely these comments need significant correction because of my english. Some notes abount gin interface functions. Extract query and extract value functions was separated, because with wildcard search these funtions no longer do the same. New arguments was added to sql description of gin interface functions in order to make it confom to new gin interface. See docs of development version: http://developer.postgresql.org/pgdocs/postgres/gin-extensibility.html. With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR1 % (text, text), OPERATOR2 - (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION2 gtrgm_union (bytea, internal), FUNCTION3 gtrgm_compress (internal), *** *** 129,140 RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; --- 131,142 AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; *** *** 144,151 CREATE OPERATOR CLASS gin_trgm_ops FOR TYPE text USING gin AS OPERATOR1 % (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal), STORAGE int4; --- 146,155 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal), STORAGE int4; *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 13,24 --- 13,32 #define LPADDING 2 #define RPADDING 1 #define KEEPONLYALNUM + /* + * IGNORECASE macro means that trigrams is case-insensetive. If this macro is + * disabled, then ~~* operator should be excluded from operator class, because + * we can't handle case-insensetive wildcard search with case-sensetive + * trigrams. + */ #define IGNORECASE #define DIVUNION /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 61,67 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 97,107 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml(TRGM *trg1, TRGM *trg2); + booltrgm_contain(TRGM *trg1, TRGM *trg2); + + #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */ + #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') /* Wildcard meta-character */ #endif /* __TRGM_H__ */ *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 6,11 --- 6,12 #include trgm.h
Re: [HACKERS] wildcard search support for pg_trgm
Hi! On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański wulc...@wulczer.org wrote: I see two issues with this patch. First of them is the resulting index size. I created a table with 5 copies of /usr/share/dict/american-english in it and a gin index on it, using gin_trgm_ops. The results were: * relation size: 18MB * index size: 109 MB while without the patch the GIN index was 43 MB. I'm not really sure *why* this happens, as it's not obvious from reading the patch what exactly is this extra data that gets stored in the index, making it more than double its size. Do you sure that you did comparison correctly? The sequence of index building and data insertion does matter. I tried to build gin index on 5 copies of /usr/share/dict/american-english with patch and got 43 MB index size. That leads me to the second issue. The pg_trgm code is already woefully uncommented, and after spending quite some time reading it back and forth I have to admit that I don't really understand what the code does in the first place, and so I don't understand what does that patch change. I read all the changes in detail and I could't find any obvious mistakes like reading over array boundaries or dereferencing uninitialized pointers, but I can't tell if the patch is correct semantically. All test cases I threw at it work, though. I'll try to write sufficient comment and send new revision of patch. This patch changes the names and signatures of some support functions for GIN, and I'm not sure how that affects binary compatibility and pg_upgrade. I tried to create an index with the vanilla source, and then recompile pg_trgm and reindex the table, but it still was not using the index. I think it's because it's missing entries in the catalogs about the index supporting the like strategy. How should this be handled? This patch don't alters structure of index. It only adds strategies for index scan. In order update this index one should recreate operator class (it will require to drop index). It can be done by sequential uninstall_pg_trgm.sql and pg_trgm.sql. After that new index can be created and it will support like strategy. Although actually there is no need of index recreation, I don't see easier way to do this. With best regards, Alexander Korotkov.
Re: [HACKERS] wildcard search support for pg_trgm
On 2011-01-24 16:34, Alexander Korotkov wrote: Hi! On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbańskiwulc...@wulczer.org wrote: I see two issues with this patch. First of them is the resulting index size. I created a table with 5 copies of /usr/share/dict/american-english in it and a gin index on it, using gin_trgm_ops. The results were: * relation size: 18MB * index size: 109 MB while without the patch the GIN index was 43 MB. I'm not really sure *why* this happens, as it's not obvious from reading the patch what exactly is this extra data that gets stored in the index, making it more than double its size. Do you sure that you did comparison correctly? The sequence of index building and data insertion does matter. I tried to build gin index on 5 copies of /usr/share/dict/american-english with patch and got 43 MB index size. That leads me to the second issue. The pg_trgm code is already woefully uncommented, and after spending quite some time reading it back and forth I have to admit that I don't really understand what the code does in the first place, and so I don't understand what does that patch change. I read all the changes in detail and I could't find any obvious mistakes like reading over array boundaries or dereferencing uninitialized pointers, but I can't tell if the patch is correct semantically. All test cases I threw at it work, though. I'll try to write sufficient comment and send new revision of patch. Would it be hard to make it support n-grams (e.g. making the length configurable) instead of trigrams? I actually had the feeling that penta-grams (pen-tuples or whatever they would be called) would be better for my usecase (large substring-search in large documents .. eg. 500 within 3.000. Larger sizes.. lesser sensitivity = Faster lookup .. perhaps my logic is wrong? Hm.. or will the knngist stuff help me here by selecting the best using pentuples from the beginning? The above comment is actually general to pg_trgm and not to the wildcard search patch above. Jesper -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wildcard search support for pg_trgm
Jesper Krogh jes...@krogh.cc writes: Would it be hard to make it support n-grams (e.g. making the length configurable) instead of trigrams? That would be a complete rewrite with an incompatible on-disk index representation, which seems a bit beyond the scope of this patch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] wildcard search support for pg_trgm
Hi, I tested the patch from http://archives.postgresql.org/message-id/aanlktikvxx6_ajzb52ona7mbzijcpdqszomcd-3u1...@mail.gmail.com which adds GIN and GIST index support for wildcard LIKE queries using pg_trgm. The patch is a context diff that applies cleanly. Regression test work after applying it, but they only exercise the similarity() function, so the new functionality is not covered by them. The patch seems to work as advised, I tried a few searches and it does indeed use the gin or gist index to implement '%foo%' searches. I tried to do some tricky queries and it worked for all of them.. I see two issues with this patch. First of them is the resulting index size. I created a table with 5 copies of /usr/share/dict/american-english in it and a gin index on it, using gin_trgm_ops. The results were: * relation size: 18MB * index size: 109 MB while without the patch the GIN index was 43 MB. I'm not really sure *why* this happens, as it's not obvious from reading the patch what exactly is this extra data that gets stored in the index, making it more than double its size. That leads me to the second issue. The pg_trgm code is already woefully uncommented, and after spending quite some time reading it back and forth I have to admit that I don't really understand what the code does in the first place, and so I don't understand what does that patch change. I read all the changes in detail and I could't find any obvious mistakes like reading over array boundaries or dereferencing uninitialized pointers, but I can't tell if the patch is correct semantically. All test cases I threw at it work, though. I'm not sure if the committer with better knowledge of pg_trgm would be able to do a better job than me. After a few days digging in that code I simply give up. This patch changes the names and signatures of some support functions for GIN, and I'm not sure how that affects binary compatibility and pg_upgrade. I tried to create an index with the vanilla source, and then recompile pg_trgm and reindex the table, but it still was not using the index. I think it's because it's missing entries in the catalogs about the index supporting the like strategy. How should this be handled? I'm going to mark the patch as Waiting on Author, because of the index size issue (though it might be OK and expected that the index size will grow so much, I just don't know). As for the comments, or lack of them, I declary myself incompetent to thoroughly verify that the patch works. I think it should have at least the added parts commented enough to match the project's standard. Sorry for taking so long to review this, Cheers, Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wildcard search support for pg_trgm
Hi, Here is updated version of this patch. With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR1 % (text, text), OPERATOR2 - (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION2 gtrgm_union (bytea, internal), FUNCTION3 gtrgm_compress (internal), *** *** 129,140 RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; --- 131,142 AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; *** *** 144,151 CREATE OPERATOR CLASS gin_trgm_ops FOR TYPE text USING gin AS OPERATOR1 % (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal), STORAGE int4; --- 146,155 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal), STORAGE int4; *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 19,24 --- 19,26 /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 55,61 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 91,101 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml(TRGM *trg1, TRGM *trg2); + booltrgm_contain(TRGM *trg1, TRGM *trg2); + + #define ISESCAPECHAR(x) (*(x) == '\\') + #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') #endif /* __TRGM_H__ */ *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 6,11 --- 6,12 #include trgm.h #include access/gin.h + #include access/skey.h #include access/itup.h #include access/tuptoaster.h #include storage/bufpage.h *** *** 16,21 --- 17,25 PG_FUNCTION_INFO_V1(gin_extract_trgm); Datum gin_extract_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_extract_query_trgm); + Datum gin_extract_query_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_trgm_consistent); Datum gin_trgm_consistent(PG_FUNCTION_ARGS); *** *** 58,90 gin_extract_trgm(PG_FUNCTION_ARGS) } Datum gin_trgm_consistent(PG_FUNCTION_ARGS) { bool *check = (bool *) PG_GETARG_POINTER(0); ! /* StrategyNumber strategy = PG_GETARG_UINT16(1); */ /* text*query = PG_GETARG_TEXT_P(2); */ ! int32 nkeys = PG_GETARG_INT32(3); ! /* Pointer*extra_data = (Pointer *) PG_GETARG_POINTER(4); */ bool *recheck = (bool *) PG_GETARG_POINTER(5); bool res = FALSE; int32 i, !
Re: [HACKERS] Wildcard search support for pg_trgm
On 08/01/11 23:37, Alexander Korotkov wrote: I updated my patch to make it use full index scan in GIN index which is possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can be extracted from, invokes full index scan, which is slow but correct. Hi, unfortunately, this change made the patch not apply: http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=be0c3ea2d30ba225f0249ae88d6b0bdf3b753162 I'm getting rejects in trgm_gin.c. Could you update the patch please? Cheers, Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wildcard search support for pg_trgm
I updated my patch to make it use full index scan in GIN index which is possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can be extracted from, invokes full index scan, which is slow but correct. test=# explain (analyze, buffers) select * from words where word ilike '%in%'; QUERY PLAN Seq Scan on words (cost=0.00..1703.11 rows=15930 width=9) (actual time=0.333..225.817 rows=16558 loops=1) Filter: (word ~~* '%in%'::text) Buffers: shared read=471 Total runtime: 248.207 ms (4 rows) test=# set enable_seqscan = off; SET test=# explain (analyze, buffers) select * from words where word ilike '%in%'; QUERY PLAN Bitmap Heap Scan on words (cost=2287.46..2957.59 rows=15930 width=9) (actual time=122.239..331.993 rows=16558 loops=1) Recheck Cond: (word ~~* '%in%'::text) Buffers: shared hit=472 read=1185 - Bitmap Index Scan on trgm_idx (cost=0.00..2283.48 rows=15930 width=0) (actual time=122.022..122.022 rows=98569 loops=1) Index Cond: (word ~~* '%in%'::text) Buffers: shared hit=1 read=1185 Total runtime: 354.409 ms (7 rows) As an alternative solution I can propose to extract null item from every string and ivoke scan on that item instead of full index scan. It requires to store additional item per each string but it makes full scan fast. Also I found a segfault when execute the query above and switch enable_seqscan few times on line *searchMode = GIN_SEARCH_MODE_ALL;. Is it a bug in GIN or I'm missing something? Here goes backtrace from gdb: #0 0xb4ead070 in gin_extract_query_trgm (fcinfo=0xbfcd8da8) at trgm_gin.c:112 #1 0x08323a84 in OidFunctionCall5 (functionId=32802, arg1=161269768, arg2=3217920208, arg3=4, arg4=3217920204, arg5=3217920200) at fmgr.c:1687 #2 0x082c5654 in gincostestimate (fcinfo=0xbfcd9148) at selfuncs.c:6466 #3 0x083235d8 in OidFunctionCall9 (functionId=2741, arg1=161270176, arg2=161271296, arg3=161824624, arg4=0, arg5=0, arg6=3217921064, arg7=3217921056, arg8=3217921048, arg9=3217921040) at fmgr.c:1840 #4 0x081f3397 in cost_index (path=0x9a55050, root=0x99cc9a0, index=0x99cce00, indexQuals=0x9a53f70, indexOrderBys=0x0, outer_rel=0x0) at costsize.c:268 #5 0x08216b66 in create_index_path (root=0x99cc9a0, index=0x99cce00, clause_groups=0x9a53f88, indexorderbys=0x0, pathkeys=0x0, indexscandir=NoMovementScanDirection, outer_rel=0x0) at pathnode.c:511 #6 0x081f7ef5 in find_usable_indexes (root=value optimized out, rel=value optimized out, clauses=value optimized out, outer_clauses=0x0, istoplevel=1 '\001', outer_rel=0x0, saop_control=SAOP_FORBID, scantype=ST_ANYSCAN) at indxpath.c:422 #7 0x081f8e38 in create_index_paths (root=0x99cc9a0, rel=0x99ccc30) at indxpath.c:176 #8 0x081eec22 in set_plain_rel_pathlist (root=value optimized out, rel=value optimized out, rti=value optimized out, rte=0x99cc650) at allpaths.c:262 #9 set_rel_pathlist (root=value optimized out, rel=value optimized out, rti=value optimized out, rte=0x99cc650) at allpaths.c:202 #10 0x081efa55 in set_base_rel_pathlists (root=0x99cc9a0, joinlist=0x99ccde8) at allpaths.c:158 #11 make_one_rel (root=0x99cc9a0, joinlist=0x99ccde8) at allpaths.c:94 #12 0x08203ef7 in query_planner (root=0x99cc9a0, tlist=0x99ccb00, tuple_fraction=0, limit_tuples=-1, cheapest_path=0xbfcd98cc, sorted_path=0xbfcd98c8, num_groups=0xbfcd98c0) at planmain.c:271 #13 0x08205b86 in grouping_planner (root=0x99cc9a0, tuple_fraction=0) at planner.c:1182 #14 0x08207609 in subquery_planner (glob=0x99cc910, parse=0x99cc5a0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0, subroot=0xbfcd9a7c) at planner.c:536 #15 0x08207ca6 in standard_planner (parse=0x99cc5a0, cursorOptions=0, boundParams=0x0) at planner.c:201 #16 0x0825db11 in pg_plan_query (querytree=0x99cc5a0, cursorOptions=0, boundParams=0x0) at postgres.c:764 #17 0x0815a824 in ExplainOneQuery (stmt=0x9a258e0, queryString=0x9a24c60 explain (analyze, buffers) select * from words where word ilike '%in%';,---Type return to continue, or q return to quit--- params=0x0, dest=0x9a32330) at explain.c:300 #18 ExplainQuery (stmt=0x9a258e0, queryString=0x9a24c60 explain (analyze, buffers) select * from words where word ilike '%in%';, params=0x0, dest=0x9a32330) at explain.c:209 #19 0x08261266 in PortalRunUtility (portal=0x9a4d6a8, utilityStmt=0x9a258e0, isTopLevel=value optimized out, dest=0x9a32330, completionTag=0xbfcd9bcc ) at pquery.c:1191 #20 0x082622a4 in FillPortalStore (portal=0x9a4d6a8, isTopLevel=32 ' ') at pquery.c:1065 #21 0x0826281a in PortalRun (portal=0x9a4d6a8, count=2147483647, isTopLevel=-56 '\310',
Re: [HACKERS] Wildcard search support for pg_trgm
I found another problem. GIN index suffers from GIN indexes do not support whole-index scans when no trigram can be extracted from pattern. With best regards, Alexander Korotkov.
Re: [HACKERS] Wildcard search support for pg_trgm
Alexander Korotkov aekorot...@gmail.com writes: Here is first version of patch, which enable index support of wildcard search in pg_trgm contrib module. How different (and better) is it from wildspeed? http://www.sai.msu.su/~megera/wiki/wildspeed Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wildcard search support for pg_trgm
On Mon, Dec 13, 2010 at 12:14 AM, Dimitri Fontaine dimi...@2ndquadrant.frwrote: How different (and better) is it from wildspeed? The general advantage is possibility of usage wildcard search and trigram similarity search using the same index. I expect that GIN trigram index is slightly less space demanding, but slightly slower on search than wildspeed. Also, I expect GiST trigram index to be slower on search, but faster on updates. While I didn't check these assumptions in details. I've lack of test datasets for sufficient testing, and I would like to ask community to help me with it. Because testing on dictionaries is good, but obviously not enough. With best regards, Alexander Korotkov.
[HACKERS] Wildcard search support for pg_trgm
Hackers, Here is first version of patch, which enable index support of wildcard search in pg_trgm contrib module. The idea of the patch is to extract from wildcard trigrams which should occurs in wildcard matching string. For example, for '%sector%' wildcard such trigrams would be: 'sec', 'ect', 'tor'. create table words (word text); copy words from '/usr/share/dict/american-english'; test=# explain analyze select * from words where word ilike '%independ%'; QUERY PLAN -- Seq Scan on words (cost=0.00..1703.11 rows=10 width=9) (actual time=18.818..174.146 rows=7 loops=1) Filter: (word ~~* '%independ%'::text) Total runtime: 174.200 ms (3 rows) CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops); test=# explain analyze select * from words where word ilike '%independ%'; QUERY PLAN -- Bitmap Heap Scan on words (cost=4.36..40.11 rows=10 width=9) (actual time=2.445..2.529 rows=7 loops=1) Recheck Cond: (word ~~* '%independ%'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..4.35 rows=10 width=0) (actual time=2.406..2.406 rows=7 loops=1) Index Cond: (word ~~* '%independ%'::text) Total runtime: 2.612 ms (5 rows) CREATE INDEX trgm_idx ON words USING gin (word gin_trgm_ops); test=# explain analyze select * from words where word ilike '%independ%'; QUERY PLAN --- Bitmap Heap Scan on words (cost=76.08..111.83 rows=10 width=9) (actual time=2.675..2.755 rows=7 loops=1) Recheck Cond: (word ~~* '%independ%'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..76.07 rows=10 width=0) (actual time=2.642..2.642 rows=7 loops=1) Index Cond: (word ~~* '%independ%'::text) Total runtime: 2.839 ms (5 rows) I've encountered with following problems: 1) Indexing support for ilike is possible only with case-insensetive wildcards, e.g. when IGNORECASE macro is enabled. But I can't use this macro in pg_trgm.sql.in, where list of operators is defined. Probably, is it enuogh to put comment near IGNORECASE, which tells that if one disable this macro he should also remove oparators from pg_trgm.sql.in? 2) I found gist index not very useful with default SIGLENINT = 3. I've changed this value to 15 and I found gist index performs very good on dictionary. But on longer strings greater values of SIGLENINT may be required (probably even SIGLENINT 122 will give benefit in some cases in spite of TOAST). With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR1 % (text, text), OPERATOR2 - (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION2 gtrgm_union (bytea, internal), FUNCTION3 gtrgm_compress (internal), *** *** 144,149 CREATE OPERATOR CLASS gin_trgm_ops --- 146,153 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 19,24 --- 19,26 /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 55,61 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 91,101 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4