Re: [HACKERS] WIP: index support for regexp search
I found you committed GiST index implementation. That's cool. I found an easy way to optimize it. We can also use trigramsMatchGraph for signatures. Attached patch contains implementation. Simple example in order to demonstrate it: Before the patch: test=# explain (analyze, buffers) select * from words where s ~ '[abc]def'; QUERY PLAN -- Bitmap Heap Scan on words (cost=4.36..40.24 rows=10 width=9) (actual time=17.189..17.193 rows=3 loops=1) Recheck Cond: (s ~ '[abc]def'::text) Buffers: shared hit=858 - Bitmap Index Scan on words_trgm_idx (cost=0.00..4.36 rows=10 width=0) (actual time=17.172..17.172 rows=3 loops=1) Index Cond: (s ~ '[abc]def'::text) Buffers: shared hit=*857* Total runtime: 17.224 ms (7 rows) After the patch: test=# explain (analyze, buffers) select * from words where s ~ '[abc]def'; QUERY PLAN -- Bitmap Heap Scan on words (cost=4.36..40.24 rows=10 width=9) (actual time=13.718..13.721 rows=3 loops=1) Recheck Cond: (s ~ '[abc]def'::text) Buffers: shared hit=498 - Bitmap Index Scan on words_trgm_idx (cost=0.00..4.36 rows=10 width=0) (actual time=13.701..13.701 rows=3 loops=1) Index Cond: (s ~ '[abc]def'::text) Buffers: shared hit=*497* Total runtime: 13.786 ms (7 rows) -- With best regards, Alexander Korotkov. trgm-regexp-gist-optimize.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Mon, Apr 15, 2013 at 05:53:41PM +0400, Alexander Korotkov wrote: I found you committed GiST index implementation. That's cool. I found an easy way to optimize it. We can also use trigramsMatchGraph for signatures. Attached patch contains implementation. Simple example in order to demonstrate it: Before the patch: Buffers: shared hit=*857* After the patch: Buffers: shared hit=*497* Neato! Inside the patch, s/monotonous/monotone/, I think. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Alexander Korotkov aekorot...@gmail.com writes: I found you committed GiST index implementation. That's cool. I found an easy way to optimize it. We can also use trigramsMatchGraph for signatures. Attached patch contains implementation. Good idea, committed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Tue, Apr 9, 2013 at 9:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane t...@sss.pgh.pa.us wrote: I spent the weekend hacking on this, making a number of bug fixes and a whole lot of cosmetic changes. I think there are large parts of this that are in committable shape now, but I still find the actual graph transformation logic to be mostly unintelligible. I think what's most obscure is the distinction between the arcs list and the keys list of each state in the expanded graph. I get the impression that the general idea is for the arcs to represent exactly-known transitions while the keys represent imprecisely-known transitions ... but there seems to be at least some leakage between those categories. Could you write down a specification for what's supposed to be happening there? Here is my try to specify it. Thanks. I hacked on this some more and committed it. I found a number of bugs along the way with respect to handling of word boundaries (partially-blank transition trigrams) and EOL-color ($) handling. I think it's all fixed now but it could definitely use some more study and testing. Great, thanks! I also will do some more testing. One issue that bothered me is that the regression tests really don't provide much visibility into what the code is doing. Some of the bugs had to do with failing to generate expected trigrams, for instance col ~ 'foo bar' only generating trigram foo and not bar. This still led to getting the right answer, so the error was invisible as far as the tests were concerned. Is it worth thinking of a way to expose what the extract function did at SQL level, so we could test more carefully? Yes, I also had similar idea. But, I think we need some relatively stable representation of resulting graph in order to expose it. There could be a lot of equivalent graphs. Some changes in implementation could lead to change from one equivalent graph to another. It would be better to not rewrite tests in this case. Ideally, we should expose some representation which is the same for all equivalent graphs. However, it doesn't seem to be realistic. But, I think we could at least make it stable to order sequence of states and color trigrams. Another option I see is to expose just set of trigrams. It doesn't have completeness of information, but it is quite stable. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: [ trgm-regexp-0.15.patch.gz ] I spent the weekend hacking on this, making a number of bug fixes and a whole lot of cosmetic changes. I think there are large parts of this that are in committable shape now, but I still find the actual graph transformation logic to be mostly unintelligible. I think what's most obscure is the distinction between the arcs list and the keys list of each state in the expanded graph. I get the impression that the general idea is for the arcs to represent exactly-known transitions while the keys represent imprecisely-known transitions ... but there seems to be at least some leakage between those categories. Could you write down a specification for what's supposed to be happening there? Here is my try to specify it. At first some notions. I know, they are already in the comments, but just in order to put it together. Extended color - any color of source CNFA or one of two special values: 1) Unknown color - may represent any character either alphanumeric or non-alphanumeric. 2) Blank color - may represent any non-alphanumeric character Prefix is extended colors of last two characters read by CNFA. Key is pair of CNFA state and prefix. So, key is a extended state which is containing additional information which can influence further trigrams. So, if you are in some key and traverse some CNFA arc then you moves info another key. But there are two possible cases (or, sometimes, both of them): 1) Your move from one key into another necessary means read of some trigram. Then you create new arc labeled with that trigram. 2) You move into another key, but you doesn't necessary read an useful trigram. For example, arc of source CNFA is labeled by unextractable color. Then you add new key into keys array. And outgoing arcs from this key will also be processed similarly to source key. Therefore keys array is a set of keys which are achievable from stateKey without reading of useful trigram. We could get rid of keys array and produce some empty arcs in the second case. You can imagine that keys array is set of keys which are achivable by empty arcs from stateKey. States connected with empty arcs could be merged on the next stage. However, the reason of having separated addKeys stage is optimization. In addKey function more generals keys absorb less general ones. In many cases resulting graph becomes much simplier because of this. For example, if your regex is not prefix, then engine puts self-referencing arcs of all possible colors to initial state. Straight-forward processing of this could produce enormous output graph. I had similar situation in early version of patch where keys didn't absorb earch other. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
Alexander Korotkov aekorot...@gmail.com writes: On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane t...@sss.pgh.pa.us wrote: I spent the weekend hacking on this, making a number of bug fixes and a whole lot of cosmetic changes. I think there are large parts of this that are in committable shape now, but I still find the actual graph transformation logic to be mostly unintelligible. I think what's most obscure is the distinction between the arcs list and the keys list of each state in the expanded graph. I get the impression that the general idea is for the arcs to represent exactly-known transitions while the keys represent imprecisely-known transitions ... but there seems to be at least some leakage between those categories. Could you write down a specification for what's supposed to be happening there? Here is my try to specify it. Thanks. I hacked on this some more and committed it. I found a number of bugs along the way with respect to handling of word boundaries (partially-blank transition trigrams) and EOL-color ($) handling. I think it's all fixed now but it could definitely use some more study and testing. One issue that bothered me is that the regression tests really don't provide much visibility into what the code is doing. Some of the bugs had to do with failing to generate expected trigrams, for instance col ~ 'foo bar' only generating trigram foo and not bar. This still led to getting the right answer, so the error was invisible as far as the tests were concerned. Is it worth thinking of a way to expose what the extract function did at SQL level, so we could test more carefully? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Alexander Korotkov aekorot...@gmail.com writes: [ trgm-regexp-0.15.patch.gz ] I spent the weekend hacking on this, making a number of bug fixes and a whole lot of cosmetic changes. I think there are large parts of this that are in committable shape now, but I still find the actual graph transformation logic to be mostly unintelligible. I think what's most obscure is the distinction between the arcs list and the keys list of each state in the expanded graph. I get the impression that the general idea is for the arcs to represent exactly-known transitions while the keys represent imprecisely-known transitions ... but there seems to be at least some leakage between those categories. Could you write down a specification for what's supposed to be happening there? regards, tom lane trgm-regexp-0.16.patch.gz Description: trgm-regexp-0.16.patch.gz -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Tue, April 2, 2013 23:54, Alexander Korotkov wrote: [trgm-regexp-0.15.patch.gz] Yes, it does look good now; Attached a list of measurements. Most of the searches that I put in that test-program are now speeded up very much. There still are a few regressions, for example: HEAD azjunk6 x[aeiou]{4,5}q 83 Seq Scan 1393.465 ms trgm_regex15 azjunk6 x[aeiou]{4,5}q 83 Bitmap Heap Scan 1728.319 ms HEAD azjunk7 x[aeiou]{1,3}q 190031 Seq Scan 16819.555 ms trgm_regex15 azjunk7 x[aeiou]{1,3}q 190031 Bitmap Heap Scan 21286.804 ms Not exactly negligible, and ideally those regressions would be removed but with the huge advantages for other cases I'd say it's worth it. hth, Erik Rijkers re-head-13-15-20130403-0708.txt.bz2 Description: BZip2 compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Wed, Apr 3, 2013 at 11:10 AM, Erikjan Rijkers e...@xs4all.nl wrote: On Tue, April 2, 2013 23:54, Alexander Korotkov wrote: [trgm-regexp-0.15.patch.gz] Yes, it does look good now; Attached a list of measurements. Most of the searches that I put in that test-program are now speeded up very much. There still are a few regressions, for example: HEAD azjunk6 x[aeiou]{4,5}q 83 Seq Scan 1393.465 ms trgm_regex15 azjunk6 x[aeiou]{4,5}q 83 Bitmap Heap Scan 1728.319 ms HEAD azjunk7 x[aeiou]{1,3}q 190031 Seq Scan 16819.555 ms trgm_regex15 azjunk7 x[aeiou]{1,3}q 190031 Bitmap Heap Scan 21286.804 ms Not exactly negligible, and ideally those regressions would be removed but with the huge advantages for other cases I'd say it's worth it. Thank you for testing! Exploring results more detail I found version 13 to be buggy. This version is a dead end, we have quite different API now. Could you use v12 instead of v13 in comparison, please? Sometimes we have regression in comparison with head in two reasons: 1) We select index scan in both cases but with patch we spent more time for analysis. It's inevitable disadvantage of any index. We can only take care of analysis doesn't take too long. Current testing results don't show this reason to be significant. 2) Sometimes we select index scan while sequential scan would be faster. It's also inevitable disadvantage until we have a relevant statistics. We now have similar situation, for example, with in-core geometrical search and LIKE/ILIKE search in pg_trgm. However, probably, situation could be improved somehow even without such statistics. But I think we can do such conclusion based on synthetical testing, because improvements for synthetical cases could appear to be an worsening for real-life cases. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Mon, April 1, 2013 23:15, Alexander Korotkov wrote: [trgm-regexp-0.14.patch.gz] Hi Alexander, Something went wrong in this version of the patch: many (most) queries that were earlier spectacularly fast have become slow, often slower than a seqscan or only marginally faster. See the attached numbers; it compares head(seqscan) with trgm-regex patch versions 13 and 14. I did not even complete the test-run because version 14 is so clearly inferior to 13 (and earlier, as far as I can remember). (let me know if you want the whole result, I can run it overnight) Thanks, Erik Rijkers re-head-13-14-20130402-2219.txt.bz2 Description: BZip2 compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Wed, Apr 3, 2013 at 12:36 AM, Erikjan Rijkers e...@xs4all.nl wrote: On Mon, April 1, 2013 23:15, Alexander Korotkov wrote: [trgm-regexp-0.14.patch.gz] Hi Alexander, Hi Erik! Something went wrong in this version of the patch: many (most) queries that were earlier spectacularly fast have become slow, often slower than a seqscan or only marginally faster. See the attached numbers; it compares head(seqscan) with trgm-regex patch versions 13 and 14. I did not even complete the test-run because version 14 is so clearly inferior to 13 (and earlier, as far as I can remember). (let me know if you want the whole result, I can run it overnight) Yes, there was bug in new color map scan implementation. It was fixed in attached version. Could you re-run tests please? Attached version of patch also addresses problem that we now using API it's not correct to use special values of color number. It introduces additional structure ExtendedColor which contain ExtendedColorClass field for handling additional virtual colors introduced by analysis. -- With best regards, Alexander Korotkov. trgm-regexp-0.15.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Mon, Mar 25, 2013 at 2:38 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane t...@sss.pgh.pa.us wrote: Similarly, pushing PG-specific declarations like RE_compile_and_cache() into regex/regex.h is completely not the right thing for preserving a clear library boundary (even positing that we want to expose that function outside adt/regexp.c, which I'd rather we didn't). Perhaps we could avoid these issues by defining a library API that provides accessors like these for the opaque regex_t struct: * get the number of states in the CNFA * get the numbers of the initial and final states * get the number of out-arcs for the N'th state * get the out-arcs for the N'th state into a caller-provided array (sized using the previous function), where each out-arc is represented by a color and an end-state * get the number of character codes represented by color C * get the wchar codes for color C into a caller-provided array (The reason for letting the caller allocate the result arrays is so we can use palloc for that; if we allocate it in backend/regex/ we must use malloc, which will greatly increase the risk of leakages. Also, as far as the color API goes, the above lets the caller decide how many characters is too many to bother with.) I like the this idea. Seems like clear and not over-engineered API. I can implement it. Could you propose something particular to do with RE_compile_and_cache in this patch? With this API we still need a way to get regex_t or equivalent from string. Well, the brute force answer is for pg_trgm to not go through RE_compile_and_cache at all, but just call the regex library for itself. That would lose the ability to cache regex compilations, but I'm not sure we care. The motivation for RE_compile_and_cache is mainly to prevent having to compile the regex again for *every row*, which is what would otherwise happen in a simple SELECT using a regex function. Unless I'm misunderstanding something, pg_trgm would only need to compile the regex once per indexscan, which is probably tolerable. For me it makes sense. Patch with proposed API implementation is attached. -- With best regards, Alexander Korotkov. trgm-regexp-0.14.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Alexander Korotkov aekorot...@gmail.com writes: Now I have working implemetation of this API. Comments still need rework. Could you give me any feedback? I looked at this a little bit, but it's not very far along at all towards resolving my API worries. The basic point that I'm concerned about is that we would like to split off the regex library (ie, backend/regex/) as a standalone project someday. There are already some problems to be resolved to make that possible, and I don't want this patch to introduce new ones. From that standpoint, the proposed regextract.c file is a disaster, because it depends on a boatload of Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar; to say nothing of palloc). We can't consider that to be on the regex side of the fence. Similarly, pushing PG-specific declarations like RE_compile_and_cache() into regex/regex.h is completely not the right thing for preserving a clear library boundary (even positing that we want to expose that function outside adt/regexp.c, which I'd rather we didn't). Given that you've already got a notion of callbacks provided by contrib/pg_trgm, perhaps this can be fixed by pushing more of the work into those callbacks, so that the heavy-duty data structures like the hash table live over there and the API exposed by backend/regex/ is at a much simpler level than what you have here. But right now I don't see any usable library API here. Perhaps we could avoid these issues by defining a library API that provides accessors like these for the opaque regex_t struct: * get the number of states in the CNFA * get the numbers of the initial and final states * get the number of out-arcs for the N'th state * get the out-arcs for the N'th state into a caller-provided array (sized using the previous function), where each out-arc is represented by a color and an end-state * get the number of character codes represented by color C * get the wchar codes for color C into a caller-provided array (The reason for letting the caller allocate the result arrays is so we can use palloc for that; if we allocate it in backend/regex/ we must use malloc, which will greatly increase the risk of leakages. Also, as far as the color API goes, the above lets the caller decide how many characters is too many to bother with.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: Now I have working implemetation of this API. Comments still need rework. Could you give me any feedback? I looked at this a little bit, but it's not very far along at all towards resolving my API worries. The basic point that I'm concerned about is that we would like to split off the regex library (ie, backend/regex/) as a standalone project someday. There are already some problems to be resolved to make that possible, and I don't want this patch to introduce new ones. From that standpoint, the proposed regextract.c file is a disaster, because it depends on a boatload of Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar; to say nothing of palloc). We can't consider that to be on the regex side of the fence. Now I can see your position about API much more clearly. Previously I thought only about algorithmic side of things. Similarly, pushing PG-specific declarations like RE_compile_and_cache() into regex/regex.h is completely not the right thing for preserving a clear library boundary (even positing that we want to expose that function outside adt/regexp.c, which I'd rather we didn't). Given that you've already got a notion of callbacks provided by contrib/pg_trgm, perhaps this can be fixed by pushing more of the work into those callbacks, so that the heavy-duty data structures like the hash table live over there and the API exposed by backend/regex/ is at a much simpler level than what you have here. But right now I don't see any usable library API here. Perhaps we could avoid these issues by defining a library API that provides accessors like these for the opaque regex_t struct: * get the number of states in the CNFA * get the numbers of the initial and final states * get the number of out-arcs for the N'th state * get the out-arcs for the N'th state into a caller-provided array (sized using the previous function), where each out-arc is represented by a color and an end-state * get the number of character codes represented by color C * get the wchar codes for color C into a caller-provided array (The reason for letting the caller allocate the result arrays is so we can use palloc for that; if we allocate it in backend/regex/ we must use malloc, which will greatly increase the risk of leakages. Also, as far as the color API goes, the above lets the caller decide how many characters is too many to bother with.) I like the this idea. Seems like clear and not over-engineered API. I can implement it. Could you propose something particular to do with RE_compile_and_cache in this patch? With this API we still need a way to get regex_t or equivalent from string. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
Alexander Korotkov aekorot...@gmail.com writes: On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane t...@sss.pgh.pa.us wrote: Similarly, pushing PG-specific declarations like RE_compile_and_cache() into regex/regex.h is completely not the right thing for preserving a clear library boundary (even positing that we want to expose that function outside adt/regexp.c, which I'd rather we didn't). Perhaps we could avoid these issues by defining a library API that provides accessors like these for the opaque regex_t struct: * get the number of states in the CNFA * get the numbers of the initial and final states * get the number of out-arcs for the N'th state * get the out-arcs for the N'th state into a caller-provided array (sized using the previous function), where each out-arc is represented by a color and an end-state * get the number of character codes represented by color C * get the wchar codes for color C into a caller-provided array (The reason for letting the caller allocate the result arrays is so we can use palloc for that; if we allocate it in backend/regex/ we must use malloc, which will greatly increase the risk of leakages. Also, as far as the color API goes, the above lets the caller decide how many characters is too many to bother with.) I like the this idea. Seems like clear and not over-engineered API. I can implement it. Could you propose something particular to do with RE_compile_and_cache in this patch? With this API we still need a way to get regex_t or equivalent from string. Well, the brute force answer is for pg_trgm to not go through RE_compile_and_cache at all, but just call the regex library for itself. That would lose the ability to cache regex compilations, but I'm not sure we care. The motivation for RE_compile_and_cache is mainly to prevent having to compile the regex again for *every row*, which is what would otherwise happen in a simple SELECT using a regex function. Unless I'm misunderstanding something, pg_trgm would only need to compile the regex once per indexscan, which is probably tolerable. Another approach we could take is for adt/regexp.c to expose a function that collects the deconstructed CNFA data into a palloc'd structure using the above-proposed functions. This'd be a reasonable way to go if we decide the caching behavior is actually needed for this purpose; but otherwise it seems like it's just involving one more module than necessary. BTW, one reason I don't like exposing RE_compile_and_cache directly is that then you have the question of how long is the regex_t it hands back good for. The caller doesn't own that struct, and at some point regexp.c is likely to recycle it, so it's a bit risky to just assume it'll stay around long enough without any further interaction with regexp.c. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Thu, Mar 14, 2013 at 9:40 PM, Alexander Korotkov aekorot...@gmail.comwrote: On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 23.01.2013 09:36, Alexander Korotkov wrote: On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us wrote: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. The only option I see now is to provide a method like export_cnfa which would export corresponding CNFA in fixed format. Yeah, I think that makes sense. The transformation code in trgm_regexp.c would probably be more readable too, if it didn't have to deal with the regex guts representation of the CNFA. Also, once you have intermediate representation of the original CNFA, you could do some of the transformation work on that representation, before building the tranformed graph containing trigrams. You could eliminate any non-alphanumeric characters, joining states connected by arcs with non-alphanumeric characters, for example. It's not just the CNFA though; the other big API problem is with mapping colors back to characters. Right now, that not only knows way too much about a part of the regex internals we have ambitions to change soon, but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of which should be known to the regex library IMO. So I'm not sure how we divvy that up sanely. To be clear: I'm not going to insist that we have to have a clean API factorization before we commit this at all. But it worries me if we don't even know how we could get to that, because we are going to need it eventually. Now I have following idea about API. Put code of stage 2 (transform the original CNFA into an automaton-like graph) into regex engine. It would use API which describes what exactly are we going to extract from CNFA. This API could look like this. typedef char *Prefix; typedef char *ArcLabel; typedef struct { Prefix newPrefix; ArcLabel label; } ArcInfo; typedef struct { Prefix (*getInitialPrefix) (); bool (*prefixContains) (Prefix prefix1, Prefix prefix2); Prefix * (*getPrefixes) (Prefix prefix, color c, int *n); ArcInfo * (*getArcs) (Prefix prefix, color c, int *n); void (*freePrefix) (Prefix prefix); void (*freeArcLabel) (ArcLabel arcLabel); } CFNATransformAPI; getInitialPrefix returns initial prefix value like now this code does: initkey.prefix.colors[0] = UNKNOWN_COLOR; initkey.prefix.colors[1] = UNKNOWN_COLOR; prefixContains are exactly same as function with this name. getPrefixes and getArcs cycle step work of addKeys an addArcs. freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel strutures. Additionally regex engine should provide correct way to examine colormap. int getColorCharsCount(colormap *cm, color c); pg_wchar *getColorChars(colormap *cm, color c); getColorCharsCount would return -1 if this color should be considered as unexpandable. Now I have working implemetation of this API. Comments still need rework. Could you give me any feedback? -- With best regards, Alexander Korotkov. trgm-regexp-0.13.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 23.01.2013 09:36, Alexander Korotkov wrote: On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us wrote: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. The only option I see now is to provide a method like export_cnfa which would export corresponding CNFA in fixed format. Yeah, I think that makes sense. The transformation code in trgm_regexp.c would probably be more readable too, if it didn't have to deal with the regex guts representation of the CNFA. Also, once you have intermediate representation of the original CNFA, you could do some of the transformation work on that representation, before building the tranformed graph containing trigrams. You could eliminate any non-alphanumeric characters, joining states connected by arcs with non-alphanumeric characters, for example. It's not just the CNFA though; the other big API problem is with mapping colors back to characters. Right now, that not only knows way too much about a part of the regex internals we have ambitions to change soon, but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of which should be known to the regex library IMO. So I'm not sure how we divvy that up sanely. To be clear: I'm not going to insist that we have to have a clean API factorization before we commit this at all. But it worries me if we don't even know how we could get to that, because we are going to need it eventually. Now I have following idea about API. Put code of stage 2 (transform the original CNFA into an automaton-like graph) into regex engine. It would use API which describes what exactly are we going to extract from CNFA. This API could look like this. typedef char *Prefix; typedef char *ArcLabel; typedef struct { Prefix newPrefix; ArcLabel label; } ArcInfo; typedef struct { Prefix (*getInitialPrefix) (); bool (*prefixContains) (Prefix prefix1, Prefix prefix2); Prefix * (*getPrefixes) (Prefix prefix, color c, int *n); ArcInfo * (*getArcs) (Prefix prefix, color c, int *n); void (*freePrefix) (Prefix prefix); void (*freeArcLabel) (ArcLabel arcLabel); } CFNATransformAPI; getInitialPrefix returns initial prefix value like now this code does: initkey.prefix.colors[0] = UNKNOWN_COLOR; initkey.prefix.colors[1] = UNKNOWN_COLOR; prefixContains are exactly same as function with this name. getPrefixes and getArcs cycle step work of addKeys an addArcs. freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel strutures. Additionally regex engine should provide correct way to examine colormap. int getColorCharsCount(colormap *cm, color c); pg_wchar *getColorChars(colormap *cm, color c); getColorCharsCount would return -1 if this color should be considered as unexpandable. Any thoughts? -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 23.01.2013 09:36, Alexander Korotkov wrote: On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us wrote: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. The only option I see now is to provide a method like export_cnfa which would export corresponding CNFA in fixed format. Yeah, I think that makes sense. The transformation code in trgm_regexp.c would probably be more readable too, if it didn't have to deal with the regex guts representation of the CNFA. Also, once you have intermediate representation of the original CNFA, you could do some of the transformation work on that representation, before building the tranformed graph containing trigrams. You could eliminate any non-alphanumeric characters, joining states connected by arcs with non-alphanumeric characters, for example. It's not just the CNFA though; the other big API problem is with mapping colors back to characters. Right now, that not only knows way too much about a part of the regex internals we have ambitions to change soon, but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of which should be known to the regex library IMO. So I'm not sure how we divvy that up sanely. To be clear: I'm not going to insist that we have to have a clean API factorization before we commit this at all. But it worries me if we don't even know how we could get to that, because we are going to need it eventually. Now, we probably don't have enough of time before 9.3 to solve an API problem :(. It's likely we have to choose either commit to 9.3 without clean API factorization or postpone it to 9.4. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
* Alexander Korotkov (aekorot...@gmail.com) wrote: Now, we probably don't have enough of time before 9.3 to solve an API problem :(. It's likely we have to choose either commit to 9.3 without clean API factorization or postpone it to 9.4. As much as I'd like this to get in, I don't think there's a question here- it should be punted to 9.4. These don't look like small issues to me. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] WIP: index support for regexp search
On Sun, Jan 27, 2013 at 10:40 PM, Alexander Korotkov aekorot...@gmail.comwrote: Now I'm working on additional comments. Some comments were added for addKey and addArc(s). I hope they clarify something. -- With best regards, Alexander Korotkov. trgm-regexp-0.12.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 25, 2013 at 11:47 AM, Erik Rijkers e...@xs4all.nl wrote: On Wed, January 23, 2013 08:36, Alexander Korotkov wrote: Hi! Some quick answers to the part of notes/issues. I will provide rest of answers soon. [...] trgm-regexp-0.10.patch.gz27 k Trying to build this I get, after 'make install' in contrib/ : /usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or directory /usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or directory make[1]: *** [install] Error 1 make: *** [install-pg_trgm-recurse] Error 2 Forgot to include these files into patch. Another changes in new version of patch: 1) Get rid of recursion in trigramsMatchGraph and addKeys. 2) Both bos[0] and eos[0] are also included into checks. 3) Get rid of strnlen. 4) I found the way to fix work with collation in previous version of patch to be wrong. Collation of operator must match collation of indexed column for index scan. Only thing to fix is passing collation in gincost_pattern. 5) Some tests were added which improves the coverage. Now I'm working on additional comments. -- With best regards, Alexander Korotkov. trgm-regexp-0.11.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Wed, January 23, 2013 08:36, Alexander Korotkov wrote: Hi! Some quick answers to the part of notes/issues. I will provide rest of answers soon. [...] trgm-regexp-0.10.patch.gz27 k Trying to build this I get, after 'make install' in contrib/ : /usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or directory /usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or directory make[1]: *** [install] Error 1 make: *** [install-pg_trgm-recurse] Error 2 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On 23.01.2013 09:36, Alexander Korotkov wrote: Hi! Some quick answers to the part of notes/issues. I will provide rest of answers soon. On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us wrote: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. Sooner or later we are going to want to split the regex code off again as a standalone library, and we *must* have a cleaner division of labor if that is ever to happen. Not sure what a suitable API would look like though. The only option I see now is to provide a method like export_cnfa which would export corresponding CNFA in fixed format. Yeah, I think that makes sense. The transformation code in trgm_regexp.c would probably be more readable too, if it didn't have to deal with the regex guts representation of the CNFA. Also, once you have intermediate representation of the original CNFA, you could do some of the transformation work on that representation, before building the tranformed graph containing trigrams. You could eliminate any non-alphanumeric characters, joining states connected by arcs with non-alphanumeric characters, for example. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Heikki Linnakangas hlinnakan...@vmware.com writes: On 23.01.2013 09:36, Alexander Korotkov wrote: On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us wrote: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. The only option I see now is to provide a method like export_cnfa which would export corresponding CNFA in fixed format. Yeah, I think that makes sense. The transformation code in trgm_regexp.c would probably be more readable too, if it didn't have to deal with the regex guts representation of the CNFA. Also, once you have intermediate representation of the original CNFA, you could do some of the transformation work on that representation, before building the tranformed graph containing trigrams. You could eliminate any non-alphanumeric characters, joining states connected by arcs with non-alphanumeric characters, for example. It's not just the CNFA though; the other big API problem is with mapping colors back to characters. Right now, that not only knows way too much about a part of the regex internals we have ambitions to change soon, but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of which should be known to the regex library IMO. So I'm not sure how we divvy that up sanely. To be clear: I'm not going to insist that we have to have a clean API factorization before we commit this at all. But it worries me if we don't even know how we could get to that, because we are going to need it eventually. Anyway, I had another thought in the shower this morning, which is that solving this problem in terms of color trigrams is really the Wrong Thing. ISTM it'd be better to think of the CNFA traversal logic as looking for required sequences of colors of unspecified length, which we'd then chop into trigrams after the fact. This might produce slightly better (more complete) trigram sets, but the real reason I'm suggesting it is that I think the CNFA traversal code might be subject to Polya's Inventor's Paradox: the more general problem may be easier to solve. It seems like casting the goal of that code as being to find variable-length sequences, rather than exactly trigrams, might lead to simpler data structures and more readable algorithms. The still-to-be-designed regex API for this also seems like it would be better off if decoupled from the notion of trigrams. It's quite possible that this idea is all wet and no meaningful improvement can be gotten this way. But I offer it for your consideration. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Heikki Linnakangas hlinnakan...@vmware.com writes: I finally got around to look at this. I like this new version, without the path matrix, much better. I looked through this version too. I have some notes/issues: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. Sooner or later we are going to want to split the regex code off again as a standalone library, and we *must* have a cleaner division of labor if that is ever to happen. Not sure what a suitable API would look like though. I think the assumption that all MB characters fit in 4 bytes is unacceptable; someday we'll want to support wider Unicode characters than we do now, and this code seems utterly unable to handle it. It's especially bad that the code isn't even bothering to defend itself against the possibility of wider characters. Can't just modify pg_trgm--1.0.sql in place, must create a 1.1 version and an upgrade script. Comments and documentation still need a lot of copy-editing, also I think a lot of the comment blocks will not survive pg_indent. It'd be a good idea to run trgm_regexp.c through pg_indent as soon as you have that fixed. New file trgm_regexp.c lacks a copyright notice Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good enough; need to pass through the actual collation for the regex operator. How deep can the recursion in activateState() go? Is this a practical production approach at all? Do we need a stack depth check there? addKeys is also recursive, same questions. (But on the other hand, the check_stack_depth in scanColorMap seems useless, since its recursion depth is fixed.) Not too happy with convertPgWchar: aside from hard-wired, unchecked assumption about maximum length of pg_wchar2mb_with_len result, why is it that this is doing a lowercase conversion? Surely the regex stuff dealt with that already? I'm suspicious of the code in addKeys() that is special-casing bos[1] and eos[1] but not the other cases (BOL/EOL). My recollection from working on the fixed-prefix stuff is that it's not always obvious which of those states gets used. strnlen() used in fillTrgm() is probably not portable, and is definitely not used anywhere else in Postgres. This isn't a complete review, just some stuff I happened to notice in one quick pass over the code. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Hi! Some quick answers to the part of notes/issues. I will provide rest of answers soon. On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane t...@sss.pgh.pa.us wrote: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. Sooner or later we are going to want to split the regex code off again as a standalone library, and we *must* have a cleaner division of labor if that is ever to happen. Not sure what a suitable API would look like though. The only option I see now is to provide a method like export_cnfa which would export corresponding CNFA in fixed format. I think the assumption that all MB characters fit in 4 bytes is unacceptable; someday we'll want to support wider Unicode characters than we do now, and this code seems utterly unable to handle it. It's especially bad that the code isn't even bothering to defend itself against the possibility of wider characters. In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use it in type definition. Is this way ok? Can't just modify pg_trgm--1.0.sql in place, must create a 1.1 version and an upgrade script. Fixed. Comments and documentation still need a lot of copy-editing, also I think a lot of the comment blocks will not survive pg_indent. It'd be a good idea to run trgm_regexp.c through pg_indent as soon as you have that fixed. Fixed. New file trgm_regexp.c lacks a copyright notice Fixed. Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good enough; need to pass through the actual collation for the regex operator. We have collation passed to gin_extract_query in ginscan.c. I noticed that gincost_pattern don't pass collation to gin_extract_query. Is it a bug? Anyway this is not collation we need. We need collation used in operator clause. In attached patch I introduce additional argument to gin_extract_query which represent collation of operator clause. Do you think it is reasonable change in GIN interface? If so, I will provide it as separate patch. Not too happy with convertPgWchar: aside from hard-wired, unchecked assumption about maximum length of pg_wchar2mb_with_len result, why is it that this is doing a lowercase conversion? Surely the regex stuff dealt with that already? Trigrams are already lowercased. We can simplify our calculations by excluding uppercased characters from consideration. -- With best regards, Alexander Korotkov. trgm-regexp-0.10.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Tue, Dec 18, 2012 at 11:45 AM, Erik Rijkers e...@xs4all.nl wrote: On Tue, December 18, 2012 08:04, Alexander Korotkov wrote: I ran the same test again: HEAD versus trgm_regex v6, 7 and 9. In v9 there is some gain but also some regression. It remains a difficult problem... If I get some time in the holidays I'll try to diversify the test program; it is now too simple. Note, that regexes which contains {,n} are likely not what do you expect. test=# select 'xq' ~ 'x[aeiou]{,2}q'; ?column? -- f (1 row) test=# select 'xa{,2}q' ~ 'x[aeiou]{,2}q'; ?column? -- t (1 row) You should use {0,n} to express from 0 to n occurences. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Tue, December 18, 2012 09:45, Alexander Korotkov wrote: You should use {0,n} to express from 0 to n occurences. Thanks, but I know that of course. It's a testing program; and in the end robustness with unexpected or even wrong input is as important as performance. (to put it bluntly, I am also trying to get your patch to fall over ;-)) Thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Tue, Dec 18, 2012 at 12:51 PM, Erik Rijkers e...@xs4all.nl wrote: On Tue, December 18, 2012 09:45, Alexander Korotkov wrote: You should use {0,n} to express from 0 to n occurences. Thanks, but I know that of course. It's a testing program; and in the end robustness with unexpected or even wrong input is as important as performance. (to put it bluntly, I am also trying to get your patch to fall over ;-)) I found most of regressions in 0.9 version to be in {,n} cases. New version of patch use more of trigrams than previous versions. For example for regex 'x[aeiou]{,2}q'. In 0.7 version we use trigrams '__2', '_2_' and '__q'. In 0.9 version we use trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_', '__2', '_2_' and '__q'. But, actually trigram '__2' or '_2_' never occurs. It enough to have one of them, all others are just causing a slowdown. Simultaneously, we can't decide reasonably which trigrams to use without knowing their frequencies. For example, if trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_' were altogether more rare than '__2', newer version of patch would be faster. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Sun, December 16, 2012 22:25, Alexander Korotkov wrote: trgm-regexp-0.8.patch.gz 22 k Hi Alexander, I gave this a quick try; the patch works when compiled for DEBUG, but crashes as a 'speed'-compiled binary: Compile for speed: $ pg_config --configure '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8' '--with-pgport=6556' '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' $ psql psql (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761) Type help for help. testdb=# explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q'; The connection to the server was lost. Attempting reset: Failed. ! \q log after such: - 2012-12-17 09:31:02.337 CET 15801 LOG: server process (PID 16903) was terminated by signal 11: Segmentation fault 2012-12-17 09:31:02.337 CET 15801 DETAIL: Failed process was running: explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q'; 2012-12-17 09:31:02.347 CET 15801 LOG: terminating any other active server processes 2012-12-17 09:31:02.348 CET 17049 FATAL: the database system is in recovery mode 2012-12-17 09:31:02.722 CET 15801 LOG: all server processes terminated; reinitializing 2012-12-17 09:31:03.506 CET 17052 LOG: database system was interrupted; last known up at 2012-12-17 09:26:00 CET 2012-12-17 09:31:03.693 CET 17052 LOG: database system was not properly shut down; automatic recovery in progress 2012-12-17 09:31:04.493 CET 17052 LOG: record with zero length at 2/7E3C7588 2012-12-17 09:31:04.494 CET 17052 LOG: redo is not required 2012-12-17 09:31:06.940 CET 15801 LOG: database system is ready to accept connections A debug-compile with below options runs OK (so far): Compile for debug: $ pg_config --configure '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8b' '--with-pgport=6560' '--enable-depend' '--enable-cassert' '--enable-debug' '--with-openssl' '--with-perl' '--with-libxml' which does show some speed gain, I think. When I have time I'll post comparisons between HEAD, versions 6, 7, 8. (BTW, is v6 still interesting?) Thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Hi! On Mon, Dec 17, 2012 at 12:54 PM, Erik Rijkers e...@xs4all.nl wrote: On Sun, December 16, 2012 22:25, Alexander Korotkov wrote: trgm-regexp-0.8.patch.gz 22 k Hi Alexander, I gave this a quick try; the patch works when compiled for DEBUG, but crashes as a 'speed'-compiled binary: Compile for speed: $ pg_config --configure '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8' '--with-pgport=6556' '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' $ psql psql (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761) Type help for help. testdb=# explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q'; The connection to the server was lost. Attempting reset: Failed. ! \q Didn't reproduce it yet. Can you retry it with this line uncommented: #define TRGM_REGEXP_DEBUG Then we can see which stage it fails. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Mon, Dec 17, 2012 at 1:16 PM, Alexander Korotkov aekorot...@gmail.comwrote: Didn't reproduce it yet. Can you retry it with this line uncommented: #define TRGM_REGEXP_DEBUG Then we can see which stage it fails. Bug is found and fixed in attached patch. -- With best regards, Alexander Korotkov. trgm-regexp-0.9.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Tue, December 18, 2012 08:04, Alexander Korotkov wrote: trgm-regexp-0.9.patch.gz 22 k Hi. I ran the same test again: HEAD versus trgm_regex v6, 7 and 9. In v9 there is some gain but also some regression. It remains a difficult problem... If I get some time in the holidays I'll try to diversify the test program; it is now too simple. Thanks, Erik Rijkers trgmre_6_7_9.txt.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Fri, Dec 14, 2012 at 1:34 AM, Alexander Korotkov aekorot...@gmail.comwrote: On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov aekorot...@gmail.comwrote: Actually, I generally dislike path matrix for same reasons. But: 1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram while graph would contain much more.Path matrix is a method to get rid of all of them. 2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix. Now, I have an idea about doing some not comprehensive but simple and fast simplification of graph. I'm doing experiments now. In case of success we could get rid of path matrix. Attached patch have following changes: 1) Postphone expansion of colors. Graph are building on color trigrams. 2) Selective expansion of color trigrams into simple trigrams. All non-expanded color trigrams are removed. Such removal leads to union of all states pairs connected with corresponding arcs. Surely, this must no lead to union of initial and final states: that could do all previous work senseless. -- With best regards, Alexander Korotkov. trgm-regexp-0.8.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov aekorot...@gmail.comwrote: Actually, I generally dislike path matrix for same reasons. But: 1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram while graph would contain much more.Path matrix is a method to get rid of all of them. 2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix. Now, I have an idea about doing some not comprehensive but simple and fast simplification of graph. I'm doing experiments now. In case of success we could get rid of path matrix. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On 02.12.2012 20:19, Tom Lane wrote: Alexander Korotkovaekorot...@gmail.com writes: Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement... Uh, why would that be? Colors are colors. The regexp machinery doesn't care whether they represent alphanumerics or not. (Or to be more precise, if there is a situation where it makes a difference, separate colors will have been created for each set of characters that need to be distinguished.) The regexp machinery doesn't care, but the trigrams that pg_trgm extracts only contain alphanumeric characters. So if by looking at the CNFA graph produced by the regexp machinery you conclude that any matching strings must contain three-letter sequences %oo and #oo, you can just club them together into oo trigram. I think you can run a pre-processing step to the colors, and merge colors that are equivalent as far as trigrams are considered. For example, if you have a color that contains only character '%', and another that contains character '#', you can treat them as the same hcolor. You might then be able to simplify the CNFA. Actually, it would be even better if you could apply the pre-processing to the regexp before the regexp machinery turns it into a CNFA. Not sure how easy it would be to do such pre-processing. BTW, why create the path matrix? You could check the check array of trigrams in the consistent function directly against the graph. Consistent should return true, iff there is a path through the graph following only arcs that contain trigrams present in the check array. Finding a path through a complex graph could be expensive, O(|E|), but if the path is complex, the path matrix would be large as well, and checking against a large matrix isn't exactly free either. It would allow you to avoid overflows caused by having too many paths through the graph. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas hlinnakan...@vmware.comwrote: On 02.12.2012 20:19, Tom Lane wrote: Alexander Korotkovaekorot...@gmail.com writes: Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement... Uh, why would that be? Colors are colors. The regexp machinery doesn't care whether they represent alphanumerics or not. (Or to be more precise, if there is a situation where it makes a difference, separate colors will have been created for each set of characters that need to be distinguished.) The regexp machinery doesn't care, but the trigrams that pg_trgm extracts only contain alphanumeric characters. So if by looking at the CNFA graph produced by the regexp machinery you conclude that any matching strings must contain three-letter sequences %oo and #oo, you can just club them together into oo trigram. I think you can run a pre-processing step to the colors, and merge colors that are equivalent as far as trigrams are considered. For example, if you have a color that contains only character '%', and another that contains character '#', you can treat them as the same hcolor. You might then be able to simplify the CNFA. Actually, it would be even better if you could apply the pre-processing to the regexp before the regexp machinery turns it into a CNFA. Not sure how easy it would be to do such pre-processing. Treating colors as same should be possible only for colors which has no alphanumeric characters, because colors are non-overlapping. However, this optimization could be significant in some cases. BTW, why create the path matrix? You could check the check array of trigrams in the consistent function directly against the graph. Consistent should return true, iff there is a path through the graph following only arcs that contain trigrams present in the check array. Finding a path through a complex graph could be expensive, O(|E|), but if the path is complex, the path matrix would be large as well, and checking against a large matrix isn't exactly free either. It would allow you to avoid overflows caused by having too many paths through the graph. Actually, I generally dislike path matrix for same reasons. But: 1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram while graph would contain much more.Path matrix is a method to get rid of all of them. 2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 30.11.2012 13:20, Alexander Korotkov wrote: On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangashlinnakangas@** vmware.com hlinnakan...@vmware.com wrote: Would it be safe to simply stop short the depth-first search on overflow, and proceed with the graph that was constructed up to that point? For depth-first it's not. But your proposal naturally makes sense. I've changed it to breadth-first search. And then it's safe to mark all unprocessed states as final when overflow. It means that we assume every unprocessed branch to immediately finish with matching (this can give us more false positives but no false negatives). For overflow of matrix collection, it's safe to do just OR between all the trigrams. New version of patch is attached. Thanks, sounds good. I've spent quite a long time trying to understand the transformation the getState/addKeys/addAcrs functions do to the original CNFA graph. I think that still needs more comments to explain the steps involved in it. One thing that occurs to me is that it might be better to delay expanding colors to characters. You could build and transform the graph, and even create the paths, all while operating on colors. You would end up with lists of color trigrams, consisting of sequences of three colors that must appear in the source string. Only at the last step you would expand the color trigrams to real character trigrams. I think that would save a lot of processing while building the graph, if you have colors that contain many characters. It would allow us to do better than the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current patch, it's a shame that we can't do anything with a fairly simple regexp like ^a[b-g]h$ Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement... -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Sat, Dec 1, 2012 at 3:22 PM, Erik Rijkers e...@xs4all.nl wrote: On Fri, November 30, 2012 12:22, Alexander Korotkov wrote: Hi! On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote: On Mon, November 26, 2012 20:49, Alexander Korotkov wrote: I ran the simple-minded tests against generated data (similar to the ones I did in January 2012). The problems of that older version seem pretty much all removed. (although I didn't do much work on it -- just reran these tests). Thanks a lot for testing! Could you repeat for 0.7 version of patch which has new overflow handling? I've attached a similar test re-run that compares HEAD with patch versions 0.6, and 0.7. Thanks! Did you write scripts for automated testing? I would be nice if you share them. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
Alexander Korotkov aekorot...@gmail.com writes: Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement... Uh, why would that be? Colors are colors. The regexp machinery doesn't care whether they represent alphanumerics or not. (Or to be more precise, if there is a situation where it makes a difference, separate colors will have been created for each set of characters that need to be distinguished.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Sun, December 2, 2012 19:07, Alexander Korotkov wrote: I've attached a similar test re-run that compares HEAD with patch versions 0.6, and 0.7. Thanks! Did you write scripts for automated testing? I would be nice if you share them. Sure, here they are. The perl program does depend a bit on my particular setup (it reads the port from the postgresql.conf of each instance, and the script knows the data_dir locations), but I suppose it's easy enough to remove that. (Just hardcode the %instances hash, instead of calling instances().) If you need help to get them to run I can 'generalise' them, but for now I'll send them as they are. Erik Rijkers create_data.sh Description: application/shellscript trgm_regex_test.pl Description: Perl program -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Fri, November 30, 2012 12:22, Alexander Korotkov wrote: Hi! On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote: On Mon, November 26, 2012 20:49, Alexander Korotkov wrote: I ran the simple-minded tests against generated data (similar to the ones I did in January 2012). The problems of that older version seem pretty much all removed. (although I didn't do much work on it -- just reran these tests). Thanks a lot for testing! Could you repeat for 0.7 version of patch which has new overflow handling? I've attached a similar test re-run that compares HEAD with patch versions 0.6, and 0.7. Erik Rijkers trgm_compare.txt.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: One thing that bothers me with this algoritm is that the overflow mechanism is all-or-nothing. In many cases, even when there is a huge number of states in the diagram, you could still extract at least a few trigrams that must be present in any matching string, with little effort. At least, it seems like that to a human :-). For example, consider this: explain analyze select count(*) from azjunk4 where txt ~ ('^** aabaacaadaaeaafaagaahaaiaajaak**aalaamaanaaoaapaaqaaraasaataau** aavaawaaxaayaazabaabbabcabdabe**abfabgabhabiabjabkablabmabnabo** abpabqabrabsabtabuabvabwabxaby**abzacaacbaccacdaceacfacgachaci** acjackaclacmacnacoacpacqacracs**actacuacvacwacxacyaczadaadbadc** addadeadfadgadhadiadjadkadladm**adnadoadpadqadradsadtaduadvadw** adxadyadzaeaaebaecaedaeeaefaeg**aehaeiaejaekaelaemaenaeoaepaeq** aeraesaetaeuaevaewaexaeyaezafa**afbafcafdafeaffafgafhafiafjafk** aflafmafnafoafpafqafrafsaftafu**afvafwafxafyafzagaagbagcagdage** agfaggaghagiagjagkaglagmagnago**agpagqagragsagtaguagvagwagxagy** agzahaahbahcahdaheahfahgahhahi**ahjahkahlahmahnahoahpahqahrahs**$'); you get a query plan like this (the long regexp string edited out): Aggregate (cost=228148.02..228148.03 rows=1 width=0) (actual time=131.100..131 .101 rows=1 loops=1) - Bitmap Heap Scan on azjunk4 (cost=228144.01..228148.02 rows=1 width=0) ( actual time=131.096..131.096 rows=0 loops=1) Recheck Cond: (txt ~ ridiculously long regexp) Rows Removed by Index Recheck: 1 - Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..228144 .01 rows=1 width=0) (actual time=82.914..82.914 rows=1 loops=1) Index Cond: (txt ~ ridiculously long regexp) Total runtime: 131.230 ms (7 rows) That ridiculously long string exceeds the number of states (I think, could be number of paths or arcs too), and the algorithm gives up, resorting to scanning the whole index as can be seen by the Rows Removed by Index Recheck line. However, it's easy to see that any matching string must contain *any* of the possible trigrams the algorithm extracts. If it could safely return just a few of them, say aab and abz, and discard the rest, that would already be much better than a full index scan. Would it be safe to simply stop short the depth-first search on overflow, and proceed with the graph that was constructed up to that point? For depth-first it's not. But your proposal naturally makes sense. I've changed it to breadth-first search. And then it's safe to mark all unprocessed states as final when overflow. It means that we assume every unprocessed branch to immediately finish with matching (this can give us more false positives but no false negatives). For overflow of matrix collection, it's safe to do just OR between all the trigrams. New version of patch is attached. -- With best regards, Alexander Korotkov. trgm-regexp-0.7.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Hi! On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote: On Mon, November 26, 2012 20:49, Alexander Korotkov wrote: trgm-regexp-0.6.patch.gz I ran the simple-minded tests against generated data (similar to the ones I did in January 2012). The problems of that older version seem pretty much all removed. (although I didn't do much work on it -- just reran these tests). Thanks a lot for testing! Could you repeat for 0.7 version of patch which has new overflow handling? -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Nov 30, 2012 at 3:20 PM, Alexander Korotkov aekorot...@gmail.comwrote: For depth-first it's not. Oh, I didn't explained it. In order to stop graph processing we need to be sure that we put all outgoing arcs from state or assume that state to be final. In DFS we can be in the final part of graph producing but still didn't add some arc (with new trigram) from initial state directly to the final state. It obviously leads to false negatives. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Mon, November 26, 2012 20:49, Alexander Korotkov wrote: trgm-regexp-0.6.patch.gz I ran the simple-minded tests against generated data (similar to the ones I did in January 2012). The problems of that older version seem pretty much all removed. (although I didn't do much work on it -- just reran these tests). I used two 2 instances, 'HEAD' and 'trgm_regex', which were both compiled with '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' Tables used: rowcount size tablesize index (trgm) azjunk4 10^4 rows 1,171,456 | 9,781,248 azjunk5 10^5 rows 11,706,368 | 65,093,632 azjunk6 10^6 rows117,030,912 | 726,310,912 azjunk7 10^7 rows 1,170,292,736 | 4,976,189,440 (See my previous emails for a generating script) Tables contain random generated text: table azjunk7 limit 5; txt -- i kzzhv ssaa zv x xlepzxsgbdkxev v wn dmvqkuwb qxkyvgab gpidaosaqbewqimmai jxj bvwn zbevtzyhibbn hoctxurutn pvlatjjyxf f runa owpltbcunrbq ux peoook rxwoscbytz bbjlbbhhkivjivklgbh tvapzogh rj ky ahvgkvvlfudotvqapznludohdoyqrp kvothyclbckbxu hvic gomewbp izsjifqggyqgzcghdat lb kud ltfqaxqxjjom qkw wqggikgvph yi sftmbjt edbjfl vtcasudjpgfgjaf swooxygsse flnqd pxzsdmesqhqbzgirswysote muakq agk p w uq (5 rows) with index on column 'txt': create index az7_idx on azjunk7 using gin (txt gin_trgm_ops); Queries were of the form: explain analyze select txt from azjunkXX where txt ~ '$REGEX'; The main problem with the January version was that it chose to use the trgm index even when it could take a long time (hours). This has been resolved as far as I can see, and the results are now very attractive. (There does seem to be a very slight regression on the seqscan, but it's so small that I'm not yet sure it's not noise) Hardware: AMD FX-8120 with Linux 2.6.32-279.14.1.el6.x86_64 x86_64 GNU/Linux PostgreSQL 9.3devel-trgm_regex-20121127_2353-e78d288c895bd296e3cb1ca29c7fe2431eef3fcd on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.7.2, 64-bit port instancetableregex rows method expl.analyze timing 6543 HEADazjunk4 x[ae]q 46 Seq Scan 12.962 ms 6554 trgm_regex azjunk4 x[ae]q 46 Bitmap Heap Scan 0.800 ms 6543 HEADazjunk4 x[ae]{1}q 46 Seq Scan 12.487 ms 6554 trgm_regex azjunk4 x[ae]{1}q 46 Bitmap Heap Scan 0.209 ms 6543 HEADazjunk4 x[ae]{1,1}q 46 Seq Scan 12.266 ms 6554 trgm_regex azjunk4 x[ae]{1,1}q 46 Bitmap Heap Scan 0.210 ms 6543 HEADazjunk4 x[ae]{,2}q 0 Seq Scan 14.322 ms 6554 trgm_regex azjunk4 x[ae]{,2}q 0 Bitmap Heap Scan 0.610 ms 6543 HEADazjunk4 x[ae]{,10}q 0 Seq Scan 20.503 ms 6554 trgm_regex azjunk4 x[ae]{,10}q 0 Bitmap Heap Scan 0.511 ms 6543 HEADazjunk4 x[ae]{1,2}q 49 Seq Scan 13.060 ms 6554 trgm_regex azjunk4 x[ae]{1,2}q 49 Bitmap Heap Scan 0.429 ms 6543 HEADazjunk4 x[aei]q 81 Seq Scan 12.487 ms 6554 trgm_regex azjunk4 x[aei]q 81 Bitmap Heap Scan 0.367 ms 6543 HEADazjunk4 x[aei]{1}q 81 Seq Scan 12.132 ms 6554 trgm_regex azjunk4 x[aei]{1}q 81 Bitmap Heap Scan 0.336 ms 6543 HEADazjunk4 x[aei]{1,1}q81 Seq Scan 12.168 ms 6554 trgm_regex azjunk4 x[aei]{1,1}q81 Bitmap Heap Scan 0.319 ms 6543 HEADazjunk4 x[aei]{,2}q 0 Seq Scan 14.586 ms 6554 trgm_regex azjunk4 x[aei]{,2}q 0 Bitmap Heap Scan 0.621 ms 6543 HEADazjunk4 x[aei]{,10}q 0 Seq Scan 20.134 ms 6554 trgm_regex azjunk4 x[aei]{,10}q 0 Bitmap Heap Scan 0.552 ms 6543 HEADazjunk4 x[aei]{1,2}q89 Seq Scan 12.553 ms 6554 trgm_regex azjunk4 x[aei]{1,2}q89 Bitmap Heap Scan 0.916 ms 6543 HEADazjunk4 x[aei]{1,3}q89 Seq Scan 13.055 ms 6554 trgm_regex azjunk4 x[aei]{1,3}q89 Seq Scan 13.064 ms 6543 HEADazjunk4 x[aei]q 81 Seq Scan 11.856 ms 6554 trgm_regex azjunk4 x[aei]q 81 Bitmap Heap Scan 0.398 ms 6543 HEADazjunk4 x[aei]{1}q 81 Seq Scan 11.951 ms 6554 trgm_regex azjunk4 x[aei]{1}q 81 Bitmap Heap Scan 0.369 ms 6543 HEADazjunk4 x[aei]{1,1}q81 Seq Scan 12.750 ms 6554 trgm_regex azjunk4
Re: [HACKERS] WIP: index support for regexp search
One thing that bothers me with this algoritm is that the overflow mechanism is all-or-nothing. In many cases, even when there is a huge number of states in the diagram, you could still extract at least a few trigrams that must be present in any matching string, with little effort. At least, it seems like that to a human :-). For example, consider this: explain analyze select count(*) from azjunk4 where txt ~ ('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$'); you get a query plan like this (the long regexp string edited out): Aggregate (cost=228148.02..228148.03 rows=1 width=0) (actual time=131.100..131 .101 rows=1 loops=1) - Bitmap Heap Scan on azjunk4 (cost=228144.01..228148.02 rows=1 width=0) ( actual time=131.096..131.096 rows=0 loops=1) Recheck Cond: (txt ~ ridiculously long regexp) Rows Removed by Index Recheck: 1 - Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..228144 .01 rows=1 width=0) (actual time=82.914..82.914 rows=1 loops=1) Index Cond: (txt ~ ridiculously long regexp) Total runtime: 131.230 ms (7 rows) That ridiculously long string exceeds the number of states (I think, could be number of paths or arcs too), and the algorithm gives up, resorting to scanning the whole index as can be seen by the Rows Removed by Index Recheck line. However, it's easy to see that any matching string must contain *any* of the possible trigrams the algorithm extracts. If it could safely return just a few of them, say aab and abz, and discard the rest, that would already be much better than a full index scan. Would it be safe to simply stop short the depth-first search on overflow, and proceed with the graph that was constructed up to that point? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On 25.11.2012 22:55, Alexander Korotkov wrote: On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangashlinnakan...@vmware.com wrote: Glad to see this patch hasn't been totally forgotten. Being able to use indexes for regular expressions would be really cool! Back in January, I asked for some high-level description of how the algorithm works (http://archives.postgresql.** org/message-id/4F187D5C.30701@**enterprisedb.comhttp://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com). That's still sorely needed. Googling around, I found the slides for your presentation on this from PGConf.EU - it would be great to have the information from that presentation included in the patch. New version of patch is attached. The changes are following: 1) A big comment with high-level description of what is going on. 2) Regression tests. 3) Documetation update. 4) Some more refactoring. Great, that top-level comment helped tremendously! I feel enlightened. I fixed some spelling, formatting etc. trivial stuff while reading through the patch, see attached. Below is some feedback on the details: * I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an error, without propagating it further or rolling back the whole (sub)transation. It might work in this case, as you're only suppressing errors with the special sqlcode that are used in the same file, but it nevertheless feels naughty. I believe none of the limits that are being checked are strict; it's OK to exceed the limits somewhat, as long as you terminate the processing in a reasonable time, in case of pathological input. I'd suggest putting an explicit check for the limits somewhere, and not rely on ereport(). Something like this, in the code that recurses: if (trgmCNFA-arcsCount MAX_RESULT_ARCS || hash_get_num_entries(trgmCNFA-states) MAX_RESULT_STATES) { trgmCNFA-overflowed = true; return; } And then check for the overflowed flag at the top level. * This part of the high-level comment was not clear to me: * States of the graph produced in the first stage are marked with keys. Key is a pair * of a prefix and a state of the original automaton. Prefix is a last * characters. So, knowing the prefix is enough to know what is a trigram when we read some new * character. However, we can know single character of prefix or don't know any * characters of it. Each state of resulting graph have an enter key (with that * key we've entered this state) and a set of keys which are reachable without * reading any predictable trigram. The algorithm of processing each state * of resulting graph are so: * 1) Add all keys which achievable without reading of any predictable trigram. * 2) Add outgoing arcs labeled with trigrams. * Step 2 leads to creation of new states and recursively processing them. So, * we use depth-first algorithm. I didn't understand that. Can you elaborate? It might help to work through an example, with some ascii art depicting the graph. * It would be nice to add some comments to TrgmCNFA struct, explaining which fields are valid at which stages. For example, it seems that 'trgms' array is calculated only after building the CNFA, by getTrgmVector() function, while arcsCount is updated on the fly, while recursing in the getState() function. * What is the representation used for the path matrix? Needs a comment. * What do the getColorinfo() and scanColorMap() functions do? What exactly does a color represent? What's the tradeoff in choosing MAX_COLOR_CHARS? - Heikki diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile index 64fd69f..8033733 100644 --- a/contrib/pg_trgm/Makefile +++ b/contrib/pg_trgm/Makefile @@ -1,7 +1,7 @@ # contrib/pg_trgm/Makefile MODULE_big = pg_trgm -OBJS = trgm_op.o trgm_gist.o trgm_gin.o +OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o EXTENSION = pg_trgm DATA = pg_trgm--1.0.sql pg_trgm--unpackaged--1.0.sql diff --git a/contrib/pg_trgm/expected/pg_trgm.out b/contrib/pg_trgm/expected/pg_trgm.out index 81d0ca8..ee0131f 100644 --- a/contrib/pg_trgm/expected/pg_trgm.out +++ b/contrib/pg_trgm/expected/pg_trgm.out @@ -54,7 +54,7 @@ select similarity('wow',' WOW '); (1 row) CREATE TABLE test_trgm(t text); -\copy test_trgm from 'data/trgm.data +\copy test_trgm from 'data/trgm.data' select t,similarity(t,'qwertyu0988') as sml from test_trgm where t % 'qwertyu0988' order by sml desc, t; t | sml -+-- @@ -3515,6 +3515,47 @@ select * from test2 where t ilike 'qua%'; quark (1 row) +select * from test2 where t ~ '[abc]{3}'; + t + + abcdef +(1 row) + +select * from test2 where t ~ 'a[bc]+d'; + t + + abcdef +(1 row) + +select * from test2 where t ~* 'DEF'; + t + + abcdef +(1 row) + +select * from test2 where t ~ 'dEf'; + t +--- +(0 rows) + +select * from test2 where t ~* '^q'; + t +--- + quark +(1 row) + +select * from test2 where
Re: [HACKERS] WIP: index support for regexp search
On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Great, that top-level comment helped tremendously! I feel enlightened. I fixed some spelling, formatting etc. trivial stuff while reading through the patch, see attached. Below is some feedback on the details: * I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an error, without propagating it further or rolling back the whole (sub)transation. It might work in this case, as you're only suppressing errors with the special sqlcode that are used in the same file, but it nevertheless feels naughty. I believe none of the limits that are being checked are strict; it's OK to exceed the limits somewhat, as long as you terminate the processing in a reasonable time, in case of pathological input. I'd suggest putting an explicit check for the limits somewhere, and not rely on ereport(). Something like this, in the code that recurses: if (trgmCNFA-arcsCount MAX_RESULT_ARCS || hash_get_num_entries(trgmCNFA-**states) MAX_RESULT_STATES) { trgmCNFA-overflowed = true; return; } PG_TRY/CATCH trick is replaced with some number of if/return. Code becomes a bit more complicated, but your notes does matter. And then check for the overflowed flag at the top level. * This part of the high-level comment was not clear to me: * States of the graph produced in the first stage are marked with keys. Key is a pair * of a prefix and a state of the original automaton. Prefix is a last * characters. So, knowing the prefix is enough to know what is a trigram when we read some new * character. However, we can know single character of prefix or don't know any * characters of it. Each state of resulting graph have an enter key (with that * key we've entered this state) and a set of keys which are reachable without * reading any predictable trigram. The algorithm of processing each state * of resulting graph are so: * 1) Add all keys which achievable without reading of any predictable trigram. * 2) Add outgoing arcs labeled with trigrams. * Step 2 leads to creation of new states and recursively processing them. So, * we use depth-first algorithm. I didn't understand that. Can you elaborate? It might help to work through an example, with some ascii art depicting the graph. This comment is extended with example. * It would be nice to add some comments to TrgmCNFA struct, explaining which fields are valid at which stages. For example, it seems that 'trgms' array is calculated only after building the CNFA, by getTrgmVector() function, while arcsCount is updated on the fly, while recursing in the getState() function. TrgmCNFA comment is extended with this. * What is the representation used for the path matrix? Needs a comment. Comments are added to PackedTrgmPaths and TrgmStatePath. * What do the getColorinfo() getColorInfo comment now references to ColorInfo struct which have comment. and scanColorMap() functions do? scanColorMap comment now have description of colormap structure. What exactly does a color represent? This is added to the top comment. What's the tradeoff in choosing MAX_COLOR_CHARS? Comment is added to the macro. -- With best regards, Alexander Korotkov. trgm-regexp-0.6.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Glad to see this patch hasn't been totally forgotten. Being able to use indexes for regular expressions would be really cool! Back in January, I asked for some high-level description of how the algorithm works (http://archives.postgresql.** org/message-id/4F187D5C.30701@**enterprisedb.comhttp://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com). That's still sorely needed. Googling around, I found the slides for your presentation on this from PGConf.EU - it would be great to have the information from that presentation included in the patch. New version of patch is attached. The changes are following: 1) A big comment with high-level description of what is going on. 2) Regression tests. 3) Documetation update. 4) Some more refactoring. -- With best regards, Alexander Korotkov. trgm-regexp-0.5.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Hi! On Wed, Nov 21, 2012 at 12:51 AM, Pavel Stehule pavel.steh...@gmail.comwrote: do you plan to support GiST? At first, I would note that pg_trgm GiST opclass is quite ridiculous for support regex search (and, actually for LIKE/ILIKE search which is already implemented too). Because in GiST opclass we store set of trigrams in leaf pages. In was designed for trigram similarity search and have sense for it because of elimination of trigram set computation. But for regex or LIKE/ILIKE search this representation is both lossy and bigger than just original string. Probably we could think about another opclass for GiST focusing on regex and LIKE/ILIKE search? However, amyway I can create additional patch for current GiST opclass. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
Glad to see this patch hasn't been totally forgotten. Being able to use indexes for regular expressions would be really cool! Back in January, I asked for some high-level description of how the algorithm works (http://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com). That's still sorely needed. Googling around, I found the slides for your presentation on this from PGConf.EU - it would be great to have the information from that presentation included in the patch. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote: 2) It's common to use upper-case names for macros, but trgm.h defines macro iswordchr - I see it's moved from trgm_op.c but maybe we could make it a bit more correct? 3) I see there are two '#ifdef KEEPONLYALNUM blocks right next to each other in trgm.h - maybe it'd be a good idea to join them? 4) The two new method prototypes at the end of trgm.h use different indendation than the rest (spaces only instead of tabs). These issues are fixed in attached patch. Additionally 3 new macros are introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are limiting resources usage during regex processing. -- With best regards, Alexander Korotkov. trgm-regexp-0.4.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
hello do you plan to support GiST? Regards Pavel 2012/11/20 Alexander Korotkov aekorot...@gmail.com: On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote: 2) It's common to use upper-case names for macros, but trgm.h defines macro iswordchr - I see it's moved from trgm_op.c but maybe we could make it a bit more correct? 3) I see there are two '#ifdef KEEPONLYALNUM blocks right next to each other in trgm.h - maybe it'd be a good idea to join them? 4) The two new method prototypes at the end of trgm.h use different indendation than the rest (spaces only instead of tabs). These issues are fixed in attached patch. Additionally 3 new macros are introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are limiting resources usage during regex processing. -- With best regards, Alexander Korotkov. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Hi! New version of patch is attached. Changes are following: 1) Right way to convert from pg_wchar to multibyte. 2) Optimization of producing CFNA-like graph on trigrams (produce smaller, but equivalent, graphs in less time). 3) Comments and refactoring. -- With best regards, Alexander Korotkov. trgm-regexp-0.2.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Mon, November 19, 2012 22:58, Alexander Korotkov wrote: New version of patch is attached. Hi Alexander, I get some compile-errors: (Centos 6.3, Linux 2.6.32-279.14.1.el6.x86_64 GNU/Linux, gcc (GCC) 4.7.2) make contrib trgm_regexp.c:73:2: error: unknown type name TrgmStateKey make[1]: *** [trgm_regexp.o] Error 1 make: *** [all-pg_trgm-recurse] Error 2 trgm_regexp.c:73:2: error: unknown type name TrgmStateKey make[1]: *** [trgm_regexp.o] Error 1 make: *** [install-pg_trgm-recurse] Error 2 Did I forget something? Thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On 19.11.2012 22:58, Alexander Korotkov wrote: Hi! New version of patch is attached. Changes are following: 1) Right way to convert from pg_wchar to multibyte. 2) Optimization of producing CFNA-like graph on trigrams (produce smaller, but equivalent, graphs in less time). 3) Comments and refactoring. Hi, thanks for the updated message-id. I've done the initial review: 1) Patch applies fine to the master. 2) It's common to use upper-case names for macros, but trgm.h defines macro iswordchr - I see it's moved from trgm_op.c but maybe we could make it a bit more correct? 3) I see there are two '#ifdef KEEPONLYALNUM blocks right next to each other in trgm.h - maybe it'd be a good idea to join them? 4) The two new method prototypes at the end of trgm.h use different indendation than the rest (spaces only instead of tabs). 5) There are no regression tests / updated docs (yet). 6) It does not compile - I do get a bunch of errors like this trgm_regexp.c:73:2: error: expected specifier-qualifier-list before ‘TrgmStateKey’ trgm_regexp.c: In function ‘addKeys’: trgm_regexp.c:291:24: error: ‘TrgmState’ has no member named ‘keys’ trgm_regexp.c:304:10: error: ‘TrgmState’ has no member named ‘keys’ ... It seems this is cause by the order of typedefs in trgm_regexp.c, namely TrgmState referencing TrgmStateKey before it's defined. Moving the TrgmStateKey before TrgmState fixed the issue (I'm using gcc-4.5.4 but I think it's not compiler-dependent.) 7) Once fixed, it seems to work CREATE EXTENSION pg_trgm ; CREATE TABLE TEST (val TEXT); INSERT INTO test SELECT md5(i::text) FROM generate_series(1,100) s(i); CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops); ANALYZE test; EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*'; QUERY PLAN - Bitmap Heap Scan on test (cost=16.77..385.16 rows=100 width=33) Recheck Cond: (val ~ '.*qqq.*'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..16.75 rows=100 width=0) Index Cond: (val ~ '.*qqq.*'::text) (4 rows) but I do get a bunch of NOTICE messages with debugging info (no matter if the GIN index is used or not, so it's somewhere in the common regexp code). But I guess that's due to WIP status. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Some quick comments. On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote: 6) It does not compile - I do get a bunch of errors like this Fixed. 7) Once fixed, it seems to work CREATE EXTENSION pg_trgm ; CREATE TABLE TEST (val TEXT); INSERT INTO test SELECT md5(i::text) FROM generate_series(1,100) s(i); CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops); ANALYZE test; EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*'; QUERY PLAN - Bitmap Heap Scan on test (cost=16.77..385.16 rows=100 width=33) Recheck Cond: (val ~ '.*qqq.*'::text) - Bitmap Index Scan on trgm_idx (cost=0.00..16.75 rows=100 width=0) Index Cond: (val ~ '.*qqq.*'::text) (4 rows) but I do get a bunch of NOTICE messages with debugging info (no matter if the GIN index is used or not, so it's somewhere in the common regexp code). But I guess that's due to WIP status. It's due to TRGM_REGEXP_DEBUG macro. I disabled it by default. But I think pieces of code hidden by that macro could be useful for debug even after WIP status. -- With best regards, Alexander Korotkov. trgm-regexp-0.3.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 01:33, Erik Rijkers e...@xs4all.nl wrote: Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is normally necessary for perfomance reasons, but it would be useful te be able to compile a test version that allows it. I believe being interruptible is a requirement for the patch to be accepted. CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops. Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
Hi! Thank you for your feedback! On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers e...@xs4all.nl wrote: The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a good enough job when guessing where to use the index and where fall back to Seq Scan. This can lead to (also spectacular) slow-downs, compared to Seq Scan. Could you give some examples of regexes where index scan becomes slower than seq scan? I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that true? I can understand you want that value to be low to limit the above risk, but now it reduces the usability of the feature a bit: one has to split up larger char-classes into several smaller ones to make a statement use the index: i.e.: Yes, MAX_COLOR_CHARS is number of maximum character in automata color when that color is divided to a separated characters. And it's likely there could be better solution than just have this hard limit. Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is normally necessary for perfomance reasons, but it would be useful te be able to compile a test version that allows it. I don't know how hard that would be. I seems that Ctrl-C was impossible because procedure of trigrams exctraction becomes so long while it is not breakable. It's not difficult to make this procedure breakable, but actually it just shouldn't take so long. There is also a minor bug, I think, when running with 'set enable_seqscan=off' in combination with a too-large regex: Thanks for pointing. Will be fixed. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 8:45 PM, Marti Raudsepp ma...@juffo.org wrote: On Fri, Jan 20, 2012 at 01:33, Erik Rijkers e...@xs4all.nl wrote: Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is normally necessary for perfomance reasons, but it would be useful te be able to compile a test version that allows it. I believe being interruptible is a requirement for the patch to be accepted. CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops. Sure. It's easy to fix. But it seems that in this case gin extract_query method becomes slow (because index scan itself is breakable). So, it just shouldn't work so long. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 12:54 AM, Alexander Korotkov aekorot...@gmail.comwrote: On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Apart from that, the multibyte issue seems like the big one. Any way around that? Conversion of pg_wchar to multibyte character is the only way I found to avoid serious hacking of existing regexp code. Do you think additional function in pg_wchar_tbl which converts pg_wchar back to multibyte character is possible solution? Do you have any notes on it? I could make the patch which adds such function into core. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Sat, January 21, 2012 06:26, Alexander Korotkov wrote: Hi! Thank you for your feedback! On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers e...@xs4all.nl wrote: The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a good enough job when guessing where to use the index and where fall back to Seq Scan. This can lead to (also spectacular) slow-downs, compared to Seq Scan. Could you give some examples of regexes where index scan becomes slower than seq scan? x[aeio]+q takes many minutes, uninterruptible. It's now running for almost 30 minutes, I'm sure it will come back eventually, but I think I'll kill it later on; I suppose you get the point ;-) Even with {n,m} quantifiers it's easy to hit slowdowns: (table azjunk6 is 112 MB, the index 693 MB.) (MAX_COLOR_CHARS=4 = your original patch) instance tableregex plan time HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3566.088 ms HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3540.606 ms HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3495.034 ms HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3510.403 ms trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3724.131 ms trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3844.999 ms trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3835.190 ms trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3724.016 ms HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3617.997 ms HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3644.215 ms HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3636.976 ms HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3625.493 ms trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7885.247 ms trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 8799.082 ms trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7754.152 ms trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7721.332 ms This is with your patch as is; in instances compiled with higher MAX_COLOR_CHARS (I did 6 and 9), it is of course even easier to dream up a slow regex... Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On 22.11.2011 21:38, Alexander Korotkov wrote: WIP patch with index support for regexp search for pg_trgm contrib is attached. In spite of techniques which extracts continuous text parts from regexp, this patch presents technique of automatum transformation. That allows more comprehensive trigrams extraction. Nice! Current version of patch have some limitations: 1) Algorithm of logical expression extraction on trigrams have high computational complexity. So, it can become really slow on regexp with many branches. Probably, improvements of this algorithm is possible. 2) Surely, no perfomance benefit if no trigrams can be extracted from regexp. It's inevitably. 3) Currently, only GIN index is supported. There are no serious problems, GiST code for it just not written yet. 4) It appear to be some kind of problem to extract multibyte encoded character from pg_wchar. I've posted question about it here: http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php While I've hardcoded some dirty solution. So PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not supported yet. This is pretty far from being in committable state, so I'm going to mark this as returned with feedback in the commitfest app. The feedback: The code badly needs comments. There is no explanation of how the trigram extraction code in trgm_regexp.c works. Guessing from the variable names, it seems to be some sort of a coloring algorithm that works on a graph, but that all needs to be explained. Can this algorithm be found somewhere in literature, perhaps? A link to a paper would be nice. Apart from that, the multibyte issue seems like the big one. Any way around that? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: The code badly needs comments. There is no explanation of how the trigram extraction code in trgm_regexp.c works. Sure. I hoped to find a time for comments before commitfest starts. Unfortunately I didn't, sorry. Guessing from the variable names, it seems to be some sort of a coloring algorithm that works on a graph, but that all needs to be explained. Can this algorithm be found somewhere in literature, perhaps? A link to a paper would be nice. I hope it's truly novel. At least application to regular expressions. I'm going to write a paper about it. Apart from that, the multibyte issue seems like the big one. Any way around that? Conversion of pg_wchar to multibyte character is the only way I found to avoid serious hacking of existing regexp code. Do you think additional function in pg_wchar_tbl which converts pg_wchar back to multibyte character is possible solution? -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
I also have a question about pg_wchar. /* *--- * encoding info table * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h) *--- */ pg_wchar_tbl pg_wchar_table[] = { {pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1}, /* PG_SQL_ASCII */ {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JP */ {pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2}, /* PG_EUC_CN */ {pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3}, /* PG_EUC_KR */ {pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4}, /* PG_EUC_TW */ {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */ {pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /* PG_UTF8 */ {pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4}, /* PG_MULE_INTERNAL */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN1 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN2 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN3 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN4 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN5 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN6 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN7 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN8 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN9 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN10 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1256 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1258 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN866 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN874 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8R */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1251 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1252 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-5 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-6 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-7 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-8 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1250 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1253 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1254 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1255 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1257 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8U */ {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */ {0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */ {0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */ {0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */ {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /* PG_GB18030 */ {0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */ {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /* PG_SHIFT_JIS_2004 */ }; What does last 7 zeros in the first column means? No conversion to pg_wchar is possible from these encodings? -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 1:07 AM, Alexander Korotkov aekorot...@gmail.comwrote: What does last 7 zeros in the first column means? No conversion to pg_wchar is possible from these encodings? Uh, I see. These encodings is not supported as server encodings. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote: On 22.11.2011 21:38, Alexander Korotkov wrote: WIP patch with index support for regexp search for pg_trgm contrib is attached. In spite of techniques which extracts continuous text parts from regexp, this patch presents technique of automatum transformation. That allows more comprehensive trigrams extraction. Nice! Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found. The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a good enough job when guessing where to use the index and where fall back to Seq Scan. This can lead to (also spectacular) slow-downs, compared to Seq Scan. I used the following to generate 3 test tables with lines of 80 random chars (just in case it's handy for others to use): $ cat create_data.sh #!/bin/sh for power in 4 5 6 do table=azjunk${power} index=${table}_trgmrgx_txt_01_idx echo -- generating table $table with index $index; time perl -E' sub ss{ join,@_[ map{rand @_} 1 .. shift ] }; say(ss(80,a..g, ,h..m, ,n..s, ,t..z)) for 1 .. 1e'${power}; \ | psql -aqXc drop table if exists $table; create table $table(txt text); copy $table from stdin; -- set session maintenance_work_mem='20GB'; create index $index on $table using gin(txt gin_trgm_ops); analyze $table;; done \dt+ public.azjunk* List of relations Schema | Name | Type | Owner | Size | Description +-+---+---+-+- public | azjunk4 | table | breinbaas | 1152 kB | public | azjunk5 | table | breinbaas | 11 MB | public | azjunk6 | table | breinbaas | 112 MB | (3 rows) I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that true? I can understand you want that value to be low to limit the above risk, but now it reduces the usability of the feature a bit: one has to split up larger char-classes into several smaller ones to make a statement use the index: i.e.: txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n' instead of txt ~ 'f[aeiouy]n' I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they used the index for larger classes such as the above, but of course also got into problems easier when quantifiers are added (*, +, {n,m}). A better calculation to decide index-use seems necessary, and ideally one that allows for a larger MAX_COLOR_CHARS than 4. Especially quantifiers could perhaps be inspected wrt that decision. IMHO, the functionality would still be very useful when only very simple regexen were considered. Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is normally necessary for perfomance reasons, but it would be useful te be able to compile a test version that allows it. I don't know how hard that would be. There is also a minor bug, I think, when running with 'set enable_seqscan=off' in combination with a too-large regex: $ cat fail.sql: set enable_bitmapscan=on; set enable_seqscan=on; explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- OK set enable_bitmapscan=on; set enable_seqscan=off; explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- crashes $ psql -f fail.sql QUERY PLAN -- Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291 rows=131 loops=1) Recheck Cond: (txt ~ 'f[aeio]n'::text) - Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual time=0.880..0.880 rows=131 loops=1) Index Cond: (txt ~ 'f[aeio]n'::text) Total runtime: 5.700 ms (5 rows) QUERY PLAN --- Seq Scan on azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164 loops=1) Filter: (txt ~ 'f[aeiou]n'::text) Rows Removed by Filter: 9836 Total runtime: 36.112 ms (4 rows) QUERY PLAN -- Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927 rows=131 loops=1) Recheck Cond: (txt ~ 'f[aeio]n'::text) - Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual time=0.316..0.316 rows=131 loops=1) Index Cond: (txt ~ 'f[aeio]n'::text)
Re: [HACKERS] WIP: index support for regexp search
On Thu, Dec 1, 2011 at 12:29 AM, Robert Haas robertmh...@gmail.com wrote: Please add this patch here so it does not get lost in the shuffle: https://commitfest.postgresql.org/action/commitfest_view/open Done. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Tue, Nov 22, 2011 at 2:38 PM, Alexander Korotkov aekorot...@gmail.com wrote: WIP patch with index support for regexp search for pg_trgm contrib is attached. In spite of techniques which extracts continuous text parts from regexp, this patch presents technique of automatum transformation. That allows more comprehensive trigrams extraction. Please add this patch here so it does not get lost in the shuffle: https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers