subject:"Re\: \[HACKERS\] WIP\: index support for regexp search"

Re: [HACKERS] WIP: index support for regexp search

2013-04-15 Thread Alexander Korotkov

I found you committed GiST index implementation. That's cool.
I found an easy way to optimize it. We can also use trigramsMatchGraph for
signatures. Attached patch contains implementation.
Simple example in order to demonstrate it:

Before the patch:

test=# explain (analyze, buffers) select * from words where s ~ '[abc]def';
QUERY PLAN


--
 Bitmap Heap Scan on words  (cost=4.36..40.24 rows=10 width=9) (actual
time=17.189..17.193 rows=3 loops=1)
   Recheck Cond: (s ~ '[abc]def'::text)
   Buffers: shared hit=858
   -  Bitmap Index Scan on words_trgm_idx  (cost=0.00..4.36 rows=10
width=0) (actual time=17.172..17.172 rows=3 loops=1)
 Index Cond: (s ~ '[abc]def'::text)
 Buffers: shared hit=*857*
 Total runtime: 17.224 ms
(7 rows)

After the patch:

test=# explain (analyze, buffers) select * from words where s ~ '[abc]def';
QUERY PLAN


--
 Bitmap Heap Scan on words  (cost=4.36..40.24 rows=10 width=9) (actual
time=13.718..13.721 rows=3 loops=1)
   Recheck Cond: (s ~ '[abc]def'::text)
   Buffers: shared hit=498
   -  Bitmap Index Scan on words_trgm_idx  (cost=0.00..4.36 rows=10
width=0) (actual time=13.701..13.701 rows=3 loops=1)
 Index Cond: (s ~ '[abc]def'::text)
 Buffers: shared hit=*497*
 Total runtime: 13.786 ms
(7 rows)


--
With best regards,
Alexander Korotkov.


trgm-regexp-gist-optimize.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-15 Thread David Fetter

On Mon, Apr 15, 2013 at 05:53:41PM +0400, Alexander Korotkov wrote:
 I found you committed GiST index implementation. That's cool.
 I found an easy way to optimize it. We can also use trigramsMatchGraph for
 signatures. Attached patch contains implementation.
 Simple example in order to demonstrate it:
 
 Before the patch:
  Buffers: shared hit=*857*
 After the patch:
  Buffers: shared hit=*497*

Neato!

Inside the patch, s/monotonous/monotone/, I think.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-15 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 I found you committed GiST index implementation. That's cool.
 I found an easy way to optimize it. We can also use trigramsMatchGraph for
 signatures. Attached patch contains implementation.

Good idea, committed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-09 Thread Alexander Korotkov

On Tue, Apr 9, 2013 at 9:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I spent the weekend hacking on this, making a number of bug fixes and a
  whole lot of cosmetic changes.  I think there are large parts of this
  that are in committable shape now, but I still find the actual graph
  transformation logic to be mostly unintelligible.  I think what's most
  obscure is the distinction between the arcs list and the keys list of
  each state in the expanded graph.  I get the impression that the
  general idea is for the arcs to represent exactly-known transitions
  while the keys represent imprecisely-known transitions ... but there
  seems to be at least some leakage between those categories.  Could
  you write down a specification for what's supposed to be happening
  there?

  Here is my try to specify it.

 Thanks.  I hacked on this some more and committed it.  I found a number
 of bugs along the way with respect to handling of word boundaries
 (partially-blank transition trigrams) and EOL-color ($) handling.
 I think it's all fixed now but it could definitely use some more
 study and testing.


Great, thanks! I also will do some more testing.

One issue that bothered me is that the regression tests really don't
 provide much visibility into what the code is doing.  Some of the bugs
 had to do with failing to generate expected trigrams, for instance
 col ~ 'foo bar' only generating trigram foo and not bar.  This still
 led to getting the right answer, so the error was invisible as far as the
 tests were concerned.  Is it worth thinking of a way to expose what the
 extract function did at SQL level, so we could test more carefully?


Yes, I also had similar idea. But, I think we need some relatively stable
representation of resulting graph in order to expose it. There could be a
lot of equivalent graphs. Some changes in implementation could lead to
change from one equivalent graph to another. It would be better to not
rewrite tests in this case. Ideally, we should expose some representation
which is the same for all equivalent graphs. However, it doesn't seem to be
realistic. But, I think we could at least make it stable to order sequence
of states and color trigrams. Another option I see is to expose just set of
trigrams. It doesn't have completeness of information, but it is quite
stable.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2013-04-08 Thread Alexander Korotkov

On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  [ trgm-regexp-0.15.patch.gz ]

 I spent the weekend hacking on this, making a number of bug fixes and a
 whole lot of cosmetic changes.  I think there are large parts of this
 that are in committable shape now, but I still find the actual graph
 transformation logic to be mostly unintelligible.  I think what's most
 obscure is the distinction between the arcs list and the keys list of
 each state in the expanded graph.  I get the impression that the
 general idea is for the arcs to represent exactly-known transitions
 while the keys represent imprecisely-known transitions ... but there
 seems to be at least some leakage between those categories.  Could
 you write down a specification for what's supposed to be happening
 there?


Here is my try to specify it.
At first some notions. I know, they are already in the comments, but just
in order to put it together.

Extended color - any color of source CNFA or one of two special values:
1) Unknown color - may represent any character either alphanumeric or
non-alphanumeric.
2) Blank color - may represent any non-alphanumeric character
Prefix is extended colors of last two characters read by CNFA.
Key is pair of CNFA state and prefix. So, key is a extended state which is
containing additional information which can influence further trigrams.

So, if you are in some key and traverse some CNFA arc then you moves info
another key. But there are two possible cases (or, sometimes, both of them):
1) Your move from one key into another necessary means read of some
trigram. Then you create new arc labeled with that trigram.
2) You move into another key, but you doesn't necessary read an useful
trigram. For example, arc of source CNFA is labeled by unextractable
color. Then you add new key into keys array. And outgoing arcs from this
key will also be processed similarly to source key. Therefore keys array
is a set of keys which are achievable from stateKey without reading of
useful trigram.
We could get rid of keys array and produce some empty arcs in the
second case. You can imagine that keys array is set of keys which are
achivable by empty arcs from stateKey. States connected with empty
arcs could be merged on the next stage.
However, the reason of having separated addKeys stage is optimization. In
addKey function more generals keys absorb less general ones. In many cases
resulting graph becomes much simplier because of this. For example, if your
regex is not prefix, then engine puts self-referencing arcs of all possible
colors to initial state. Straight-forward processing of this could produce
enormous output graph. I had similar situation in early version of patch
where keys didn't absorb earch other.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2013-04-08 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I spent the weekend hacking on this, making a number of bug fixes and a
 whole lot of cosmetic changes.  I think there are large parts of this
 that are in committable shape now, but I still find the actual graph
 transformation logic to be mostly unintelligible.  I think what's most
 obscure is the distinction between the arcs list and the keys list of
 each state in the expanded graph.  I get the impression that the
 general idea is for the arcs to represent exactly-known transitions
 while the keys represent imprecisely-known transitions ... but there
 seems to be at least some leakage between those categories.  Could
 you write down a specification for what's supposed to be happening
 there?

 Here is my try to specify it.

Thanks.  I hacked on this some more and committed it.  I found a number
of bugs along the way with respect to handling of word boundaries
(partially-blank transition trigrams) and EOL-color ($) handling.
I think it's all fixed now but it could definitely use some more
study and testing.

One issue that bothered me is that the regression tests really don't
provide much visibility into what the code is doing.  Some of the bugs
had to do with failing to generate expected trigrams, for instance
col ~ 'foo bar' only generating trigram foo and not bar.  This still
led to getting the right answer, so the error was invisible as far as the
tests were concerned.  Is it worth thinking of a way to expose what the
extract function did at SQL level, so we could test more carefully?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-07 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 [ trgm-regexp-0.15.patch.gz ]

I spent the weekend hacking on this, making a number of bug fixes and a
whole lot of cosmetic changes.  I think there are large parts of this
that are in committable shape now, but I still find the actual graph
transformation logic to be mostly unintelligible.  I think what's most
obscure is the distinction between the arcs list and the keys list of
each state in the expanded graph.  I get the impression that the
general idea is for the arcs to represent exactly-known transitions
while the keys represent imprecisely-known transitions ... but there
seems to be at least some leakage between those categories.  Could
you write down a specification for what's supposed to be happening
there?

regards, tom lane



trgm-regexp-0.16.patch.gz
Description: trgm-regexp-0.16.patch.gz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-03 Thread Erikjan Rijkers

On Tue, April 2, 2013 23:54, Alexander Korotkov wrote:

 [trgm-regexp-0.15.patch.gz]

Yes, it does look good now; Attached a list of measurements. Most of the 
searches that I put in
that test-program are now speeded up very much.

There still are a few regressions, for example:

HEAD  azjunk6  x[aeiou]{4,5}q  83  Seq Scan  1393.465 ms
trgm_regex15  azjunk6  x[aeiou]{4,5}q  83  Bitmap Heap Scan  1728.319 ms

HEAD  azjunk7  x[aeiou]{1,3}q  190031  Seq Scan 16819.555 ms
trgm_regex15  azjunk7  x[aeiou]{1,3}q  190031  Bitmap Heap Scan 21286.804 ms

Not exactly negligible, and ideally those regressions would be removed but with 
the huge
advantages for other cases I'd say it's worth it.

hth,

Erik Rijkers





re-head-13-15-20130403-0708.txt.bz2
Description: BZip2 compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-03 Thread Alexander Korotkov

On Wed, Apr 3, 2013 at 11:10 AM, Erikjan Rijkers e...@xs4all.nl wrote:

 On Tue, April 2, 2013 23:54, Alexander Korotkov wrote:

  [trgm-regexp-0.15.patch.gz]

 Yes, it does look good now; Attached a list of measurements. Most of the
 searches that I put in
 that test-program are now speeded up very much.

 There still are a few regressions, for example:

 HEAD  azjunk6  x[aeiou]{4,5}q  83  Seq Scan
  1393.465 ms
 trgm_regex15  azjunk6  x[aeiou]{4,5}q  83  Bitmap Heap Scan
  1728.319 ms

 HEAD  azjunk7  x[aeiou]{1,3}q  190031  Seq Scan
 16819.555 ms
 trgm_regex15  azjunk7  x[aeiou]{1,3}q  190031  Bitmap Heap Scan
 21286.804 ms

 Not exactly negligible, and ideally those regressions would be removed but
 with the huge
 advantages for other cases I'd say it's worth it.


Thank you for testing!
Exploring results more detail I found version 13 to be buggy. This version
is a dead end, we have quite different API now. Could you use v12 instead
of v13 in comparison, please?
Sometimes we have regression in comparison with head in two reasons:
1) We select index scan in both cases but with patch we spent more time for
analysis. It's inevitable disadvantage of any index. We can only take care
of analysis doesn't take too long. Current testing results don't show this
reason to be significant.
2) Sometimes we select index scan while sequential scan would be faster.
It's also inevitable disadvantage until we have a relevant statistics. We
now have similar situation, for example, with in-core geometrical search
and LIKE/ILIKE search in pg_trgm. However,  probably, situation could be
improved somehow even without such statistics. But I think we can do such
conclusion based on synthetical testing, because improvements for
synthetical cases could appear to be an worsening for real-life cases.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2013-04-02 Thread Erikjan Rijkers

On Mon, April 1, 2013 23:15, Alexander Korotkov wrote:

[trgm-regexp-0.14.patch.gz]

Hi Alexander,

Something went wrong in this version of the patch: many (most) queries that 
were earlier
spectacularly fast have become slow, often slower than a seqscan or only 
marginally faster. See
the attached numbers; it compares head(seqscan) with trgm-regex patch versions 
13 and 14.

I did not even complete the test-run because version 14 is so clearly inferior 
to 13 (and earlier,
as far as I can remember).

(let me know if you want the whole result, I can run it overnight)


Thanks,


Erik Rijkers


re-head-13-14-20130402-2219.txt.bz2
Description: BZip2 compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-02 Thread Alexander Korotkov

On Wed, Apr 3, 2013 at 12:36 AM, Erikjan Rijkers e...@xs4all.nl wrote:

 On Mon, April 1, 2013 23:15, Alexander Korotkov wrote:

 [trgm-regexp-0.14.patch.gz]

 Hi Alexander,


Hi Erik!


 Something went wrong in this version of the patch: many (most) queries
 that were earlier
 spectacularly fast have become slow, often slower than a seqscan or only
 marginally faster. See
 the attached numbers; it compares head(seqscan) with trgm-regex patch
 versions 13 and 14.

 I did not even complete the test-run because version 14 is so clearly
 inferior to 13 (and earlier,
 as far as I can remember).

 (let me know if you want the whole result, I can run it overnight)


Yes, there was bug in new color map scan implementation. It was fixed in
attached version. Could you re-run tests please?
Attached version of patch also addresses problem that we now using API it's
not correct to use special values of color number. It introduces additional
structure ExtendedColor which contain ExtendedColorClass field for handling
additional virtual colors introduced by analysis.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.15.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-04-01 Thread Alexander Korotkov

On Mon, Mar 25, 2013 at 2:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Similarly, pushing PG-specific declarations like RE_compile_and_cache()
  into regex/regex.h is completely not the right thing for preserving a
  clear library boundary (even positing that we want to expose that
  function outside adt/regexp.c, which I'd rather we didn't).
 
  Perhaps we could avoid these issues by defining a library API that
  provides accessors like these for the opaque regex_t struct:
 
  * get the number of states in the CNFA
 
  * get the numbers of the initial and final states
 
  * get the number of out-arcs for the N'th state
 
  * get the out-arcs for the N'th state into a caller-provided array
  (sized using the previous function), where each out-arc is represented
  by a color and an end-state
 
  * get the number of character codes represented by color C
 
  * get the wchar codes for color C into a caller-provided array
 
  (The reason for letting the caller allocate the result arrays is so we
  can use palloc for that; if we allocate it in backend/regex/ we must
  use malloc, which will greatly increase the risk of leakages.  Also,
  as far as the color API goes, the above lets the caller decide how
  many characters is too many to bother with.)

  I like the this idea. Seems like clear and not over-engineered API. I can
  implement it. Could you propose something particular to do with
  RE_compile_and_cache in this patch? With this API we still need a way to
  get regex_t or equivalent from string.

 Well, the brute force answer is for pg_trgm to not go through
 RE_compile_and_cache at all, but just call the regex library for itself.
 That would lose the ability to cache regex compilations, but I'm not
 sure we care.  The motivation for RE_compile_and_cache is mainly to
 prevent having to compile the regex again for *every row*, which is what
 would otherwise happen in a simple SELECT using a regex function.
 Unless I'm misunderstanding something, pg_trgm would only need to
 compile the regex once per indexscan, which is probably tolerable.


For me it makes sense.
Patch with proposed API implementation is attached.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.14.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-03-24 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 Now I have working implemetation of this API. Comments still need rework.
 Could you give me any feedback?

I looked at this a little bit, but it's not very far along at all
towards resolving my API worries.  The basic point that I'm concerned
about is that we would like to split off the regex library (ie,
backend/regex/) as a standalone project someday.  There are already
some problems to be resolved to make that possible, and I don't want
this patch to introduce new ones.  From that standpoint, the proposed
regextract.c file is a disaster, because it depends on a boatload of
Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
to say nothing of palloc).  We can't consider that to be on the regex
side of the fence.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
into regex/regex.h is completely not the right thing for preserving a
clear library boundary (even positing that we want to expose that
function outside adt/regexp.c, which I'd rather we didn't).

Given that you've already got a notion of callbacks provided by
contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
into those callbacks, so that the heavy-duty data structures like the
hash table live over there and the API exposed by backend/regex/ is at
a much simpler level than what you have here.  But right now I don't
see any usable library API here.

Perhaps we could avoid these issues by defining a library API that
provides accessors like these for the opaque regex_t struct:

* get the number of states in the CNFA

* get the numbers of the initial and final states

* get the number of out-arcs for the N'th state

* get the out-arcs for the N'th state into a caller-provided array
(sized using the previous function), where each out-arc is represented
by a color and an end-state

* get the number of character codes represented by color C

* get the wchar codes for color C into a caller-provided array

(The reason for letting the caller allocate the result arrays is so we
can use palloc for that; if we allocate it in backend/regex/ we must
use malloc, which will greatly increase the risk of leakages.  Also,
as far as the color API goes, the above lets the caller decide how
many characters is too many to bother with.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-03-24 Thread Alexander Korotkov

On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  Now I have working implemetation of this API. Comments still need rework.
  Could you give me any feedback?

 I looked at this a little bit, but it's not very far along at all
 towards resolving my API worries.  The basic point that I'm concerned
 about is that we would like to split off the regex library (ie,
 backend/regex/) as a standalone project someday.  There are already
 some problems to be resolved to make that possible, and I don't want
 this patch to introduce new ones.  From that standpoint, the proposed
 regextract.c file is a disaster, because it depends on a boatload of
 Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
 to say nothing of palloc).  We can't consider that to be on the regex
 side of the fence.


Now I can see your position about API much more clearly. Previously I
thought only about algorithmic side of things.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
 into regex/regex.h is completely not the right thing for preserving a
 clear library boundary (even positing that we want to expose that
 function outside adt/regexp.c, which I'd rather we didn't).

 Given that you've already got a notion of callbacks provided by
 contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
 into those callbacks, so that the heavy-duty data structures like the
 hash table live over there and the API exposed by backend/regex/ is at
 a much simpler level than what you have here.  But right now I don't
 see any usable library API here.

 Perhaps we could avoid these issues by defining a library API that
 provides accessors like these for the opaque regex_t struct:

 * get the number of states in the CNFA

 * get the numbers of the initial and final states

 * get the number of out-arcs for the N'th state

 * get the out-arcs for the N'th state into a caller-provided array
 (sized using the previous function), where each out-arc is represented
 by a color and an end-state

 * get the number of character codes represented by color C

 * get the wchar codes for color C into a caller-provided array

 (The reason for letting the caller allocate the result arrays is so we
 can use palloc for that; if we allocate it in backend/regex/ we must
 use malloc, which will greatly increase the risk of leakages.  Also,
 as far as the color API goes, the above lets the caller decide how
 many characters is too many to bother with.)


I like the this idea. Seems like clear and not over-engineered API. I can
implement it. Could you propose something particular to do with
RE_compile_and_cache in this patch? With this API we still need a way to
get regex_t or equivalent from string.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2013-03-24 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Similarly, pushing PG-specific declarations like RE_compile_and_cache()
 into regex/regex.h is completely not the right thing for preserving a
 clear library boundary (even positing that we want to expose that
 function outside adt/regexp.c, which I'd rather we didn't).
 
 Perhaps we could avoid these issues by defining a library API that
 provides accessors like these for the opaque regex_t struct:
 
 * get the number of states in the CNFA
 
 * get the numbers of the initial and final states
 
 * get the number of out-arcs for the N'th state
 
 * get the out-arcs for the N'th state into a caller-provided array
 (sized using the previous function), where each out-arc is represented
 by a color and an end-state
 
 * get the number of character codes represented by color C
 
 * get the wchar codes for color C into a caller-provided array
 
 (The reason for letting the caller allocate the result arrays is so we
 can use palloc for that; if we allocate it in backend/regex/ we must
 use malloc, which will greatly increase the risk of leakages.  Also,
 as far as the color API goes, the above lets the caller decide how
 many characters is too many to bother with.)

 I like the this idea. Seems like clear and not over-engineered API. I can
 implement it. Could you propose something particular to do with
 RE_compile_and_cache in this patch? With this API we still need a way to
 get regex_t or equivalent from string.

Well, the brute force answer is for pg_trgm to not go through
RE_compile_and_cache at all, but just call the regex library for itself.
That would lose the ability to cache regex compilations, but I'm not
sure we care.  The motivation for RE_compile_and_cache is mainly to
prevent having to compile the regex again for *every row*, which is what
would otherwise happen in a simple SELECT using a regex function.
Unless I'm misunderstanding something, pg_trgm would only need to
compile the regex once per indexscan, which is probably tolerable.

Another approach we could take is for adt/regexp.c to expose a function
that collects the deconstructed CNFA data into a palloc'd structure
using the above-proposed functions.  This'd be a reasonable way to go if
we decide the caching behavior is actually needed for this purpose; but
otherwise it seems like it's just involving one more module than
necessary.

BTW, one reason I don't like exposing RE_compile_and_cache directly is
that then you have the question of how long is the regex_t it hands back
good for.  The caller doesn't own that struct, and at some point
regexp.c is likely to recycle it, so it's a bit risky to just assume
it'll stay around long enough without any further interaction with
regexp.c.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-03-21 Thread Alexander Korotkov

On Thu, Mar 14, 2013 at 9:40 PM, Alexander Korotkov aekorot...@gmail.comwrote:

 On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 23.01.2013 09:36, Alexander Korotkov wrote:
  On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us  wrote:
  The biggest problem is that I really don't care for the idea of
  contrib/pg_trgm being this cozy with the innards of regex_t.

  The only option I see now is to provide a method like export_cnfa
 which
  would export corresponding CNFA in fixed format.

  Yeah, I think that makes sense. The transformation code in trgm_regexp.c
  would probably be more readable too, if it didn't have to deal with the
  regex guts representation of the CNFA. Also, once you have intermediate
  representation of the original CNFA, you could do some of the
  transformation work on that representation, before building the
  tranformed graph containing trigrams. You could eliminate any
  non-alphanumeric characters, joining states connected by arcs with
  non-alphanumeric characters, for example.

 It's not just the CNFA though; the other big API problem is with mapping
 colors back to characters.  Right now, that not only knows way too much
 about a part of the regex internals we have ambitions to change soon,
 but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
 which should be known to the regex library IMO.  So I'm not sure how we
 divvy that up sanely.  To be clear: I'm not going to insist that we have
 to have a clean API factorization before we commit this at all.  But it
 worries me if we don't even know how we could get to that, because we
 are going to need it eventually.


 Now I have following idea about API.
 Put code of stage 2 (transform the original CNFA into an automaton-like
 graph) into regex engine. It would use API which describes what exactly are
 we going to extract from CNFA. This API could look like this.

 typedef char *Prefix;
 typedef char *ArcLabel;

 typedef struct
 {
 Prefix newPrefix;
  ArcLabel label;
 } ArcInfo;

 typedef struct
 {
 Prefix (*getInitialPrefix) ();
  bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
 Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
  ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
 void (*freePrefix) (Prefix prefix);
  void (*freeArcLabel) (ArcLabel arcLabel);
 } CFNATransformAPI;

 getInitialPrefix returns initial prefix value like now this code does:
  initkey.prefix.colors[0] = UNKNOWN_COLOR;
  initkey.prefix.colors[1] = UNKNOWN_COLOR;
 prefixContains are exactly same as function with this name.
 getPrefixes and getArcs cycle step work of addKeys an addArcs.
 freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel
 strutures.

 Additionally regex engine should provide correct way to examine colormap.
 int getColorCharsCount(colormap *cm, color c);
 pg_wchar *getColorChars(colormap *cm, color c);
 getColorCharsCount would return -1 if this color should be considered as
 unexpandable.


Now I have working implemetation of this API. Comments still need rework.
Could you give me any feedback?

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.13.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-03-14 Thread Alexander Korotkov

On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 23.01.2013 09:36, Alexander Korotkov wrote:
  On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us  wrote:
  The biggest problem is that I really don't care for the idea of
  contrib/pg_trgm being this cozy with the innards of regex_t.

  The only option I see now is to provide a method like export_cnfa
 which
  would export corresponding CNFA in fixed format.

  Yeah, I think that makes sense. The transformation code in trgm_regexp.c
  would probably be more readable too, if it didn't have to deal with the
  regex guts representation of the CNFA. Also, once you have intermediate
  representation of the original CNFA, you could do some of the
  transformation work on that representation, before building the
  tranformed graph containing trigrams. You could eliminate any
  non-alphanumeric characters, joining states connected by arcs with
  non-alphanumeric characters, for example.

 It's not just the CNFA though; the other big API problem is with mapping
 colors back to characters.  Right now, that not only knows way too much
 about a part of the regex internals we have ambitions to change soon,
 but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
 which should be known to the regex library IMO.  So I'm not sure how we
 divvy that up sanely.  To be clear: I'm not going to insist that we have
 to have a clean API factorization before we commit this at all.  But it
 worries me if we don't even know how we could get to that, because we
 are going to need it eventually.


Now I have following idea about API.
Put code of stage 2 (transform the original CNFA into an automaton-like
graph) into regex engine. It would use API which describes what exactly are
we going to extract from CNFA. This API could look like this.

typedef char *Prefix;
typedef char *ArcLabel;

typedef struct
{
Prefix newPrefix;
 ArcLabel label;
} ArcInfo;

typedef struct
{
Prefix (*getInitialPrefix) ();
 bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
 ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
void (*freePrefix) (Prefix prefix);
 void (*freeArcLabel) (ArcLabel arcLabel);
} CFNATransformAPI;

getInitialPrefix returns initial prefix value like now this code does:
 initkey.prefix.colors[0] = UNKNOWN_COLOR;
 initkey.prefix.colors[1] = UNKNOWN_COLOR;
prefixContains are exactly same as function with this name.
getPrefixes and getArcs cycle step work of addKeys an addArcs.
freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel
strutures.

Additionally regex engine should provide correct way to examine colormap.
int getColorCharsCount(colormap *cm, color c);
pg_wchar *getColorChars(colormap *cm, color c);
getColorCharsCount would return -1 if this color should be considered as
unexpandable.

Any thoughts?

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2013-03-06 Thread Alexander Korotkov

On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 23.01.2013 09:36, Alexander Korotkov wrote:
  On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us  wrote:
  The biggest problem is that I really don't care for the idea of
  contrib/pg_trgm being this cozy with the innards of regex_t.

  The only option I see now is to provide a method like export_cnfa
 which
  would export corresponding CNFA in fixed format.

  Yeah, I think that makes sense. The transformation code in trgm_regexp.c
  would probably be more readable too, if it didn't have to deal with the
  regex guts representation of the CNFA. Also, once you have intermediate
  representation of the original CNFA, you could do some of the
  transformation work on that representation, before building the
  tranformed graph containing trigrams. You could eliminate any
  non-alphanumeric characters, joining states connected by arcs with
  non-alphanumeric characters, for example.

 It's not just the CNFA though; the other big API problem is with mapping
 colors back to characters.  Right now, that not only knows way too much
 about a part of the regex internals we have ambitions to change soon,
 but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
 which should be known to the regex library IMO.  So I'm not sure how we
 divvy that up sanely.  To be clear: I'm not going to insist that we have
 to have a clean API factorization before we commit this at all.  But it
 worries me if we don't even know how we could get to that, because we
 are going to need it eventually.


Now, we probably don't have enough of time before 9.3 to solve an API
problem :(. It's likely we have to choose either commit to 9.3 without
clean API factorization or postpone it to 9.4.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2013-03-06 Thread Stephen Frost

* Alexander Korotkov (aekorot...@gmail.com) wrote:
 Now, we probably don't have enough of time before 9.3 to solve an API
 problem :(. It's likely we have to choose either commit to 9.3 without
 clean API factorization or postpone it to 9.4.

As much as I'd like this to get in, I don't think there's a question
here- it should be punted to 9.4.  These don't look like small issues to
me.

Thanks,

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] WIP: index support for regexp search

2013-01-28 Thread Alexander Korotkov

On Sun, Jan 27, 2013 at 10:40 PM, Alexander Korotkov
aekorot...@gmail.comwrote:

 Now I'm working on additional comments.


Some comments were added for addKey and addArc(s). I hope they clarify
something.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.12.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-01-27 Thread Alexander Korotkov

On Fri, Jan 25, 2013 at 11:47 AM, Erik Rijkers e...@xs4all.nl wrote:

 On Wed, January 23, 2013 08:36, Alexander Korotkov wrote:
  Hi!
 
  Some quick answers to the part of notes/issues. I will provide rest of
  answers soon.
 
 [...]
  trgm-regexp-0.10.patch.gz27 k

 Trying to build this I get, after 'make install' in contrib/ :

 /usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or
 directory
 /usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or
 directory
 make[1]: *** [install] Error 1
 make: *** [install-pg_trgm-recurse] Error 2


Forgot to include these files into patch.

Another changes in new version of patch:
1) Get rid of recursion in trigramsMatchGraph and addKeys.
2) Both bos[0] and eos[0] are also included into checks.
3) Get rid of strnlen.
4) I found the way to fix work with collation in previous version of patch
to be wrong. Collation of operator must match collation of indexed column
for index scan. Only thing to fix is passing collation in gincost_pattern.
5) Some tests were added which improves the coverage.

Now I'm working on additional comments.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.11.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-01-24 Thread Erik Rijkers

On Wed, January 23, 2013 08:36, Alexander Korotkov wrote:
 Hi!

 Some quick answers to the part of notes/issues. I will provide rest of
 answers soon.

[...]
 trgm-regexp-0.10.patch.gz27 k

Trying to build this I get, after 'make install' in contrib/ :

/usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or directory
/usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or 
directory
make[1]: *** [install] Error 1
make: *** [install-pg_trgm-recurse] Error 2












-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-01-23 Thread Heikki Linnakangas


On 23.01.2013 09:36, Alexander Korotkov wrote:

Hi!

Some quick answers to the part of notes/issues. I will provide rest of
answers soon.

On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us  wrote:


The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t.  Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen.  Not sure what a suitable API would look like
though.


The only option I see now is to provide a method like export_cnfa which
would export corresponding CNFA in fixed format.


Yeah, I think that makes sense. The transformation code in trgm_regexp.c 
would probably be more readable too, if it didn't have to deal with the 
regex guts representation of the CNFA. Also, once you have intermediate 
representation of the original CNFA, you could do some of the 
transformation work on that representation, before building the 
tranformed graph containing trigrams. You could eliminate any 
non-alphanumeric characters, joining states connected by arcs with 
non-alphanumeric characters, for example.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-01-23 Thread Tom Lane

Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 23.01.2013 09:36, Alexander Korotkov wrote:
 On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us  wrote:
 The biggest problem is that I really don't care for the idea of
 contrib/pg_trgm being this cozy with the innards of regex_t.

 The only option I see now is to provide a method like export_cnfa which
 would export corresponding CNFA in fixed format.

 Yeah, I think that makes sense. The transformation code in trgm_regexp.c 
 would probably be more readable too, if it didn't have to deal with the 
 regex guts representation of the CNFA. Also, once you have intermediate 
 representation of the original CNFA, you could do some of the 
 transformation work on that representation, before building the 
 tranformed graph containing trigrams. You could eliminate any 
 non-alphanumeric characters, joining states connected by arcs with 
 non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters.  Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO.  So I'm not sure how we
divvy that up sanely.  To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all.  But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Anyway, I had another thought in the shower this morning, which is that
solving this problem in terms of color trigrams is really the Wrong
Thing.  ISTM it'd be better to think of the CNFA traversal logic as
looking for required sequences of colors of unspecified length, which
we'd then chop into trigrams after the fact.  This might produce
slightly better (more complete) trigram sets, but the real reason I'm
suggesting it is that I think the CNFA traversal code might be subject
to Polya's Inventor's Paradox: the more general problem may be easier
to solve.  It seems like casting the goal of that code as being to
find variable-length sequences, rather than exactly trigrams, might
lead to simpler data structures and more readable algorithms.  The
still-to-be-designed regex API for this also seems like it would be
better off if decoupled from the notion of trigrams.

It's quite possible that this idea is all wet and no meaningful
improvement can be gotten this way.  But I offer it for your
consideration.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-01-22 Thread Tom Lane

Heikki Linnakangas hlinnakan...@vmware.com writes:
 I finally got around to look at this. I like this new version, without 
 the path matrix, much better.

I looked through this version too.  I have some notes/issues:

The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t.  Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen.  Not sure what a suitable API would look like
though.

I think the assumption that all MB characters fit in 4 bytes is
unacceptable; someday we'll want to support wider Unicode characters
than we do now, and this code seems utterly unable to handle it.  It's
especially bad that the code isn't even bothering to defend itself
against the possibility of wider characters.

Can't just modify pg_trgm--1.0.sql in place, must create a 1.1 version
and an upgrade script.

Comments and documentation still need a lot of copy-editing, also I
think a lot of the comment blocks will not survive pg_indent.  It'd be
a good idea to run trgm_regexp.c through pg_indent as soon as you have
that fixed.

New file trgm_regexp.c lacks a copyright notice

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
enough; need to pass through the actual collation for the regex
operator.

How deep can the recursion in activateState() go?  Is this a practical
production approach at all?  Do we need a stack depth check there?
addKeys is also recursive, same questions.  (But on the other hand, the
check_stack_depth in scanColorMap seems useless, since its recursion
depth is fixed.)

Not too happy with convertPgWchar: aside from hard-wired, unchecked
assumption about maximum length of pg_wchar2mb_with_len result, why is
it that this is doing a lowercase conversion?  Surely the regex stuff
dealt with that already?

I'm suspicious of the code in addKeys() that is special-casing bos[1]
and eos[1] but not the other cases (BOL/EOL).  My recollection from
working on the fixed-prefix stuff is that it's not always obvious which
of those states gets used.

strnlen() used in fillTrgm() is probably not portable, and is definitely
not used anywhere else in Postgres.

This isn't a complete review, just some stuff I happened to notice in
one quick pass over the code.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2013-01-22 Thread Alexander Korotkov

Hi!

Some quick answers to the part of notes/issues. I will provide rest of
answers soon.

On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 The biggest problem is that I really don't care for the idea of
 contrib/pg_trgm being this cozy with the innards of regex_t.  Sooner
 or later we are going to want to split the regex code off again as a
 standalone library, and we *must* have a cleaner division of labor if
 that is ever to happen.  Not sure what a suitable API would look like
 though.


The only option I see now is to provide a method like export_cnfa which
would export corresponding CNFA in fixed format.


 I think the assumption that all MB characters fit in 4 bytes is
 unacceptable; someday we'll want to support wider Unicode characters
 than we do now, and this code seems utterly unable to handle it.  It's
 especially bad that the code isn't even bothering to defend itself
 against the possibility of wider characters.


In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use
it in type definition. Is this way ok?


 Can't just modify pg_trgm--1.0.sql in place, must create a 1.1 version
 and an upgrade script.


Fixed.


 Comments and documentation still need a lot of copy-editing, also I
 think a lot of the comment blocks will not survive pg_indent.  It'd be
 a good idea to run trgm_regexp.c through pg_indent as soon as you have
 that fixed.


Fixed.


 New file trgm_regexp.c lacks a copyright notice


Fixed.

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
 enough; need to pass through the actual collation for the regex
 operator.


We have collation passed to gin_extract_query in ginscan.c. I noticed
that gincost_pattern don't pass collation to gin_extract_query. Is it a
bug? Anyway this is not collation we need. We need collation used in
operator clause. In attached patch I introduce additional argument to
gin_extract_query which represent collation of operator clause. Do you
think it is reasonable change in GIN interface? If so, I will provide it as
separate patch.


 Not too happy with convertPgWchar: aside from hard-wired, unchecked
 assumption about maximum length of pg_wchar2mb_with_len result, why is
 it that this is doing a lowercase conversion?  Surely the regex stuff
 dealt with that already?


Trigrams are already lowercased. We can simplify our calculations by
excluding uppercased characters from consideration.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.10.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-18 Thread Alexander Korotkov

On Tue, Dec 18, 2012 at 11:45 AM, Erik Rijkers e...@xs4all.nl wrote:

 On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:
 I ran the same test again: HEAD versus trgm_regex v6, 7 and 9.  In v9
 there is some gain but also
 some regression.

 It remains a difficult problem...

 If I get some time in the holidays I'll try to diversify the test program;
 it is now too simple.


Note, that regexes which contains {,n} are likely not what do you expect.

test=# select 'xq' ~ 'x[aeiou]{,2}q';
 ?column?
--
 f
(1 row)

test=# select 'xa{,2}q' ~ 'x[aeiou]{,2}q';
 ?column?
--
 t
(1 row)

You should use {0,n} to express from 0 to n occurences.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-12-18 Thread Erik Rijkers

On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:

 You should use {0,n} to express from 0 to n occurences.



Thanks, but I know that of course.  It's a testing program; and in the end 
robustness with
unexpected or even wrong input is as important as performance.  (to put it 
bluntly, I am also
trying to get your patch to fall over ;-))

Thanks,

Erik Rijkers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-18 Thread Alexander Korotkov

On Tue, Dec 18, 2012 at 12:51 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:
 
  You should use {0,n} to express from 0 to n occurences.
 


 Thanks, but I know that of course.  It's a testing program; and in the end
 robustness with
 unexpected or even wrong input is as important as performance.  (to put it
 bluntly, I am also
 trying to get your patch to fall over ;-))


I found most of regressions in 0.9 version to be in {,n} cases. New version
of patch use more of trigrams than previous versions.
For example for regex 'x[aeiou]{,2}q'.
In 0.7 version we use trigrams '__2', '_2_' and '__q'.
In 0.9 version we use trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_', '__2',
'_2_' and '__q'.

But, actually trigram '__2' or '_2_' never occurs. It enough to have one of
them, all others are just causing a slowdown. Simultaneously, we can't
decide reasonably which trigrams to use without knowing their frequencies.
For example, if trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_' were altogether
more rare than '__2', newer version of patch would be faster.


--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-12-17 Thread Erik Rijkers

On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:

 trgm-regexp-0.8.patch.gz   22 k

Hi Alexander,

I gave this a quick try; the patch works when compiled for DEBUG, but crashes 
as a
'speed'-compiled binary:

Compile for speed:

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8' 
'--with-pgport=6556'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

$ psql
psql 
(9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
Type help for help.

testdb=# explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
The connection to the server was lost. Attempting reset: Failed.
! \q


log after such:
-
2012-12-17 09:31:02.337 CET 15801 LOG:  server process (PID 16903) was 
terminated by signal 11:
Segmentation fault
2012-12-17 09:31:02.337 CET 15801 DETAIL:  Failed process was running: explain 
analyze select txt
from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
2012-12-17 09:31:02.347 CET 15801 LOG:  terminating any other active server 
processes
2012-12-17 09:31:02.348 CET 17049 FATAL:  the database system is in recovery 
mode
2012-12-17 09:31:02.722 CET 15801 LOG:  all server processes terminated; 
reinitializing
2012-12-17 09:31:03.506 CET 17052 LOG:  database system was interrupted; last 
known up at
2012-12-17 09:26:00 CET
2012-12-17 09:31:03.693 CET 17052 LOG:  database system was not properly shut 
down; automatic
recovery in progress
2012-12-17 09:31:04.493 CET 17052 LOG:  record with zero length at 2/7E3C7588
2012-12-17 09:31:04.494 CET 17052 LOG:  redo is not required
2012-12-17 09:31:06.940 CET 15801 LOG:  database system is ready to accept 
connections






A debug-compile with below options runs OK (so far):

Compile for debug:

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8b' 
'--with-pgport=6560'
'--enable-depend' '--enable-cassert' '--enable-debug' '--with-openssl' 
'--with-perl'
'--with-libxml'

which does show some speed gain, I think.  When I have time I'll post 
comparisons between HEAD,
versions 6, 7, 8.  (BTW, is v6 still interesting?)


Thanks,

Erik Rijkers








-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-17 Thread Alexander Korotkov

Hi!

On Mon, Dec 17, 2012 at 12:54 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:

  trgm-regexp-0.8.patch.gz   22 k

 Hi Alexander,

 I gave this a quick try; the patch works when compiled for DEBUG, but
 crashes as a
 'speed'-compiled binary:

 Compile for speed:

 $ pg_config --configure
 '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8'
 '--with-pgport=6556'
 '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

 $ psql
 psql
 (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
 Type help for help.

 testdb=# explain analyze select txt from azjunk5 where txt ~
 'x[aeiouy]{2,5}q';
 The connection to the server was lost. Attempting reset: Failed.
 ! \q


Didn't reproduce it yet. Can you retry it with this line uncommented:
#define TRGM_REGEXP_DEBUG
Then we can see which stage it fails.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-12-17 Thread Alexander Korotkov

On Mon, Dec 17, 2012 at 1:16 PM, Alexander Korotkov aekorot...@gmail.comwrote:

 Didn't reproduce it yet. Can you retry it with this line uncommented:
 #define TRGM_REGEXP_DEBUG
 Then we can see which stage it fails.


Bug is found and fixed in attached patch.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.9.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-17 Thread Erik Rijkers

On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:

 trgm-regexp-0.9.patch.gz   22 k

Hi.

I ran the same test again: HEAD versus trgm_regex v6, 7 and 9.  In v9 there is 
some gain but also
some regression.

It remains a difficult problem...

If I get some time in the holidays I'll try to diversify the test program; it 
is now too simple.


Thanks,

Erik Rijkers

trgmre_6_7_9.txt.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-16 Thread Alexander Korotkov

On Fri, Dec 14, 2012 at 1:34 AM, Alexander Korotkov aekorot...@gmail.comwrote:

 On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov 
 aekorot...@gmail.comwrote:

 Actually, I generally dislike path matrix for same reasons. But:
 1) Output graphs could contain trigrams which are completely useless for
 search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram
 while graph would contain much more.Path matrix is a method to get rid of
 all of them.
 2) If we use color trigrams then we need some criteria for which color
 trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
 from initial state to the final by unexpanded trigrams. It seems much
 harder to do with graph than with matrix.


 Now, I have an idea about doing some not comprehensive but simple and fast
 simplification of graph. I'm doing experiments now. In case of success we
 could get rid of path matrix.


Attached patch have following changes:
1) Postphone expansion of colors. Graph are building on color trigrams.
2) Selective expansion of color trigrams into simple trigrams. All
non-expanded color trigrams are removed. Such removal leads to union of all
states pairs connected with corresponding arcs. Surely, this must no lead
to union of initial and final states: that could do all previous work
senseless.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.8.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-13 Thread Alexander Korotkov

On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov aekorot...@gmail.comwrote:

 Actually, I generally dislike path matrix for same reasons. But:
 1) Output graphs could contain trigrams which are completely useless for
 search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram
 while graph would contain much more.Path matrix is a method to get rid of
 all of them.
 2) If we use color trigrams then we need some criteria for which color
 trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
 from initial state to the final by unexpanded trigrams. It seems much
 harder to do with graph than with matrix.


Now, I have an idea about doing some not comprehensive but simple and fast
simplification of graph. I'm doing experiments now. In case of success we
could get rid of path matrix.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-12-03 Thread Heikki Linnakangas


On 02.12.2012 20:19, Tom Lane wrote:

Alexander Korotkovaekorot...@gmail.com  writes:

Nice idea to delay expanding colors to characters! Obviously, we should
delay expanding inly alphanumerical characters. Because non-alphanumberical
characters influence graph structure. Trying to implement...


Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
care whether they represent alphanumerics or not.  (Or to be more
precise, if there is a situation where it makes a difference, separate
colors will have been created for each set of characters that need to be
distinguished.)


The regexp machinery doesn't care, but the trigrams that pg_trgm 
extracts only contain alphanumeric characters. So if by looking at the 
CNFA graph produced by the regexp machinery you conclude that any 
matching strings must contain three-letter sequences %oo and #oo, 
you can just club them together into  oo trigram.


I think you can run a pre-processing step to the colors, and merge 
colors that are equivalent as far as trigrams are considered. For 
example, if you have a color that contains only character '%', and 
another that contains character '#', you can treat them as the same 
hcolor. You might then be able to simplify the CNFA. Actually, it would 
be even better if you could apply the pre-processing to the regexp 
before the regexp machinery turns it into a CNFA. Not sure how easy it 
would be to do such pre-processing.


BTW, why create the path matrix? You could check the check array of 
trigrams in the consistent function directly against the graph. 
Consistent should return true, iff there is a path through the graph 
following only arcs that contain trigrams present in the check array. 
Finding a path through a complex graph could be expensive, O(|E|), but 
if the path is complex, the path matrix would be large as well, and 
checking against a large matrix isn't exactly free either. It would 
allow you to avoid overflows caused by having too many paths through 
the graph.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-03 Thread Alexander Korotkov

On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas
hlinnakan...@vmware.comwrote:

 On 02.12.2012 20:19, Tom Lane wrote:

 Alexander Korotkovaekorot...@gmail.com  writes:

 Nice idea to delay expanding colors to characters! Obviously, we should
 delay expanding inly alphanumerical characters. Because
 non-alphanumberical
 characters influence graph structure. Trying to implement...


 Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
 care whether they represent alphanumerics or not.  (Or to be more
 precise, if there is a situation where it makes a difference, separate
 colors will have been created for each set of characters that need to be
 distinguished.)


 The regexp machinery doesn't care, but the trigrams that pg_trgm extracts
 only contain alphanumeric characters. So if by looking at the CNFA graph
 produced by the regexp machinery you conclude that any matching strings
 must contain three-letter sequences %oo and #oo, you can just club them
 together into  oo trigram.

 I think you can run a pre-processing step to the colors, and merge colors
 that are equivalent as far as trigrams are considered. For example, if you
 have a color that contains only character '%', and another that contains
 character '#', you can treat them as the same hcolor. You might then be
 able to simplify the CNFA. Actually, it would be even better if you could
 apply the pre-processing to the regexp before the regexp machinery turns it
 into a CNFA. Not sure how easy it would be to do such pre-processing.


Treating colors as same should be possible only for colors which has no
alphanumeric characters, because colors are non-overlapping. However, this
optimization could be significant in some cases.

BTW, why create the path matrix? You could check the check array of
 trigrams in the consistent function directly against the graph. Consistent
 should return true, iff there is a path through the graph following only
 arcs that contain trigrams present in the check array. Finding a path
 through a complex graph could be expensive, O(|E|), but if the path is
 complex, the path matrix would be large as well, and checking against a
 large matrix isn't exactly free either. It would allow you to avoid
 overflows caused by having too many paths through the graph.


Actually, I generally dislike path matrix for same reasons. But:
1) Output graphs could contain trigrams which are completely useless for
search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram
while graph would contain much more.Path matrix is a method to get rid of
all of them.
2) If we use color trigrams then we need some criteria for which color
trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
from initial state to the final by unexpanded trigrams. It seems much
harder to do with graph than with matrix.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-12-02 Thread Alexander Korotkov

On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 30.11.2012 13:20, Alexander Korotkov wrote:

 On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangashlinnakangas@**
 vmware.com hlinnakan...@vmware.com

 wrote:


  Would it be safe to simply stop short the depth-first search on overflow,
 and proceed with the graph that was constructed up to that point?


 For depth-first it's not. But your proposal naturally makes sense. I've
 changed it to breadth-first search. And then it's safe to mark all
 unprocessed states as final when overflow. It means that we assume every
 unprocessed branch to immediately finish with matching (this can give us
 more false positives but no false negatives).
 For overflow of matrix collection, it's safe to do just OR between all the
 trigrams.
 New version of patch is attached.


 Thanks, sounds good.

 I've spent quite a long time trying to understand the transformation the
 getState/addKeys/addAcrs functions do to the original CNFA graph. I think
 that still needs more comments to explain the steps involved in it.

 One thing that occurs to me is that it might be better to delay expanding
 colors to characters. You could build and transform the graph, and even
 create the paths, all while operating on colors. You would end up with
 lists of color trigrams, consisting of sequences of three colors that
 must appear in the source string. Only at the last step you would expand
 the color trigrams to real character trigrams. I think that would save a
 lot of processing while building the graph, if you have colors that contain
 many characters. It would allow us to do better than the fixed small
 MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current
 patch, it's a shame that we can't do anything with a fairly simple regexp
 like ^a[b-g]h$


Nice idea to delay expanding colors to characters! Obviously, we should
delay expanding inly alphanumerical characters. Because non-alphanumberical
characters influence graph structure. Trying to implement...

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-12-02 Thread Alexander Korotkov

On Sat, Dec 1, 2012 at 3:22 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Fri, November 30, 2012 12:22, Alexander Korotkov wrote:
  Hi!
 
  On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote:
 
  On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:
 
 
  I ran the simple-minded tests against generated data (similar to the
 ones
  I did in January 2012).
  The problems of that older version seem pretty much all removed.
 (although
  I didn't do much work
  on it -- just reran these tests).
 
 
  Thanks a lot for testing! Could you repeat for 0.7 version of patch which
  has new overflow handling?
 

 I've attached a similar test re-run that compares HEAD with patch versions
 0.6, and 0.7.


Thanks! Did you write scripts for automated testing? I would be nice if you
share them.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-12-02 Thread Tom Lane

Alexander Korotkov aekorot...@gmail.com writes:
 Nice idea to delay expanding colors to characters! Obviously, we should
 delay expanding inly alphanumerical characters. Because non-alphanumberical
 characters influence graph structure. Trying to implement...

Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
care whether they represent alphanumerics or not.  (Or to be more
precise, if there is a situation where it makes a difference, separate
colors will have been created for each set of characters that need to be
distinguished.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-02 Thread Erik Rijkers

On Sun, December 2, 2012 19:07, Alexander Korotkov wrote:

 I've attached a similar test re-run that compares HEAD with patch versions
 0.6, and 0.7.


 Thanks! Did you write scripts for automated testing? I would be nice if you
 share them.


Sure, here they are.

The perl program does depend a bit on my particular setup (it reads the port 
from the
postgresql.conf of each instance, and the script knows the data_dir locations), 
but I suppose it's
easy enough to remove that.

(Just hardcode the %instances hash, instead of calling instances().)

If you need help to get them to run I can 'generalise' them, but for now I'll 
send them as they are.

Erik Rijkers




create_data.sh
Description: application/shellscript


trgm_regex_test.pl
Description: Perl program

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-12-01 Thread Erik Rijkers

On Fri, November 30, 2012 12:22, Alexander Korotkov wrote:
 Hi!

 On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote:

 On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:


 I ran the simple-minded tests against generated data (similar to the ones
 I did in January 2012).
 The problems of that older version seem pretty much all removed. (although
 I didn't do much work
 on it -- just reran these tests).


 Thanks a lot for testing! Could you repeat for 0.7 version of patch which
 has new overflow handling?


I've attached a similar test re-run that compares HEAD with patch versions 0.6, 
and 0.7.


Erik Rijkers


trgm_compare.txt.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-30 Thread Alexander Korotkov

On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 One thing that bothers me with this algoritm is that the overflow
 mechanism is all-or-nothing. In many cases, even when there is a huge
 number of states in the diagram, you could still extract at least a few
 trigrams that must be present in any matching string, with little effort.
 At least, it seems like that to a human :-).

 For example, consider this:

 explain analyze select count(*) from azjunk4 where txt ~ ('^**
 aabaacaadaaeaafaagaahaaiaajaak**aalaamaanaaoaapaaqaaraasaataau**
 aavaawaaxaayaazabaabbabcabdabe**abfabgabhabiabjabkablabmabnabo**
 abpabqabrabsabtabuabvabwabxaby**abzacaacbaccacdaceacfacgachaci**
 acjackaclacmacnacoacpacqacracs**actacuacvacwacxacyaczadaadbadc**
 addadeadfadgadhadiadjadkadladm**adnadoadpadqadradsadtaduadvadw**
 adxadyadzaeaaebaecaedaeeaefaeg**aehaeiaejaekaelaemaenaeoaepaeq**
 aeraesaetaeuaevaewaexaeyaezafa**afbafcafdafeaffafgafhafiafjafk**
 aflafmafnafoafpafqafrafsaftafu**afvafwafxafyafzagaagbagcagdage**
 agfaggaghagiagjagkaglagmagnago**agpagqagragsagtaguagvagwagxagy**
 agzahaahbahcahdaheahfahgahhahi**ahjahkahlahmahnahoahpahqahrahs**$');

 you get a query plan like this (the long regexp string edited out):

  Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual
 time=131.100..131
 .101 rows=1 loops=1)
-  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1
 width=0) (
 actual time=131.096..131.096 rows=0 loops=1)
  Recheck Cond: (txt ~ ridiculously long regexp)
  Rows Removed by Index Recheck: 1
  -  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
 (cost=0.00..228144
 .01 rows=1 width=0) (actual time=82.914..82.914 rows=1 loops=1)
Index Cond: (txt ~ ridiculously long regexp)
  Total runtime: 131.230 ms
 (7 rows)

 That ridiculously long string exceeds the number of states (I think, could
 be number of paths or arcs too), and the algorithm gives up, resorting to
 scanning the whole index as can be seen by the Rows Removed by Index
 Recheck line. However, it's easy to see that any matching string must
 contain *any* of the possible trigrams the algorithm extracts. If it could
 safely return just a few of them, say aab and abz, and discard the
 rest, that would already be much better than a full index scan.

 Would it be safe to simply stop short the depth-first search on overflow,
 and proceed with the graph that was constructed up to that point?


For depth-first it's not. But your proposal naturally makes sense. I've
changed it to breadth-first search. And then it's safe to mark all
unprocessed states as final when overflow. It means that we assume every
unprocessed branch to immediately finish with matching (this can give us
more false positives but no false negatives).
For overflow of matrix collection, it's safe to do just OR between all the
trigrams.
New version of patch is attached.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.7.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-30 Thread Alexander Korotkov

Hi!

On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote:

 On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:

  trgm-regexp-0.6.patch.gz

 I ran the simple-minded tests against generated data (similar to the ones
 I did in January 2012).
 The problems of that older version seem pretty much all removed. (although
 I didn't do much work
 on it -- just reran these tests).


Thanks a lot for testing! Could you repeat for 0.7 version of patch which
has new overflow handling?

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-11-30 Thread Alexander Korotkov

On Fri, Nov 30, 2012 at 3:20 PM, Alexander Korotkov aekorot...@gmail.comwrote:

 For depth-first it's not.


Oh, I didn't explained it.
In order to stop graph processing we need to be sure that we put all
outgoing arcs from state or assume that state to be final. In DFS we can be
in the final part of graph producing but still didn't add some arc (with
new trigram) from initial state directly to the final state. It obviously
leads to false negatives.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-11-29 Thread er

On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:

 trgm-regexp-0.6.patch.gz

I ran the simple-minded tests against generated data (similar to the ones I did 
in January 2012). 
The problems of that older version seem pretty much all removed. (although I 
didn't do much work
on it -- just reran these tests).

I used two 2 instances, 'HEAD' and 'trgm_regex', which were both compiled with

 '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

Tables used:

rowcount size tablesize index (trgm)

 azjunk4   10^4 rows  1,171,456 |   9,781,248
 azjunk5   10^5 rows 11,706,368 |  65,093,632
 azjunk6   10^6 rows117,030,912 | 726,310,912
 azjunk7   10^7 rows  1,170,292,736 |   4,976,189,440

   (See my previous emails for a generating script)

Tables contain random generated text:

table azjunk7 limit 5;
   txt
--
 i kzzhv ssaa zv x  xlepzxsgbdkxev v wn dmvqkuwb qxkyvgab gpidaosaqbewqimmai  
jxj
 bvwn zbevtzyhibbn hoctxurutn pvlatjjyxf f runa owpltbcunrbq ux peoook 
rxwoscbytz
 bbjlbbhhkivjivklgbh tvapzogh rj ky ahvgkvvlfudotvqapznludohdoyqrp 
kvothyclbckbxu
 hvic gomewbp izsjifqggyqgzcghdat lb kud ltfqaxqxjjom qkw wqggikgvph   yi 
sftmbjt
 edbjfl vtcasudjpgfgjaf swooxygsse flnqd pxzsdmesqhqbzgirswysote muakq agk p w 
uq
(5 rows)

with index on column 'txt':

  create index az7_idx on azjunk7 using gin (txt gin_trgm_ops);

Queries were of the form:

   explain analyze select txt from azjunkXX where txt ~ '$REGEX';

The main problem with the January version was that it chose to use the trgm 
index even when it
could take a long time (hours).  This has been resolved as far as I can see, 
and the results are
now very attractive.

(There does seem to be a very slight regression on the seqscan, but it's so 
small that I'm not yet
sure it's not noise)


Hardware: AMD FX-8120 with Linux 2.6.32-279.14.1.el6.x86_64  x86_64  GNU/Linux

PostgreSQL 
9.3devel-trgm_regex-20121127_2353-e78d288c895bd296e3cb1ca29c7fe2431eef3fcd on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.7.2, 64-bit


 port instancetableregex rows method
expl.analyze timing

 6543 HEADazjunk4  x[ae]q  46 Seq Scan 
12.962 ms
 6554 trgm_regex  azjunk4  x[ae]q  46 Bitmap Heap Scan  
0.800 ms

 6543 HEADazjunk4  x[ae]{1}q   46 Seq Scan 
12.487 ms
 6554 trgm_regex  azjunk4  x[ae]{1}q   46 Bitmap Heap Scan  
0.209 ms

 6543 HEADazjunk4  x[ae]{1,1}q 46 Seq Scan 
12.266 ms
 6554 trgm_regex  azjunk4  x[ae]{1,1}q 46 Bitmap Heap Scan  
0.210 ms

 6543 HEADazjunk4  x[ae]{,2}q   0 Seq Scan 
14.322 ms
 6554 trgm_regex  azjunk4  x[ae]{,2}q   0 Bitmap Heap Scan  
0.610 ms

 6543 HEADazjunk4  x[ae]{,10}q  0 Seq Scan 
20.503 ms
 6554 trgm_regex  azjunk4  x[ae]{,10}q  0 Bitmap Heap Scan  
0.511 ms

 6543 HEADazjunk4  x[ae]{1,2}q 49 Seq Scan 
13.060 ms
 6554 trgm_regex  azjunk4  x[ae]{1,2}q 49 Bitmap Heap Scan  
0.429 ms

 6543 HEADazjunk4  x[aei]q 81 Seq Scan 
12.487 ms
 6554 trgm_regex  azjunk4  x[aei]q 81 Bitmap Heap Scan  
0.367 ms

 6543 HEADazjunk4  x[aei]{1}q  81 Seq Scan 
12.132 ms
 6554 trgm_regex  azjunk4  x[aei]{1}q  81 Bitmap Heap Scan  
0.336 ms

 6543 HEADazjunk4  x[aei]{1,1}q81 Seq Scan 
12.168 ms
 6554 trgm_regex  azjunk4  x[aei]{1,1}q81 Bitmap Heap Scan  
0.319 ms

 6543 HEADazjunk4  x[aei]{,2}q  0 Seq Scan 
14.586 ms
 6554 trgm_regex  azjunk4  x[aei]{,2}q  0 Bitmap Heap Scan  
0.621 ms

 6543 HEADazjunk4  x[aei]{,10}q 0 Seq Scan 
20.134 ms
 6554 trgm_regex  azjunk4  x[aei]{,10}q 0 Bitmap Heap Scan  
0.552 ms

 6543 HEADazjunk4  x[aei]{1,2}q89 Seq Scan 
12.553 ms
 6554 trgm_regex  azjunk4  x[aei]{1,2}q89 Bitmap Heap Scan  
0.916 ms

 6543 HEADazjunk4  x[aei]{1,3}q89 Seq Scan 
13.055 ms
 6554 trgm_regex  azjunk4  x[aei]{1,3}q89 Seq Scan 
13.064 ms

 6543 HEADazjunk4  x[aei]q 81 Seq Scan 
11.856 ms
 6554 trgm_regex  azjunk4  x[aei]q 81 Bitmap Heap Scan  
0.398 ms

 6543 HEADazjunk4  x[aei]{1}q  81 Seq Scan 
11.951 ms
 6554 trgm_regex  azjunk4  x[aei]{1}q  81 Bitmap Heap Scan  
0.369 ms

 6543 HEADazjunk4  x[aei]{1,1}q81 Seq Scan 
12.750 ms
 6554 trgm_regex  azjunk4

Re: [HACKERS] WIP: index support for regexp search

2012-11-29 Thread Heikki Linnakangas

One thing that bothers me with this algoritm is that the overflow 
mechanism is all-or-nothing. In many cases, even when there is a huge 
number of states in the diagram, you could still extract at least a few 
trigrams that must be present in any matching string, with little 
effort. At least, it seems like that to a human :-).


For example, consider this:

explain analyze select count(*) from azjunk4 where txt ~ 
('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');


you get a query plan like this (the long regexp string edited out):

 Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual 
time=131.100..131

.101 rows=1 loops=1)
   -  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1 
width=0) (

actual time=131.096..131.096 rows=0 loops=1)
 Recheck Cond: (txt ~ ridiculously long regexp)
 Rows Removed by Index Recheck: 1
 -  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx 
(cost=0.00..228144

.01 rows=1 width=0) (actual time=82.914..82.914 rows=1 loops=1)
   Index Cond: (txt ~ ridiculously long regexp)
 Total runtime: 131.230 ms
(7 rows)

That ridiculously long string exceeds the number of states (I think, 
could be number of paths or arcs too), and the algorithm gives up, 
resorting to scanning the whole index as can be seen by the Rows 
Removed by Index Recheck line. However, it's easy to see that any 
matching string must contain *any* of the possible trigrams the 
algorithm extracts. If it could safely return just a few of them, say 
aab and abz, and discard the rest, that would already be much better 
than a full index scan.


Would it be safe to simply stop short the depth-first search on 
overflow, and proceed with the graph that was constructed up to that point?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-26 Thread Heikki Linnakangas


On 25.11.2012 22:55, Alexander Korotkov wrote:

On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangashlinnakan...@vmware.com

wrote:



Glad to see this patch hasn't been totally forgotten. Being able to use
indexes for regular expressions would be really cool!

Back in January, I asked for some high-level description of how the
algorithm works (http://archives.postgresql.**
org/message-id/4F187D5C.30701@**enterprisedb.comhttp://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com).
That's still sorely needed. Googling around, I found the slides for your
presentation on this from PGConf.EU - it would be great to have the
information from that presentation included in the patch.



New version of patch is attached. The changes are following:
1) A big comment with high-level description of what is going on.
2) Regression tests.
3) Documetation update.
4) Some more refactoring.


Great, that top-level comment helped tremendously! I feel enlightened.

I fixed some spelling, formatting etc. trivial stuff while reading 
through the patch, see attached. Below is some feedback on the details:


* I don't like the PG_TRY/CATCH trick. It's not generally safe to catch 
an error, without propagating it further or rolling back the whole 
(sub)transation. It might work in this case, as you're only suppressing 
errors with the special sqlcode that are used in the same file, but it 
nevertheless feels naughty. I believe none of the limits that are being 
checked are strict; it's OK to exceed the limits somewhat, as long as 
you terminate the processing in a reasonable time, in case of 
pathological input. I'd suggest putting an explicit check for the limits 
somewhere, and not rely on ereport(). Something like this, in the code 
that recurses:


if (trgmCNFA-arcsCount  MAX_RESULT_ARCS ||
hash_get_num_entries(trgmCNFA-states)  MAX_RESULT_STATES)
{
trgmCNFA-overflowed = true;
return;
}

And then check for the overflowed flag at the top level.

* This part of the high-level comment was not clear to me:


 * States of the graph produced in the first stage are marked with keys. Key 
is a pair
 * of a prefix and a state of the original automaton. Prefix is a last
 * characters. So, knowing the prefix is enough to know what is a trigram when 
we read some new
 * character. However, we can know single character of prefix or don't know any
 * characters of it. Each state of resulting graph have an enter key (with 
that
 * key we've entered this state) and a set of keys which are reachable without
 * reading any predictable trigram. The algorithm of processing each state
 * of resulting graph are so:
 * 1) Add all keys which achievable without reading of any predictable trigram.
 * 2) Add outgoing arcs labeled with trigrams.
 * Step 2 leads to creation of new states and recursively processing them. So,
 * we use depth-first algorithm.


I didn't understand that. Can you elaborate? It might help to work 
through an example, with some ascii art depicting the graph.


* It would be nice to add some comments to TrgmCNFA struct, explaining 
which fields are valid at which stages. For example, it seems that 
'trgms' array is calculated only after building the CNFA, by 
getTrgmVector() function, while arcsCount is updated on the fly, while 
recursing in the getState() function.


* What is the representation used for the path matrix? Needs a comment.

* What do the getColorinfo() and scanColorMap() functions do? What 
exactly does a color represent? What's the tradeoff in choosing 
MAX_COLOR_CHARS?


- Heikki
diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
index 64fd69f..8033733 100644
--- a/contrib/pg_trgm/Makefile
+++ b/contrib/pg_trgm/Makefile
@@ -1,7 +1,7 @@
 # contrib/pg_trgm/Makefile
 
 MODULE_big = pg_trgm
-OBJS = trgm_op.o trgm_gist.o trgm_gin.o
+OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o
 
 EXTENSION = pg_trgm
 DATA = pg_trgm--1.0.sql pg_trgm--unpackaged--1.0.sql
diff --git a/contrib/pg_trgm/expected/pg_trgm.out b/contrib/pg_trgm/expected/pg_trgm.out
index 81d0ca8..ee0131f 100644
--- a/contrib/pg_trgm/expected/pg_trgm.out
+++ b/contrib/pg_trgm/expected/pg_trgm.out
@@ -54,7 +54,7 @@ select similarity('wow',' WOW ');
 (1 row)
 
 CREATE TABLE test_trgm(t text);
-\copy test_trgm from 'data/trgm.data
+\copy test_trgm from 'data/trgm.data'
 select t,similarity(t,'qwertyu0988') as sml from test_trgm where t % 'qwertyu0988' order by sml desc, t;
   t  |   sml
 -+--
@@ -3515,6 +3515,47 @@ select * from test2 where t ilike 'qua%';
  quark
 (1 row)
 
+select * from test2 where t ~ '[abc]{3}';
+   t
+
+ abcdef
+(1 row)
+
+select * from test2 where t ~ 'a[bc]+d';
+   t
+
+ abcdef
+(1 row)
+
+select * from test2 where t ~* 'DEF';
+   t
+
+ abcdef
+(1 row)
+
+select * from test2 where t ~ 'dEf';
+ t 
+---
+(0 rows)
+
+select * from test2 where t ~* '^q';
+   t   
+---
+ quark
+(1 row)
+
+select * from test2 where

Re: [HACKERS] WIP: index support for regexp search

2012-11-26 Thread Alexander Korotkov

On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 Great, that top-level comment helped tremendously! I feel enlightened.

 I fixed some spelling, formatting etc. trivial stuff while reading through
 the patch, see attached. Below is some feedback on the details:

 * I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an
 error, without propagating it further or rolling back the whole
 (sub)transation. It might work in this case, as you're only suppressing
 errors with the special sqlcode that are used in the same file, but it
 nevertheless feels naughty. I believe none of the limits that are being
 checked are strict; it's OK to exceed the limits somewhat, as long as you
 terminate the processing in a reasonable time, in case of pathological
 input. I'd suggest putting an explicit check for the limits somewhere, and
 not rely on ereport(). Something like this, in the code that recurses:

 if (trgmCNFA-arcsCount  MAX_RESULT_ARCS ||
 hash_get_num_entries(trgmCNFA-**states)  MAX_RESULT_STATES)
 {
 trgmCNFA-overflowed = true;
 return;
 }


PG_TRY/CATCH trick is replaced with some number of if/return. Code becomes
a bit more complicated, but your notes does matter.

And then check for the overflowed flag at the top level.

 * This part of the high-level comment was not clear to me:

   * States of the graph produced in the first stage are marked with
 keys. Key is a pair
  * of a prefix and a state of the original automaton. Prefix is a last
  * characters. So, knowing the prefix is enough to know what is a trigram
 when we read some new
  * character. However, we can know single character of prefix or don't
 know any
  * characters of it. Each state of resulting graph have an enter key
 (with that
  * key we've entered this state) and a set of keys which are reachable
 without
  * reading any predictable trigram. The algorithm of processing each state
  * of resulting graph are so:
  * 1) Add all keys which achievable without reading of any predictable
 trigram.
  * 2) Add outgoing arcs labeled with trigrams.
  * Step 2 leads to creation of new states and recursively processing
 them. So,
  * we use depth-first algorithm.


 I didn't understand that. Can you elaborate? It might help to work through
 an example, with some ascii art depicting the graph.


This comment is extended with example.


 * It would be nice to add some comments to TrgmCNFA struct, explaining
 which fields are valid at which stages. For example, it seems that 'trgms'
 array is calculated only after building the CNFA, by getTrgmVector()
 function, while arcsCount is updated on the fly, while recursing in the
 getState() function.


TrgmCNFA comment is extended with this.


 * What is the representation used for the path matrix? Needs a comment.


Comments are added to PackedTrgmPaths and TrgmStatePath.


 * What do the getColorinfo()


getColorInfo comment now references to ColorInfo struct which have comment.

and scanColorMap() functions do?


 scanColorMap comment now have description of colormap structure.


 What exactly does a color represent?


This is added to the top comment.


 What's the tradeoff in choosing MAX_COLOR_CHARS?


Comment is added to the macro.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.6.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-25 Thread Alexander Korotkov

On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 Glad to see this patch hasn't been totally forgotten. Being able to use
 indexes for regular expressions would be really cool!

 Back in January, I asked for some high-level description of how the
 algorithm works (http://archives.postgresql.**
 org/message-id/4F187D5C.30701@**enterprisedb.comhttp://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com).
 That's still sorely needed. Googling around, I found the slides for your
 presentation on this from PGConf.EU - it would be great to have the
 information from that presentation included in the patch.


New version of patch is attached. The changes are following:
1) A big comment with high-level description of what is going on.
2) Regression tests.
3) Documetation update.
4) Some more refactoring.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.5.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-25 Thread Alexander Korotkov

Hi!

On Wed, Nov 21, 2012 at 12:51 AM, Pavel Stehule pavel.steh...@gmail.comwrote:

 do you plan to support GiST?


At first, I would note that pg_trgm GiST opclass is quite ridiculous for
support regex search (and, actually for LIKE/ILIKE search which is already
implemented too). Because in GiST opclass we store set of trigrams in leaf
pages. In was designed for trigram similarity search and have sense for it
because of elimination of trigram set computation. But for regex or
LIKE/ILIKE search this representation is both lossy and bigger than just
original string. Probably we could think about another opclass for GiST
focusing on regex and LIKE/ILIKE search?

However, amyway I can create additional patch for current GiST opclass.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-11-20 Thread Heikki Linnakangas

Glad to see this patch hasn't been totally forgotten. Being able to use 
indexes for regular expressions would be really cool!


Back in January, I asked for some high-level description of how the 
algorithm works 
(http://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com). 
That's still sorely needed. Googling around, I found the slides for your 
presentation on this from PGConf.EU - it would be great to have the 
information from that presentation included in the patch.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-20 Thread Alexander Korotkov

On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

 2) It's common to use upper-case names for macros, but trgm.h defines
macro iswordchr - I see it's moved from trgm_op.c but maybe we
could make it a bit more correct?

 3) I see there are two '#ifdef KEEPONLYALNUM blocks right next to each
other in trgm.h - maybe it'd be a good idea to join them?

 4) The two new method prototypes at the end of trgm.h use different
indendation than the rest (spaces only instead of tabs).

These issues are fixed in attached patch.
Additionally 3 new macros are
introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are
limiting resources usage during regex processing.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.4.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-20 Thread Pavel Stehule

hello

do you plan to support GiST?

Regards

Pavel

2012/11/20 Alexander Korotkov aekorot...@gmail.com:
 On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

 2) It's common to use upper-case names for macros, but trgm.h defines
macro iswordchr - I see it's moved from trgm_op.c but maybe we
could make it a bit more correct?

 3) I see there are two '#ifdef KEEPONLYALNUM blocks right next to each
other in trgm.h - maybe it'd be a good idea to join them?

 4) The two new method prototypes at the end of trgm.h use different
indendation than the rest (spaces only instead of tabs).

 These issues are fixed in attached patch.
 Additionally 3 new macros are introduced: MAX_RESULT_STATES,
 MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are limiting resources usage during
 regex processing.

 --
 With best regards,
 Alexander Korotkov.


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-19 Thread Alexander Korotkov

Hi!

New version of patch is attached. Changes are following:
1) Right way to convert from pg_wchar to multibyte.
2) Optimization of producing CFNA-like graph on trigrams (produce smaller,
but equivalent, graphs in less time).
3) Comments and refactoring.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-19 Thread Erik Rijkers

On Mon, November 19, 2012 22:58, Alexander Korotkov wrote:

 New version of patch is attached.

Hi Alexander,

I get some compile-errors:

(Centos 6.3, Linux 2.6.32-279.14.1.el6.x86_64 GNU/Linux, gcc (GCC) 4.7.2)

make contrib
trgm_regexp.c:73:2: error: unknown type name TrgmStateKey
make[1]: *** [trgm_regexp.o] Error 1
make: *** [all-pg_trgm-recurse] Error 2
trgm_regexp.c:73:2: error: unknown type name TrgmStateKey
make[1]: *** [trgm_regexp.o] Error 1
make: *** [install-pg_trgm-recurse] Error 2


Did I forget something?

Thanks,

Erik Rijkers







-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-19 Thread Tomas Vondra

On 19.11.2012 22:58, Alexander Korotkov wrote:
 Hi!
 
 New version of patch is attached. Changes are following:
 1) Right way to convert from pg_wchar to multibyte.
 2) Optimization of producing CFNA-like graph on trigrams (produce
 smaller, but equivalent, graphs in less time).
 3) Comments and refactoring.

Hi,

thanks for the updated message-id. I've done the initial review:

1) Patch applies fine to the master.

2) It's common to use upper-case names for macros, but trgm.h defines
   macro iswordchr - I see it's moved from trgm_op.c but maybe we
   could make it a bit more correct?

3) I see there are two '#ifdef KEEPONLYALNUM blocks right next to each
   other in trgm.h - maybe it'd be a good idea to join them?

4) The two new method prototypes at the end of trgm.h use different
   indendation than the rest (spaces only instead of tabs).

5) There are no regression tests / updated docs (yet).

6) It does not compile - I do get a bunch of errors like this

trgm_regexp.c:73:2: error: expected specifier-qualifier-list before
‘TrgmStateKey’
trgm_regexp.c: In function ‘addKeys’:
trgm_regexp.c:291:24: error: ‘TrgmState’ has no member named ‘keys’
trgm_regexp.c:304:10: error: ‘TrgmState’ has no member named ‘keys’
...

It seems this is cause by the order of typedefs in trgm_regexp.c, namely
TrgmState referencing TrgmStateKey before it's defined. Moving the
TrgmStateKey before TrgmState fixed the issue (I'm using gcc-4.5.4  but
I think it's not compiler-dependent.)

7) Once fixed, it seems to work

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
   SELECT md5(i::text) FROM generate_series(1,100) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';


   QUERY PLAN
-
 Bitmap Heap Scan on test  (cost=16.77..385.16 rows=100 width=33)
   Recheck Cond: (val ~ '.*qqq.*'::text)
   -  Bitmap Index Scan on trgm_idx  (cost=0.00..16.75 rows=100
   width=0)
 Index Cond: (val ~ '.*qqq.*'::text)
(4 rows)

but I do get a bunch of NOTICE messages with debugging info (no matter
if the GIN index is used or not, so it's somewhere in the common regexp
code). But I guess that's due to WIP status.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-11-19 Thread Alexander Korotkov

Some quick comments.

On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

 6) It does not compile - I do get a bunch of errors like this

Fixed.

7) Once fixed, it seems to work

 CREATE EXTENSION pg_trgm ;
 CREATE TABLE TEST (val TEXT);
 INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,100) s(i);
 CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
 ANALYZE test;

 EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';


QUERY PLAN
 -
  Bitmap Heap Scan on test  (cost=16.77..385.16 rows=100 width=33)
Recheck Cond: (val ~ '.*qqq.*'::text)
-  Bitmap Index Scan on trgm_idx  (cost=0.00..16.75 rows=100
width=0)
  Index Cond: (val ~ '.*qqq.*'::text)
 (4 rows)

 but I do get a bunch of NOTICE messages with debugging info (no matter
 if the GIN index is used or not, so it's somewhere in the common regexp
 code). But I guess that's due to WIP status.

It's due to TRGM_REGEXP_DEBUG macro. I disabled it by default. But I think
pieces of code hidden by that macro could be useful for debug even after
WIP status.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.3.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Marti Raudsepp

On Fri, Jan 20, 2012 at 01:33, Erik Rijkers e...@xs4all.nl wrote:
 Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I 
 suppose this is
 normally necessary for perfomance reasons, but it would be useful te be able 
 to compile a test
 version that allows it.

I believe being interruptible is a requirement for the patch to be accepted.

CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Alexander Korotkov

Hi!

Thank you for your feedback!

On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers e...@xs4all.nl wrote:

 The patch yields spectacular speedups with small, simple-enough regexen.
  But it does not do a
 good enough job when guessing where to use the index and where fall back
 to Seq Scan.  This can
 lead to (also spectacular) slow-downs, compared to Seq Scan.

Could you give some examples of regexes where index scan becomes slower
than seq scan?


 I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in
 your patch), is that
 true?   I can understand you want that value to be low to limit the above
 risk, but now it reduces
 the usability of the feature a bit: one has to split up larger
 char-classes into several smaller
 ones to make a statement use the index: i.e.:

Yes, MAX_COLOR_CHARS is number of maximum character in automata color when
that color is divided to a separated characters. And it's likely there
could be better solution than just have this hard limit.


 Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I
 suppose this is
 normally necessary for perfomance reasons, but it would be useful te be
 able to compile a test
 version that allows it.  I don't know how hard that would be.

I seems that Ctrl-C was impossible because procedure of trigrams
exctraction becomes so long while it is not breakable. It's not difficult
to make this procedure breakable, but actually it just shouldn't take so
long.


 There is also a minor bug, I think, when running with  'set
 enable_seqscan=off'  in combination
 with a too-large regex:

Thanks for pointing. Will be fixed.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Alexander Korotkov

On Fri, Jan 20, 2012 at 8:45 PM, Marti Raudsepp ma...@juffo.org wrote:

 On Fri, Jan 20, 2012 at 01:33, Erik Rijkers e...@xs4all.nl wrote:
  Btw, it seems impossible to Ctrl-C out of a search once it is submitted;
 I suppose this is
  normally necessary for perfomance reasons, but it would be useful te be
 able to compile a test
  version that allows it.

 I believe being interruptible is a requirement for the patch to be
 accepted.

 CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.

Sure. It's easy to fix. But it seems that in this case gin extract_query
method becomes slow (because index scan itself is breakable). So, it just
shouldn't work so long.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Alexander Korotkov

On Fri, Jan 20, 2012 at 12:54 AM, Alexander Korotkov
aekorot...@gmail.comwrote:

 On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas 
 heikki.linnakan...@enterprisedb.com wrote:

 Apart from that, the multibyte issue seems like the big one. Any way
 around that?

 Conversion of pg_wchar to multibyte character is the only way I found to
 avoid serious hacking of existing regexp code. Do you think additional
 function in pg_wchar_tbl which converts pg_wchar back to multibyte
 character is possible solution?

Do you have any notes on it? I could make the patch which adds such
function into core.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Erik Rijkers

On Sat, January 21, 2012 06:26, Alexander Korotkov wrote:
 Hi!

 Thank you for your feedback!

 On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers e...@xs4all.nl wrote:

 The patch yields spectacular speedups with small, simple-enough regexen.
  But it does not do a
 good enough job when guessing where to use the index and where fall back
 to Seq Scan.  This can
 lead to (also spectacular) slow-downs, compared to Seq Scan.

 Could you give some examples of regexes where index scan becomes slower
 than seq scan?



x[aeio]+q takes many minutes, uninterruptible.  It's now running for almost 30 
minutes, I'm sure
it will come back eventually, but I think I'll kill it later on; I suppose you 
get the point  ;-)

Even with {n,m} quantifiers it's easy to hit slowdowns:

   (table azjunk6 is 112 MB, the index 693 MB.)
   (MAX_COLOR_CHARS=4  = your original patch)


instance   tableregex  plan   time

HEAD   azjunk6  x[aeio]{1,3}q  Seq Scan   3566.088 
ms
HEAD   azjunk6  x[aeio]{1,3}q  Seq Scan   3540.606 
ms
HEAD   azjunk6  x[aeio]{1,3}q  Seq Scan   3495.034 
ms
HEAD   azjunk6  x[aeio]{1,3}q  Seq Scan   3510.403 
ms
trgm_regex azjunk6  x[aeio]{1,3}q  Bitmap Heap Scan   3724.131 
ms
trgm_regex azjunk6  x[aeio]{1,3}q  Bitmap Heap Scan   3844.999 
ms
trgm_regex azjunk6  x[aeio]{1,3}q  Bitmap Heap Scan   3835.190 
ms
trgm_regex azjunk6  x[aeio]{1,3}q  Bitmap Heap Scan   3724.016 
ms


HEAD   azjunk6  x[aeio]{1,4}q  Seq Scan   3617.997 
ms
HEAD   azjunk6  x[aeio]{1,4}q  Seq Scan   3644.215 
ms
HEAD   azjunk6  x[aeio]{1,4}q  Seq Scan   3636.976 
ms
HEAD   azjunk6  x[aeio]{1,4}q  Seq Scan   3625.493 
ms
trgm_regex azjunk6  x[aeio]{1,4}q  Bitmap Heap Scan   7885.247 
ms
trgm_regex azjunk6  x[aeio]{1,4}q  Bitmap Heap Scan   8799.082 
ms
trgm_regex azjunk6  x[aeio]{1,4}q  Bitmap Heap Scan   7754.152 
ms
trgm_regex azjunk6  x[aeio]{1,4}q  Bitmap Heap Scan   7721.332 
ms


This is with your patch as is; in instances compiled with higher 
MAX_COLOR_CHARS (I did 6 and 9),
it is of course even easier to dream up a slow regex...



Erik Rijkers






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Heikki Linnakangas


On 22.11.2011 21:38, Alexander Korotkov wrote:

WIP patch with index support for regexp search for pg_trgm contrib is
attached.
In spite of techniques which extracts continuous text parts from regexp,
this patch presents technique of automatum transformation. That allows more
comprehensive trigrams extraction.


Nice!


Current version of patch have some limitations:
1) Algorithm of logical expression extraction on trigrams have high
computational complexity. So, it can become really slow on regexp with many
branches. Probably, improvements of this algorithm is possible.
2) Surely, no perfomance benefit if no trigrams can be extracted from
regexp. It's inevitably.
3) Currently, only GIN index is supported. There are no serious problems,
GiST code for it just not written yet.
4) It appear to be some kind of problem to extract multibyte encoded
character from pg_wchar. I've posted question about it here:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
While I've hardcoded some dirty solution. So
PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
supported yet.


This is pretty far from being in committable state, so I'm going to mark 
this as returned with feedback in the commitfest app. The feedback:


The code badly needs comments. There is no explanation of how the 
trigram extraction code in trgm_regexp.c works. Guessing from the 
variable names, it seems to be some sort of a coloring algorithm that 
works on a graph, but that all needs to be explained. Can this algorithm 
be found somewhere in literature, perhaps? A link to a paper would be nice.


Apart from that, the multibyte issue seems like the big one. Any way 
around that?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Alexander Korotkov

On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 The code badly needs comments. There is no explanation of how the trigram
 extraction code in trgm_regexp.c works.

 Sure. I hoped to find a time for comments before commitfest starts.
Unfortunately I didn't, sorry.


 Guessing from the variable names, it seems to be some sort of a coloring
 algorithm that works on a graph, but that all needs to be explained. Can
 this algorithm be found somewhere in literature, perhaps? A link to a paper
 would be nice.

I hope it's truly novel. At least application to regular expressions. I'm
going to write a paper about it.


 Apart from that, the multibyte issue seems like the big one. Any way
 around that?

Conversion of pg_wchar to multibyte character is the only way I found to
avoid serious hacking of existing regexp code. Do you think additional
function in pg_wchar_tbl which converts pg_wchar back to multibyte
character is possible solution?

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Alexander Korotkov

I also have a question about pg_wchar.

/*
 *---
 * encoding info table
 * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
 *---
 */
pg_wchar_tbl pg_wchar_table[] = {
{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen,
pg_ascii_verifier, 1}, /* PG_SQL_ASCII */
 {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JP */
 {pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen,
pg_euccn_verifier, 2}, /* PG_EUC_CN */
 {pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen,
pg_euckr_verifier, 3}, /* PG_EUC_KR */
 {pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen,
pg_euctw_verifier, 4}, /* PG_EUC_TW */
 {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */
 {pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /*
PG_UTF8 */
 {pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier,
4}, /* PG_MULE_INTERNAL */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN1 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN2 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN3 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN4 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN5 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN6 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN7 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN8 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN9 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN10 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1256 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1258 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN866 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN874 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8R */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1251 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1252 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-5 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-6 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-7 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-8 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1250 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1253 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1254 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1255 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1257 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8U */
 {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
 {0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */
{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */
 {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /*
PG_GB18030 */
{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */
 {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /*
PG_SHIFT_JIS_2004 */
};

What does last 7 zeros in the first column means? No conversion to pg_wchar
is possible from these encodings?

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Alexander Korotkov

On Fri, Jan 20, 2012 at 1:07 AM, Alexander Korotkov aekorot...@gmail.comwrote:

 What does last 7 zeros in the first column means? No conversion to
 pg_wchar is possible from these encodings?

Uh, I see. These encodings is not supported as server encodings.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Erik Rijkers

On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
 On 22.11.2011 21:38, Alexander Korotkov wrote:
 WIP patch with index support for regexp search for pg_trgm contrib is
 attached.
 In spite of techniques which extracts continuous text parts from regexp,
 this patch presents technique of automatum transformation. That allows more
 comprehensive trigrams extraction.

 Nice!


Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I 
found.

The patch yields spectacular speedups with small, simple-enough regexen.  But 
it does not do a
good enough job when guessing where to use the index and where fall back to Seq 
Scan.  This can
lead to (also spectacular) slow-downs, compared to Seq Scan.

I used the following to generate 3 test tables with lines of 80 random chars 
(just in case it's
handy for others to use):

$ cat create_data.sh
#!/bin/sh

for power in 4 5 6
do
  table=azjunk${power}
  index=${table}_trgmrgx_txt_01_idx
  echo -- generating table $table with index $index;
  time perl -E'
  sub ss{ join,@_[ map{rand @_} 1 .. shift ] };
  say(ss(80,a..g, ,h..m, ,n..s, ,t..z))
 for 1 .. 1e'${power}; \
  | psql -aqXc drop table if exists $table;
  create table $table(txt text); copy $table from stdin;
  -- set session maintenance_work_mem='20GB';
  create index $index on $table using gin(txt gin_trgm_ops);
  analyze $table;;
done


\dt+ public.azjunk*
  List of relations
 Schema |  Name   | Type  |   Owner   |  Size   | Description
+-+---+---+-+-
 public | azjunk4 | table | breinbaas | 1152 kB |
 public | azjunk5 | table | breinbaas | 11 MB   |
 public | azjunk6 | table | breinbaas | 112 MB  |
(3 rows)



I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your 
patch), is that
true?   I can understand you want that value to be low to limit the above risk, 
but now it reduces
the usability of the feature a bit: one has to split up larger char-classes 
into several smaller
ones to make a statement use the index: i.e.:
 txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of
 txt ~ 'f[aeiouy]n'



I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and 
sure enough they
used the index for larger classes such as the above, but of course also got 
into problems easier
when quantifiers are added (*, +, {n,m}).

A better calculation to decide index-use seems necessary, and ideally one that 
allows for a larger
MAX_COLOR_CHARS than 4.  Especially quantifiers could perhaps be inspected wrt 
that decision.
IMHO, the functionality would still be very useful when only very simple 
regexen were considered.

Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I 
suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to 
compile a test
version that allows it.  I don't know how hard that would be.




There is also a minor bug, I think, when running with  'set enable_seqscan=off' 
 in combination
with a too-large regex:

$ cat fail.sql:

set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- OK

set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- crashes


 $ psql -f fail.sql
  QUERY PLAN
--
 Bitmap Heap Scan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual 
time=1.011..5.291
rows=131 loops=1)
   Recheck Cond: (txt ~ 'f[aeio]n'::text)
   -  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx  (cost=0.00..52.01 
rows=1 width=0) (actual
time=0.880..0.880 rows=131 loops=1)
 Index Cond: (txt ~ 'f[aeio]n'::text)
 Total runtime: 5.700 ms
(5 rows)

  QUERY PLAN
---
 Seq Scan on azjunk4  (cost=0.00..268.00 rows=1 width=81) (actual 
time=1.491..36.049 rows=164
loops=1)
   Filter: (txt ~ 'f[aeiou]n'::text)
   Rows Removed by Filter: 9836
 Total runtime: 36.112 ms
(4 rows)

  QUERY PLAN
--
 Bitmap Heap Scan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual 
time=0.346..0.927
rows=131 loops=1)
   Recheck Cond: (txt ~ 'f[aeio]n'::text)
   -  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx  (cost=0.00..52.01 
rows=1 width=0) (actual
time=0.316..0.316 rows=131 loops=1)
 Index Cond: (txt ~ 'f[aeio]n'::text)

Re: [HACKERS] WIP: index support for regexp search

2011-12-01 Thread Alexander Korotkov

On Thu, Dec 1, 2011 at 12:29 AM, Robert Haas robertmh...@gmail.com wrote:

 Please add this patch here so it does not get lost in the shuffle:

 https://commitfest.postgresql.org/action/commitfest_view/open


Done.

--
With best regards,
Alexander Korotkov.

Re: [HACKERS] WIP: index support for regexp search

2011-11-30 Thread Robert Haas

On Tue, Nov 22, 2011 at 2:38 PM, Alexander Korotkov
aekorot...@gmail.com wrote:
 WIP patch with index support for regexp search for pg_trgm contrib is
 attached.
 In spite of techniques which extracts continuous text parts from regexp,
 this patch presents technique of automatum transformation. That allows more
 comprehensive trigrams extraction.

Please add this patch here so it does not get lost in the shuffle:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

70 matches

Mail list logo