Re: [HACKERS] trgm regex index peculiarity

2014-04-05 Thread Tom Lane
Erik Rijkers e...@xs4all.nl writes:
 On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
 So thanks to the fast scan patch, I don't think this patch is worth
 pursuing anymore. Unless there are some other test case where this patch
 helps, but the fast scan patch doesn't.

 FWIW, for me the difference (from HEAD) remains quite a bit larger:
 ...
 That's a lot better than the original 140ms vs 5ms  but your laptop's  2.5 ms 
 vs. 1 ms  is perhaps not representative.

I also feel that Alexander's patch is still worth pursuing.  What I see
(in an assert-disabled build of HEAD) is that Erik's slow query takes
about 2 ms to plan and 5.5 ms to execute, while the fast query is about
0.7 ms to plan and 1.3 ms to execute.  With the patch, the slow query is
still about 2 ms to plan but only 0.3 ms to execute, while the fast
query doesn't change much.  There's a lot of noise in these numbers
(successive executions seem to be bouncing around more than usual today),
but still it looks like the runtime gain is impressive percentagewise.

One thing that's interesting is that the planning time is so much more for
the slow query, even though the fast one has an additional WHERE
clause that the planner has to think about.  I haven't tried to do perf
measurements to be sure, but my guess is that this has nothing to do with
GIN or pg_trgm directly, but represents the planner's efforts to get a
selectivity estimate for the ~ operator --- that code works much
differently for patterns that define fixed prefixes vs those that don't.

Anyway, the important point here is that I think the planning time
is helping to mask the fact that there's a pretty useful runtime
improvement.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2014-04-05 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 Next revision of patch is attached. Changes are so:
 1) Notion penalty is used instead of size.
 2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
 MAX_TRGM_COUNT total trigrams count.
 3) Penalties are assigned to particular color trigram classes. I.e.
 separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
 trigram frequencies in Oscar Wilde writings. We can end up with different
 numbers, but I don't think they will be dramatically different.

Committed with cosmetic improvements (adjusting the comments mostly).

The new whitespace penalties look reasonably sane to me.  I wonder though
if WISH_TRGM_PENALTY is too small --- it seems like this code will tend to
select many fewer trigrams than the old code did.  What testing did you do
that led you to select the specific value of 16?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2014-03-28 Thread Heikki Linnakangas
I went back and tried Erik's original test 
(http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squir...@webmail.xs4all.nl). 
With a fresh checkout from master, the difference between the slow and 
fast queries is much less dramatic than Erik reported. The reason is 
that Alexander's GIN fast scan patch is very effective on that query. 
Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On 
my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms, 
and with fast scan disabled (by modifying the source code), 40ms vs 1ms.


So thanks to the fast scan patch, I don't think this patch is worth 
pursuing anymore. Unless there are some other test case where this patch 
helps, but the fast scan patch doesn't.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2014-03-28 Thread Erik Rijkers
On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
 I went back and tried Erik's original test
 (http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squir...@webmail.xs4all.nl).
 With a fresh checkout from master, the difference between the slow and
 fast queries is much less dramatic than Erik reported. The reason is
 that Alexander's GIN fast scan patch is very effective on that query.
 Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On
 my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms,
 and with fast scan disabled (by modifying the source code), 40ms vs 1ms.

 So thanks to the fast scan patch, I don't think this patch is worth
 pursuing anymore. Unless there are some other test case where this patch
 helps, but the fast scan patch doesn't.


for the same 2 statements of my original test:
explain (analyze,buffers) select txt from azjunk6 where txt ~ '^abcd'; -- slow 
(140 ms)
explain (analyze,buffers) select txt from azjunk6 where txt ~  'abcd' and 
substr(txt,1,4) = 'abcd'; -- fast (5 ms)

You mention (from HEAD, I suppose?) a difference of 2.5 ms vs. 1 ms.

FWIW, for me the difference (from HEAD) remains quite a bit larger:

for n in `seq 1 10`; do ./trgm_peculiarity.sh ; done | grep runtime

 Total runtime: 16.167 ms
 Total runtime: 2.188 ms
 Total runtime: 16.902 ms
 Total runtime: 2.203 ms
 Total runtime: 17.486 ms
 Total runtime: 2.201 ms
 Total runtime: 17.663 ms
 Total runtime: 2.441 ms
 Total runtime: 13.555 ms
 Total runtime: 2.204 ms
 Total runtime: 16.862 ms
 Total runtime: 2.225 ms
 Total runtime: 13.207 ms
 Total runtime: 2.550 ms
 Total runtime: 16.768 ms
 Total runtime: 2.172 ms
 Total runtime: 19.259 ms
 Total runtime: 2.180 ms
 Total runtime: 12.934 ms
 Total runtime: 2.198 ms

That's a lot better than the original 140ms vs 5ms  but your laptop's  2.5 ms 
vs. 1 ms  is perhaps not representative.

(for the full plans see below)


Erik Rijkers



--
 Bitmap Heap Scan on azjunk6  (cost=56.77..432.93 rows=100 width=81) (actual 
time=15.898..15.925 rows=2 loops=1)
   Recheck Cond: (txt ~ '^abcd'::text)
   Rows Removed by Index Recheck: 11
   Heap Blocks: exact=13
   Buffers: shared hit=105
   -  Bitmap Index Scan on azjunk6_trgm_re_idx  (cost=0.00..56.75 rows=100 
width=0) (actual time=15.834..15.834 rows=13
loops=1)
 Index Cond: (txt ~ '^abcd'::text)
 Buffers: shared hit=92
 Planning time: 3.304 ms
 Total runtime: 16.179 ms
(10 rows)

Time: 21.103 ms
explain (analyze,buffers) select txt from azjunk6 where txt ~  'abcd' and 
substr(txt,1,4) = 'abcd'; -- fast (5 ms)
   QUERY PLAN
-
 Bitmap Heap Scan on azjunk6  (cost=28.75..405.40 rows=1 width=81) (actual 
time=1.681..2.164 rows=2 loops=1)
   Recheck Cond: (txt ~ 'abcd'::text)
   Rows Removed by Index Recheck: 11
   Filter: (substr(txt, 1, 4) = 'abcd'::text)
   Rows Removed by Filter: 101
   Heap Blocks: exact=113
   Buffers: shared hit=120
   -  Bitmap Index Scan on azjunk6_trgm_re_idx  (cost=0.00..28.75 rows=100 
width=0) (actual time=1.171..1.171 rows=114
loops=1)
 Index Cond: (txt ~ 'abcd'::text)
 Buffers: shared hit=7
 Planning time: 0.516 ms
 Total runtime: 2.183 ms
(12 rows)








-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2014-03-01 Thread Alexander Korotkov
On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I looked at this patch a bit.  It seems like this:
  +  *BLANK_COLOR_SIZE - How much blank character is more frequent
 than
  +  *   other character in average
  + #define BLANK_COLOR_SIZE  32
  is a drastic overestimate.

  OK, I would like to make more reasoning for penalty.
  Let's consider word of length n.
  It contains n+1 trigrams including:
  1 __x trigram
  1 _xx trigram
  1 xx_ trigram
  n - 2 xxx trigrams

  Assume alphabet of size m those trigrams have following average
 frequencies:
  __x 1/((n+1)*m)
  _xx 1/((n+1)*m^2)
  xx_ 1/((n+1)*m^2)
  xxx (n-2)/((n+1)*m^3)

 Hmm, but you're assuming that the m letters are all equally frequent,
 which is surely not true in normal text.

 I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
 but I do have a text file with the collected works of H.P. Lovecraft,
 so I tried analyzing the trigrams in that.  I find that
 * The average word length is 4.78 characters.
 * There are 5652 distinct trigrams (discounting some containing digits),
 the most common one ('  t') occurring 81222 times; the average
 occurrence count is 500.0.
 * Considering only trigrams not containing any blanks, there are
 4937 of them, the most common one ('the') occurring 45832 times,
 with an average count of 266.9.
 * There are (unsurprisingly) exactly 26 trigrams of the form '  x',
 with an average count of 19598.5.
 * There are 689 trigrams of the form ' xx' or 'xx ', the most common one
 (' th') occurring 58200 times, with an average count of 1450.1.

 So, we've rediscovered the fact that 'the' is the most common word in
 English text ;-).  But I think the significant conclusion for our purposes
 here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
 times more common than the average space-free trigram, and two-space
 trigrams about 19598.5/266.9 = 73.4 times more common.

 So this leads me to the conclusion that we should be using a
 BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
 a square-law rule for two-space trigrams).  Even using your numbers,
 it shouldn't be 32.


Next revision of patch is attached. Changes are so:
1) Notion penalty is used instead of size.
2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
MAX_TRGM_COUNT total trigrams count.
3) Penalties are assigned to particular color trigram classes. I.e.
separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
trigram frequencies in Oscar Wilde writings. We can end up with different
numbers, but I don't think they will be dramatically different.

--
With best regards,
Alexander Korotkov.


trgm-regex-optimize.3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2014-02-09 Thread Alexander Korotkov
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  Revised version of patch with necessary comments.

 I looked at this patch a bit.  It seems like this:

 +  *BLANK_COLOR_SIZE - How much blank character is more frequent than
 +  *   other character in average
 + #define BLANK_COLOR_SIZE  32

 is a drastic overestimate.  Using the program Erik provided to generate
 some sample data, I find that blanks in trigrams are maybe about three
 times more common than other characters.  Possibly Erik's data isn't
 very representative of real text, but if that's the case we'd better
 get a more representative sample case before we start optimizing ...

 Now maybe the above number is okay for this purpose anyway, but if so
 it needs some different justification; maybe call it a penalty?
 But nonetheless it seems like a darn large penalty, particularly for  x
 type trigrams where the value would effectively be squared.  Compared to
 the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
 you might as well forget the sizing approach and just flat out reject
 trigrams containing COLOR_BLANK, because they'll never get through the
 size filter.


It seems to be right decision to me when we have other trigrams can reject
them. But there are cases when we can't.


  I'm inclined to think you need a larger MAX_TRGM_SIZE;
 and WISH_TRGM_SIZE seems mighty small as well, compared to what the
 code would have done before.


OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams

Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)

The ratio of this frequencies with blanks to frequency of xxx is:
__x m^2/(n-2)
_xx and xx_ m/(n-2)

In order to estimate n I've analyzed some classics:
Ernest Hemingway A farewell to arms - 3.8
Leo Tolstoy War and Peace - 5.0

In english with alphabet size = 26 we have

__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4

In russian with alphabet size = 33 we have

__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11

Assuming these calculations is approximate, we can safely use same values
for both languages.
Does such reasoning make sense?

Also, surely this bit:

 !   trgmNFA-totalTrgmCount = (int) totalTrgmSize;

 is flat out wrong?  expandColorTrigrams() expects that
 trgmNFA-totalTrgmCount is the truth, not some penalty-bloated
 abstraction.  The fact that the patch hasn't failed on you completely
 demonstrates that you've not tested any cases where blank-containing
 trigrams got through the filter, reinforcing my feeling that it's
 probably too strict now.


Oh, I wonder how can I leave such weird bug in patch :-(


 I wonder if it would be more appropriate to continue to measure the limit
 MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
 penalized size as the sort key while determining which color trigrams
 to discard first.


Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.

Another thought is that it's not clear that you should apply the same
 penalty to blanks in all three positions.  Because of the padding
 settings, any one word will normally lead to padded strings   a,
  ab, yz , so that spaces in the first position are about twice
 as common as spaces in the other positions.  (It's a little less
 than that because single-letter words produce only   a and  a ;
 but I'd think that such words aren't very common in text that trigrams
 are effective for.)  So I'm thinking the penalty ought to take that
 into account.

 I'm also inclined to think that it might be worth adding a size
 field to ColorTrgmInfo rather than repeatedly calculating the
 size estimate.  We only allow a thousand and some ColorTrgmInfos
 at most, so the extra space wouldn't be that large, and I'm concerned
 by the expense the current patch adds to the sort comparator.


Agree.



 Another point is that this comment:

  * Note: per-color-trigram counts cannot overflow an int so long as
  * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie
 about
  * 1290.  However, the grand total totalTrgmCount might conceivably
  * overflow an int, so we use int64 for that within this routine.

 is no longer valid, or at least it fails to demonstrate that the result
 of getColorTrigramSize() can't overflow an int; indeed I rather fear it
 can.
 The patch failed to even update the variable name in this comment, let
 alone address the substantive question.

 There are some other cosmetic things I didn't like, for instance
 colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
 rename it.  That's about it for substantive comments though.


Thanks, will be fixed.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] trgm regex index peculiarity

2014-02-09 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I looked at this patch a bit.  It seems like this:
 +  *BLANK_COLOR_SIZE - How much blank character is more frequent than
 +  *   other character in average
 + #define BLANK_COLOR_SIZE  32
 is a drastic overestimate.

 OK, I would like to make more reasoning for penalty.
 Let's consider word of length n.
 It contains n+1 trigrams including:
 1 __x trigram
 1 _xx trigram
 1 xx_ trigram
 n - 2 xxx trigrams

 Assume alphabet of size m those trigrams have following average frequencies:
 __x 1/((n+1)*m)
 _xx 1/((n+1)*m^2)
 xx_ 1/((n+1)*m^2)
 xxx (n-2)/((n+1)*m^3)

Hmm, but you're assuming that the m letters are all equally frequent,
which is surely not true in normal text.

I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that.  I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one ('  t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form '  x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.

So, we've rediscovered the fact that 'the' is the most common word in
English text ;-).  But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.

So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams).  Even using your numbers,
it shouldn't be 32.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2014-01-15 Thread Alexander Korotkov
On Fri, Jun 21, 2013 at 5:39 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
  On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers e...@xs4all.nl wrote:
 
  On Fri, June 21, 2013 05:25, Tom Lane wrote:
   Erik Rijkers e...@xs4all.nl writes:
   In a 112 MB test table (containing random generated text) with a trgm
  index (gin_trgm_ops), I consistently get these
   timings:
   select txt from azjunk6 where txt ~ '^abcd';
  130 ms
   select txt from azjunk6
   where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
  3 ms
  
 
  Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and
 'bcd'.
  However trigrams '__a' is much more frequent than '_ab' which in turn is
  much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting
 of
  '__a' and '_ab' and that gives so significant speedup.

  [trgm_regex_optimize.1.patch ]

 Yes, that fixes the problem, thanks.


Revised version of patch with necessary comments.

--
With best regards,
Alexander Korotkov.


trgm-regex-optimize.2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2014-01-15 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 Revised version of patch with necessary comments.

I looked at this patch a bit.  It seems like this:

+  *BLANK_COLOR_SIZE - How much blank character is more frequent than
+  *   other character in average
+ #define BLANK_COLOR_SIZE  32

is a drastic overestimate.  Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters.  Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...

Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a penalty?
But nonetheless it seems like a darn large penalty, particularly for  x
type trigrams where the value would effectively be squared.  Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the sizing approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter.  I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.

Also, surely this bit:

!   trgmNFA-totalTrgmCount = (int) totalTrgmSize;

is flat out wrong?  expandColorTrigrams() expects that
trgmNFA-totalTrgmCount is the truth, not some penalty-bloated
abstraction.  The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.

I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized size as the sort key while determining which color trigrams
to discard first.

Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions.  Because of the padding
settings, any one word will normally lead to padded strings   a,
 ab, yz , so that spaces in the first position are about twice
as common as spaces in the other positions.  (It's a little less
than that because single-letter words produce only   a and  a ;
but I'd think that such words aren't very common in text that trigrams
are effective for.)  So I'm thinking the penalty ought to take that
into account.

I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate.  We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.

Another point is that this comment:

 * Note: per-color-trigram counts cannot overflow an int so long as
 * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
 * 1290.  However, the grand total totalTrgmCount might conceivably
 * overflow an int, so we use int64 for that within this routine.

is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.

There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it.  That's about it for substantive comments though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2013-06-21 Thread Erik Rijkers
On Fri, June 21, 2013 05:25, Tom Lane wrote:
 Erik Rijkers e...@xs4all.nl writes:
 In a 112 MB test table (containing random generated text) with a trgm index 
 (gin_trgm_ops), I consistently get these
 timings:
 select txt from azjunk6 where txt ~ '^abcd';
130 ms
 select txt from azjunk6
 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 ms

 Hm, could you provide a self-contained test case?


yes, sorry.   I tested on a 1M row table:

#!/bin/sh

# create table:
for power in 6;
do
  table=azjunk${power}
  index=${table}_trgm_re_idx
  perl -E'
  sub ss{ join,@_[ map{rand @_} 1 .. shift ] };
  say(ss(80,a..g, ,h..m, ,n..s, ,t..z)) for 1 .. 
1e'${power}; \
  | psql -aqXc 
drop table if exists $table;
create table $table(txt text);
copy $table from stdin;;
  echo set session maintenance_work_mem='1GB';
create index $index on $table using gin (txt gin_trgm_ops);
analyze $table; | psql -qtAX;
done

# test:
echo 
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd';  -- slow (140 ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) 
= 'abcd'; -- fast (5 ms)
 | psql -Xqa





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2013-06-21 Thread Alexander Korotkov
On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Fri, June 21, 2013 05:25, Tom Lane wrote:
  Erik Rijkers e...@xs4all.nl writes:
  In a 112 MB test table (containing random generated text) with a trgm
 index (gin_trgm_ops), I consistently get these
  timings:
  select txt from azjunk6 where txt ~ '^abcd';
 130 ms
  select txt from azjunk6
  where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
 3 ms
 
  Hm, could you provide a self-contained test case?
 

 yes, sorry.   I tested on a 1M row table:

 #!/bin/sh

 # create table:
 for power in 6;
 do
   table=azjunk${power}
   index=${table}_trgm_re_idx
   perl -E'
   sub ss{ join,@_[ map{rand @_} 1 .. shift ] };
   say(ss(80,a..g, ,h..m, ,n..s, ,t..z)) for 1 ..
 1e'${power}; \
   | psql -aqXc 
 drop table if exists $table;
 create table $table(txt text);
 copy $table from stdin;;
   echo set session maintenance_work_mem='1GB';
 create index $index on $table using gin (txt gin_trgm_ops);
 analyze $table; | psql -qtAX;
 done

 # test:
 echo 
 \\timing on
 explain analyze select txt from azjunk6 where txt ~ '^abcd';  -- slow (140
 ms)
 explain analyze select txt from azjunk6 where txt ~ 'abcd' and
 substr(txt,1,4) = 'abcd'; -- fast (5 ms)
  | psql -Xqa


Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
'__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab'
is more frequent than another trigrams because we don't know their
distribution. However, simple assumption that blank character (in pg_trgm
meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE
macro which is used in selectColorTrigrams like count of characters in
COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into
WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove
color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we
are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.

--
With best regards,
Alexander Korotkov.


trgm_regex_optimize.1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2013-06-21 Thread Erik Rijkers
On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
 On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Fri, June 21, 2013 05:25, Tom Lane wrote:
  Erik Rijkers e...@xs4all.nl writes:
  In a 112 MB test table (containing random generated text) with a trgm
 index (gin_trgm_ops), I consistently get these
  timings:
  select txt from azjunk6 where txt ~ '^abcd';
 130 ms
  select txt from azjunk6
  where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
 3 ms
 

 Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
 However trigrams '__a' is much more frequent than '_ab' which in turn is
 much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
 '__a' and '_ab' and that gives so significant speedup.

 [trgm_regex_optimize.1.patch ]

Yes, that fixes the problem, thanks.

Erik Rijkers




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] trgm regex index peculiarity

2013-06-20 Thread Erik Rijkers
9.4devel (but same in 9.3)

In a 112 MB test table (containing random generated text) with a trgm index 
(gin_trgm_ops), I consistently get these timings:

select txt from azjunk6 where txt ~ '^abcd';
   130 ms


select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
   3 ms

(a similar performance difference occurs when using a regex, i.e. 'abc[de]'  )

This difference is so large that I wonder if there is not something wrong in 
the first case. (The returned results are
correct though)

Here are the two explains:

explain analyze select txt from azjunk6 where txt ~ '^abcd';
 QUERY PLAN
-
 Bitmap Heap Scan on azjunk6  (cost=108.78..484.93 rows=100 width=81) (actual 
time=129.557..129.742 rows=1 loops=1)
   Recheck Cond: (txt ~ '^abcd'::text)
   Rows Removed by Index Recheck: 17
   -  Bitmap Index Scan on azjunk6_trgm_re_idx  (cost=0.00..108.75 rows=100 
width=0) (actual time=129.503..129.503 rows=18
loops=1)
 Index Cond: (txt ~ '^abcd'::text)
 Total runtime: 130.008 ms
(6 rows)

explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) 
= 'abcd';
   QUERY PLAN
-
 Bitmap Heap Scan on azjunk6  (cost=56.75..433.40 rows=1 width=81) (actual 
time=2.064..3.379 rows=1 loops=1)
   Recheck Cond: (txt ~ 'abcd'::text)
   Rows Removed by Index Recheck: 14
   Filter: (substr(txt, 1, 4) = 'abcd'::text)
   Rows Removed by Filter: 112
   -  Bitmap Index Scan on azjunk6_trgm_re_idx  (cost=0.00..56.75 rows=100 
width=0) (actual time=1.911..1.911 rows=127
loops=1)
 Index Cond: (txt ~ 'abcd'::text)
 Total runtime: 3.409 ms
(8 rows)


The results in both cases are correct, but does this difference not almost 
amount to a bug?

( Interestingly, the variant WHERE txt ~ 'abcd$'
is as fast as the non-anchored variant )


Thanks,

Erik Rijkers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] trgm regex index peculiarity

2013-06-20 Thread Tom Lane
Erik Rijkers e...@xs4all.nl writes:
 In a 112 MB test table (containing random generated text) with a trgm index 
 (gin_trgm_ops), I consistently get these timings:
 select txt from azjunk6 where txt ~ '^abcd';
130 ms
 select txt from azjunk6
 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 ms

Hm, could you provide a self-contained test case?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers