[HACKERS] multibyte character set in levenshtein function

2010-05-12 Thread Alexander Korotkov
Hackers,

The current version of the levenshtein function in the fuzzystrmatch contrib module
doesn't work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           2
(1 row)

My patch makes this function work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           1
(1 row)

It also avoids a text_to_cstring call.
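
For illustration, a minimal sketch of the difference, using only macros that
appear later in this thread (PG_GETARG_TEXT_PP and friends); this is not the
patch itself:

/* copies the text value into a newly palloc'd C string */
char *str = text_to_cstring(PG_GETARG_TEXT_P(0));

/* works directly on the (possibly packed) varlena data, no copy */
text       *s = PG_GETARG_TEXT_PP(0);
const char *s_data = VARDATA_ANY(s);
int         s_bytes = VARSIZE_ANY_EXHDR(s);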

Regards,
Alexander Korotkov.


fuzzystrmatch.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-05-13 Thread Alexander Korotkov
 On Thu, May 13, 2010 at 6:03 AM, Alvaro Herrera alvhe...@commandprompt.com
 wrote:

 Well, since it's only used in one place, why are you defining a macro at
 all?

In order to structure the code better. My question was about something else: is the
memcmp function a good choice for comparing very short sequences of bytes (1 to 4
bytes)?
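
For reference, the inline alternative being weighed against a plain
memcmp(s, d, len) == 0 call is essentially the following helper (the same
char_cmp that appears later in this thread):

static inline bool
char_cmp(const char *s, const char *d, int len)
{
	/* byte-by-byte comparison; for 1 to 4 bytes the compiler can
	 * unroll this into a couple of instructions, with no call overhead */
	while (len--)
	{
		if (*s++ != *d++)
			return false;
	}
	return true;
}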


Re: [HACKERS] multibyte character set in levenshtein function

2010-05-13 Thread Alexander Korotkov
On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote:

 On a quick look, I didn't like the way you separated the
 pg_database_encoding_max_length() > 1 cases.  There seems to be too
 much common code.  Can that be refactored a bit better?

I did a little refactoring in order to avoid some of the similar code.
I'm not quite sure about my CHAR_CMP macro. Is it a good idea?


fuzzystrmatch-0.2.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-06-07 Thread Alexander Korotkov
Hello Hackers!

I have extended my patch by introducing a levenshtein_less_equal function.
This function has an additional argument max_d and stops calculating when the
distance exceeds max_d. With low values of max_d the function works much faster
than the original one.
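
Schematically, the early exit works like this (an illustrative fragment, not
the patch itself; the actual patch additionally tracks a window [prev_left,
prev_right] of cells that are still within max_d):

/* after computing row j of the notional distance matrix */
bool		alive = false;

for (i = 0; i < m; i++)
{
	if (curr[i] <= max_d)
	{
		alive = true;
		break;
	}
}
if (!alive)
	return max_d + 1;	/* final distance is guaranteed to exceed max_d */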

The example of original levenshtein function usage:

test=# select word, levenshtein(word, 'consistent') as dist from words where
levenshtein(word, 'consistent') <= 2 order by dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein(word, 'consistent') as dist
from words where levenshtein(word, 'consistent') <= 2 order by dist;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=203.652..203.658 rows=5 loops=1)
   Sort Key: (levenshtein(word, 'consistent'::text))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=19.019..203.601 rows=5 loops=1)
         Filter: (levenshtein(word, 'consistent'::text) <= 2)
 Total runtime: 203.723 ms
(6 rows)

Example of levenshtein_less_equal usage in this case:

test=# select word, levenshtein_less_equal(word, 'consistent', 2) as dist
from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by
dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)
test=# explain analyze select word, levenshtein_less_equal(word,
'consistent', 2) as dist from words where levenshtein_less_equal(word,
'consistent', 2) <= 2 order by dist;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=42.198..42.203 rows=5 loops=1)
   Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=5.391..42.143 rows=5 loops=1)
         Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
 Total runtime: 42.292 ms
(6 rows)

In the example above levenshtein_less_equal works about 5 times faster.

With best regards,
Alexander Korotkov.


fuzzystrmatch-0.3.diff.gz
Description: GNU Zip compressed data



[HACKERS] Parameters of GiST indexes

2010-06-07 Thread Alexander Korotkov
Hi hackers,

I found that some parameters of the GiST implementation are built into the code.
For example, the following can be found in backend/utils/adt/tsgistidx.c:

#define SIGLENINT  31			/* >121 => key will toast, so it will not
								 * work !!! */

#define SIGLEN	( sizeof(int4) * SIGLENINT )
#define SIGLENBIT (SIGLEN * BITS_PER_BYTE)

I think that such parameters don't have an optimal value for all cases, and it
would be a great option to let the user define such parameters for a particular
index. For example, the following syntax could be used:

CREATE INDEX name ON table USING gist(column) WITH (SIGLENINT=63);
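
A hypothetical sketch of how this could hook into the existing reloptions
machinery (the option name, default, and bounds here are illustrative, not a
patch):

/* in the module's _PG_init() */
add_int_reloption(RELOPT_KIND_GIST,
				  "siglenint",		/* per-index option name */
				  "Signature length, in int4 words",
				  31,				/* default, as in tsgistidx.c */
				  1,				/* minimum */
				  122);				/* above this the key would toast */

The opclass support functions would then read the value from the index's
rd_options instead of the compile-time #define.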

With best regards,
Korotkov Alexander.


[HACKERS] Using multidimensional indexes in ordinal queries

2010-06-07 Thread Alexander Korotkov
[v1,v2,v3]) <@
cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float])
order by cube(array[v1,v2,v3]) * 4 limit 10;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..39.10 rows=10 width=28) (actual time=5.186..20.047 rows=10 loops=1)
   ->  Index Scan using test_cube_idx on test  (cost=0.00..39100.88 rows=1 width=28) (actual time=5.177..20.012 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500, 500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) * 4)
 Total runtime: 20.145 ms
(5 rows)

So, the result is the same, but the query runs about 150 times faster.
I think that usage of a multidimensional index on ordinal data types can
produce the following benefits:
1) It will be possible to create one multidimensional index instead of many
unidimensional ones. (Of course, it will be slower on simple queries.)
2) On some complex queries (like the one noted above) it can produce a
great performance benefit.

And now here is my question. How large a change in the planner is needed to make
the planner choose a multidimensional index freely, without rewriting the query in
a special manner? In my example I replaced v1 between 480 and 500 and v2
between 480 and 500 with cube(array[v1,v2,v3]) <@
cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float]),
and order by v3 with order by cube(array[v1,v2,v3]) * 4, but it would be a
great option if the planner could consider a plan with the GiST cube index for the
original query.

With best regards,
Alexander Korotkov.


cube.gz
Description: GNU Zip compressed data



Re: [HACKERS] Using multidimensional indexes in ordinal queries

2010-06-21 Thread Alexander Korotkov
On Mon, Jun 21, 2010 at 5:42 PM, Robert Haas robertmh...@gmail.com wrote:

 It seems like you can get more or less the same benefit from a
 multicolumn btree index.  On my system, with the individual btree
 indices, the query ran in 7625 ms; with an additional index on (v1,
 v2, v3), it ran in 94 ms.  I didn't get the same plans as Alexander
 did, though, so it may not really be apples to apples.  See attached
 session trace.

The benefit of the multicolumn btree index was more or less the same as the cube
benefit because of very bad picksplit behavior in this case. I attached a
patch which significantly improves cube index search performance:

test=# explain (analyze, buffers) select * from test where
cube(ARRAY[v1,v2,v3]) <@ cube(ARRAY[480,480,'-inf'::float8],
ARRAY[500,500,'+inf'::float8]) order by cube(ARRAY[v1,v2,v3]) * 4 LIMIT
10;
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..38.07 rows=10 width=28) (actual time=0.495..0.570 rows=10 loops=1)
   Buffers: shared hit=21
   ->  Index Scan using test_cube_idx on test  (cost=0.00..38064.52 rows=1 width=28) (actual time=0.489..0.537 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500, 500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) * 4)
         Buffers: shared hit=21
 Total runtime: 0.659 ms
(7 rows)

Right now this patch greatly increases tree construction time, but I believe that
a picksplit implementation that is good enough for both tree search and tree
construction can be found.


 The trouble is that it's hard to think of
 a way of teaching the planner about these cases without hard-coding
 lots and lots of special-case kludges into the planner.  Still, if
 someone has a clever idea...


I think that two things can be done to improve the situation:
1) Make knngist deal with negative values. I think this will make it easier to
use knngist just for sorting, not only for k-neighbor searching.
2) Let the GiST interface methods take care of multicolumn indexes. I think
that if the cube index from the example above were constructed on separate
columns v1, v2, v3, then it would be easier for the planner to use the cube index
for queries with filters on these columns. I don't know exactly how to do that.


cube.gz
Description: GNU Zip compressed data



Re: [HACKERS] Using multidimensional indexes in ordinal queries

2010-06-22 Thread Alexander Korotkov
 On Tue, Jun 22, 2010 at 1:58 AM, Robert Haas robertmh...@gmail.com wrote:

 It doesn't?   I didn't think it was making any assumptions about the
 ordering data type beyond the fact that it had a default btree
 opclass.

Actually, the return type of the consistent method was replaced by float8.
Negative values are used for the inconsistent state; non-negative values are
used both for consistency and for ordering.
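
Schematically, inside such a consistent function (an illustrative fragment,
not the actual knngist source):

if (!subtree_can_match)
	PG_RETURN_FLOAT8(-1.0);		/* negative: prune this subtree */
else
	PG_RETURN_FLOAT8(min_possible_distance);	/* >= 0: also the ordering key */

where subtree_can_match and min_possible_distance stand for whatever the
particular opclass computes.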


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-13 Thread Alexander Korotkov
Hi!

* levenshtein_internal() and levenshtein_less_equal_internal() are very
  similar. Can you merge the code? We can always use less_equal_internal()
  if the overhead is ignorable. Did you compare them?

With a big value of max_d the overhead is significant. Here is an example on the
american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum
---------
 1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from
words;
   sum
---------
 1386456
(1 row)

Time: 317,821 ms


 * There are many if (!multibyte_encoding) in levenshtein_internal().
  How about split the function into two funcs for single-byte chars and
  multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
  use multi-byte version if the overhead is small.

The multi-byte version was about 4 times slower. But I have
rewritten my CHAR_CMP macro as an inline function, and now it's only about
1.5 times slower.

In a database with multi-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway, I think that overhead is not ignorable. That's why I have split
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.


 * I prefer a struct rather than an array.  4 * m and 3 * m look like
 magic
  numbers for me. Could you name the entries with definition of a struct?
/*
 * For multibyte encoding we'll also store array of lengths of
 * characters and array with character offsets in first string
 * in order to avoid great number of pg_mblen calls.
 */
prev = (int *) palloc(4 * m * sizeof(int));

In this line of code memory is allocated for 4 arrays: prev, curr,
offsets, char_lens. So I have joined offsets and char_lens into a struct. But
I can't join prev and curr because of this trick:
temp = curr;
curr = prev;
prev = temp;
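
The struct itself looks roughly like this (the type name is illustrative;
the code later in this thread refers to the fields as .offset and .length):

typedef struct
{
	int			offset;		/* byte offset of the character in the string */
	int			length;		/* byte length of the character */
} CharLengthAndOffset;

/* one palloc instead of two parallel int arrays */
lengths_and_offsets = (CharLengthAndOffset *)
	palloc(m * sizeof(CharLengthAndOffset));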

* There are some compiler warnings. Avoid them if possible.
 fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
 fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
 this function
 fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
 this function
 fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
 in this function
 fuzzystrmatch.c: In function ‘levenshtein_internal’:
 fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
 this function

Fixed.

* Coding style: Use if (m == 0) instead of if (!m) when the type
 of 'm' is an integer.

Fixed.


 * Need to fix the caution in docs.
 http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
 | Caution: At present, fuzzystrmatch does not work well with
 | multi-byte encodings (such as UTF-8).
 but now levenshtein supports multi-byte encoding!  We should
 mention which function supports mbchars not to confuse users.

I've updated this notice. Also I've added documentation for the
levenshtein_less_equal function.

* (Not an issue for the patch, but...)
  Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
  PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
  Unpacking versions make the core a bit faster.

 Fixed.

With best regards,
Alexander Korotkov.


fuzzystrmatch-0.4.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-21 Thread Alexander Korotkov
On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote:

 This patch still needs some work.  It includes a bunch of stylistic
 changes that aren't relevant to the purpose of the patch.  There's no
 reason that I can see to change the existing levenshtein_internal
 function to take text arguments instead of char *, or to change !m to
 m == 0 in existing code, or to change the whitespace in the comments
 of that function.  All of those need to be reverted before we can
 consider committing this.

I changed the arguments of the function from char * to text * in order to avoid
a text_to_cstring call. The same benefit can be achieved by replacing char * with
char * plus a length.
I changed !m to m == 0 because Itagaki asked me to make it conform to the coding
style. Do you think there is no reason to fix the coding style in existing
code?

 There is a huge amount of duplicated code here.  I think it makes
 sense to have the multibyte version of the function be separate, but
 perhaps we can merge the less-than-or-equal-to bits into the main
 code, so that we only have two copies instead of four.  Perhaps we
 can just add a max_d (max distance) argument to levenshtein_internal;
 if this value is >= 0 then it represents the max allowable
 distance, but if it is < 0 then there is no limit.  Sure, that might
 slow down the existing code a bit, but it might not be significant.
 I'd at least like to see some numbers showing that it is significant
 before we go to this much trouble.

In that case we should add many checks of max_d to the levenshtein_internal
function, which makes the code more complex.
Actually, we could merge all four functions into one function. But such a
function would have many checks for multibyte encoding and max_d. So, I see
four cases here:
1) one function with checks for multibyte encoding and max_d
2) two functions with checks for multibyte encoding
3) two functions with checks for max_d
4) four separate functions
If you prefer case number 3, you should argue your position a little more.

 The code doesn't follow the project coding style.  Braces must be
 uncuddled.  Comment paragraphs will be reflowed unless they begin and
 end with --.  Function definitions should have the type
 declaration on one line and the function name at the start of the
 next.

 Freeing memory with pfree is likely a waste of time; is there any
 reason not to just rely on the memory context reset, as the original
 coding did?

Ok, I'll fix this things.


 I think we might need to remove the acknowledgments section from this
 code.  If everyone who touches this code adds their name, we're
 quickly going to have a mess.  If we're not going to remove the
 acknowledgments section, then please add my name, too, because I've
 already patched this code once...

In that case I think we can leave the original acknowledgments section.


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-21 Thread Alexander Korotkov
should be
 * increased by ins_c + del_c. In the case of movement backwards this
 * diagonal cell value should not be increased.
 * 2) The range of indexes of the previous row where values not greater
 * than max_d are stored is [prev_left, prev_right]. So, only the cells
 * connected with this range should be calculated.
 * For multibyte encoding we should increase the x and y pointers by the
 * results of the pg_mblen function. Also we should use the char_cmp
 * function for character comparison.
 */
for (y = t, j = 1; j < n; y += y_char_len, j++)
{
	int		   *temp;

	y_char_len = pg_mblen(y);

	if (max_d >= 0)
	{
		prev_left = curr_left;
		prev_right = curr_right;
		curr_left = -1;
	}
	else
	{
		prev_left = 1;
	}

	if (delta >= 0)
		curr[0] = min_d + j * (ins_c + del_c);
	else
	{
		if (j <= -delta)
			curr[0] = min_d;
		else
			curr[0] = min_d + (j + delta) * (ins_c + del_c);
	}

	for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset; i < m;
		 x += lengths_and_offsets[i - 1].length, i++)
	{
		if (max_d >= 0)
		{
			d = max_d + 1;

			if (i <= prev_right)
			{
				d = Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0), d);
			}

			if (i == 1 || i > prev_left)
			{
				d = Min(curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0), d);
				d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
												y, y_char_len) ? 0 : sub_c), d);
			}

			curr[i] = d;

			if (curr_left == -1)
				curr_left = i;
			curr_right = i;
			if (i > prev_right)
				break;
		}
		else
		{
			d = Min(Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0),
						curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0)),
					prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
											y, y_char_len) ? 0 : sub_c));
			curr[i] = d;
		}
	}

	if (curr_left == -1)
		break;

	temp = curr;
	curr = prev;
	prev = temp;
}

/*
 * Because the final value was swapped from the previous row to the
 * current row, that's where we'll find it.
 */
d = prev[m - 1];

/*
 * If the last cell wasn't filled then return max_d + 1, otherwise
 * return the final value.
 */
if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
	return max_d + 1;
else
	return d;
}

I tested it with the american-english dictionary of 98,569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum
---------
 1074376
(1 row)

Time: 131,435 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from
words;
   sum
---------
 1074376
(1 row)

Time: 221,078 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from
words;
   sum
---------
 1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't turn out to be faster than
with just a big value of max_d.


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-22 Thread Alexander Korotkov
Such a version with macros and includes could look like this:

#ifdef MULTIBYTE
#define NEXT_X	(x += char_lens[i - 1])
#define NEXT_Y	(y += y_char_len)
#define CMP		(char_cmp(x, char_lens[i - 1], y, y_char_len))
#else
#define NEXT_X	(x++)
#define NEXT_Y	(y++)
#define CMP		(*x == *y)
#endif

static int
levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)
{
	int			m,
				n,
				d;
	int		   *prev;
	int		   *curr;
	int			i,
				j;
	const char *x;
	const char *y;
#ifdef MULTIBYTE
	int		   *char_lens;
	int			y_char_len;
#endif

	/*
	 * We should calculate the number of characters for multibyte encodings
	 */
	m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));
	n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));

	/*
	 * We can transform an empty s into t with n insertions, or a non-empty t
	 * into an empty s with m deletions.
	 */
	if (m == 0)
		return n * ins_c;
	if (n == 0)
		return m * del_c;

	/*
	 * For security concerns, restrict excessive CPU+RAM usage. (This
	 * implementation uses O(m) memory and has O(mn) complexity.)
	 */
	if (m > MAX_LEVENSHTEIN_STRLEN ||
		n > MAX_LEVENSHTEIN_STRLEN)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("argument exceeds the maximum length of %d bytes",
						MAX_LEVENSHTEIN_STRLEN)));

	/* One more cell for initialization column and row. */
	++m;
	++n;

	/*
	 * Instead of building an (m+1)x(n+1) array, we'll use two different
	 * arrays of size m+1 for storing accumulated values. At each step one
	 * represents the previous row and one is the current row of the
	 * notional large array.
	 * For multibyte encoding we'll also store an array of lengths of the
	 * characters of the first string in order to avoid a great number of
	 * pg_mblen calls.
	 */
#ifdef MULTIBYTE
	prev = (int *) palloc(3 * m * sizeof(int));
	curr = prev + m;
	char_lens = prev + 2 * m;
	for (i = 0, x = VARDATA_ANY(s); i < m - 1; i++)
	{
		char_lens[i] = pg_mblen(x);
		x += char_lens[i];
	}
	char_lens[i] = 0;
#else
	prev = (int *) palloc(2 * m * sizeof(int));
	curr = prev + m;
#endif

	/* Initialize the previous row to 0..cols */
	for (i = 0; i < m; i++)
		prev[i] = i * del_c;

	/*
	 * For multibyte encoding we should increase the x and y pointers by the
	 * results of the pg_mblen function. Also we should use the CMP macro
	 * for character comparison.
	 */
	for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)
	{
		int		   *temp;
#ifdef MULTIBYTE
		y_char_len = pg_mblen(y);
#endif

		curr[0] = j * ins_c;

		for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)
		{
			int			ins;
			int			del;
			int			sub;

			ins = prev[i] + ins_c;
			del = curr[i - 1] + del_c;
			sub = prev[i - 1] + (CMP ? 0 : sub_c);
			curr[i] = Min(ins, del);
			curr[i] = Min(curr[i], sub);
		}

		temp = curr;
		curr = prev;
		prev = temp;
	}

	d = prev[m - 1];

	/*
	 * Because the final value was swapped from the previous row to the
	 * current row, that's where we'll find it.
	 */
	return d;
}

What do you think about it?


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-22 Thread Alexander Korotkov
On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote:

 Ah, I see.  That's pretty compelling, I guess.  Although it still
 seems like a lot of code...

I think there is a way to merge single-byte and multi-byte versions of
functions without loss in performance using macros and includes (like in
'backend/utils/adt/like.c'). Do you think it is acceptable in this case?


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-27 Thread Alexander Korotkov
Here is a new version of my patch. It contains the following changes:

1) I've merged the single-byte and multibyte versions of levenshtein_internal and
levenshtein_less_equal_internal using macros and includes.
2) I found that levenshtein takes reasonable time even for long strings.
Here is an example with strings which contain random combinations of words.

test=# select count(*) from words;
 count
-------
 98569
(1 row)

test=# select * from words limit 10;
  id   |                                        word                                         | next_id
-------+-------------------------------------------------------------------------------------+---------
   200 | Agnew diodes drilled composts Wassermann nonprofit adjusting Chautauqua            |   17370
  4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither                    |   70983
  5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal punctuality             |   96685
  6103 | G prompt boron overthrew cricking begonia snuffers blazed                          |   34518
  4707 | Djibouti Mari pencilling approves pointing's pashas magpie rams                    |   71200
 10125 | Lyle Edgardo livers gruelled capable cancels gnawing's inhospitable                |   28018
  9615 | Leos deputy evener yogis assault palatals Octobers surceased                       |   21878
 15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier Algeria              |   19316
 15776 | Sucre battle's misapprehension's predicating nutmegged electrification bowl planks |   65739
 16878 | Updike Forster ragout keenly exhalation grayed switch guava's                      |   43013
(10 rows)

Time: 26,125 ms

Time: 137,061 ms
test=# select avg(length(word)) from words;
         avg
---------------------
 74.6052207083362923
(1 row)

Time: 96,415 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher') <= 10;
  id  |                               word                               | next_id
------+------------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2957,426 ms

test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher', 10) <= 10;
  id  |                               word                               | next_id
------+------------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 84,755 ms

It takes O(max(n, m) * max_d / min(ins_c, del_c)) time in the worst case. That's why
I changed the limitation of this function to max_d <= 255 * min(ins_c, del_c).
3) I found that there is no need to store character offsets in
levenshtein_less_equal_internal_mb. Instead, the first position in the
string where the distance value wasn't greater than max_d can be stored.

4) I found that there is a theoretical maximum distance based on the string
lengths. It represents the case when no characters match. If the theoretical
maximum distance is less than the given maximum distance, we can use the
theoretical maximum distance. It improves the behavior of levenshtein_less_equal
with high max_d, but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher', 1000) <= 10;
  id  |                               word                               | next_id
------+------------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 4102,731 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher') <= 10;
  id  |                               word                               | next_id
------+------------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2983,244 ms
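
For reference, the bound from point 4 as computed in the patch: with m and n
characters and no matching characters at all, the distance cannot exceed

theor_max_d = Min(m * del_c + n * ins_c, (m > n) ?
	(n * sub_c + (m - n) * del_c) : (m * sub_c + (n - m) * ins_c)) - 1;

so any max_d at or above theor_max_d imposes no real constraint, and the
plain algorithm can be used instead.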


With best regards,
Alexander Korotkov.


fuzzystrmatch-0.5.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-29 Thread Alexander Korotkov
I forgot the attribution in the levenshtein.c file.


With best regards,
Alexander Korotkov.


fuzzystrmatch-0.5.1.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] knngist - 0.8

2010-07-30 Thread Alexander Korotkov
I think that queries like this:
select * from test where val <-> 500 < 1 order by val <-> 500;
can also be optimized using knngist. In the case of btree_gist this query can be
easily rewritten:
select * from test where val > 499 and val < 501 order by val <-> 500;
But in pg_trgm it makes it possible to combine different similarity levels
in one query. For example:
select * from test_trgm order by t <-> 'asdf' < 0.5 or t <-> 'qwer' < 0.4;
Is there any chance to handle this syntax also?


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-08-04 Thread Alexander Korotkov
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote:

 I reviewed this code in a fair amount of detail today and ended up
 rewriting it.  In general terms, it's best to avoid changing things
 that are not relevant to the central purpose of the patch.  This patch
 randomly adds a whole bunch of whitespace and removes a whole bunch of
 comments which I find useful and do not want to see removed.  It took
 me about an hour to reverse out all of those changes, which then let
 me get down to the business of actually analyzing the code.

I'm sorry. This is my first patch; I will be more accurate next time. But I
think there is no unified opinion about some changes, like replacing !m
with m == 0.

I think this line is not correct:
if (m != s_bytes && n != t_bytes)
The correct negation of the assumption that both strings are single-byte is
the assumption that at least one string is not single-byte. With this patch the
levenshtein function can calculate the distance incorrectly:
test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)
This line should be rewritten as:
if (m != s_bytes || n != t_bytes)

The attached patch takes the approach of using a fast-path for just
 the innermost loop when neither string contains any multi-byte
 characters (regardless of whether or not the encoding allows
 multi-byte characters).  It's still slower than CVS HEAD, but for
 strings containing no multi-byte characters it's only marginally, if
 at all, slower than previous releases, because 9.0 doesn't contain
 your fix to avoid the extra string copy; and it's faster than your
 implementation even when multi-byte characters ARE present.  It is
 slightly slower on single-byte encoding databases (I tested
 SQL_ASCII), but still faster than previous releases.  It's also quite
 a bit less code duplication.


Can I look at the test which shows that your implementation is faster than
mine when multi-byte characters are present? My tests show the reverse. For
example, with the Russian dictionary from OpenOffice:

With my version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 277,190 ms

With your version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This
function is very fast for long stretches of memory, but for a few bytes an
inline function is faster, because the compiler has more freedom for
optimization. After replacing memcmp with an inline function like the
following in your implementation:

static inline bool char_cmp(const char *s, const char *d, int len)
{
while (len--)
{
if (*s++ != *d++)
return false;
}
return true;
}

Code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 241,272 ms


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-08-04 Thread Alexander Korotkov
Now I think the patch is as good as it can be. :)
I'm going to prepare the less-or-equal function in the same manner as this patch.


With best regards,
Alexander Korotkov.


Re: [HACKERS] knngist - 0.8

2010-08-09 Thread Alexander Korotkov
In GiST, the consistent method supports only filtering strategies. For such
strategies the consistent method returns true if the subtree can contain a
matching node and false otherwise. Knngist also introduces order-by strategies.
For filtering strategies the knngist consistent method returns 0 if the subtree
can contain a matching node and -1 otherwise. For order-by strategies the
knngist consistent method returns the minimal possible distance in the subtree.
I think we can use the consistent method with order-by strategies not only for
ordering but also for filtering. If the query contains an assertion that the
distance is less than some value, then we can call the consistent method with
the order-by strategy and compare the result with the query value in order to
determine whether to scan the subtree. Such an approach can give a benefit when
we need to filter by similarity. For example, in pg_trgm % is used for
similarity filtering, but the similarity threshold is global for the session.
That's why we can't create complex queries which contain similarity filtering
with different thresholds.
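
Schematically (an illustrative fragment, not the knngist source; the helper
name is invented):

/* for each subtree entry during the scan */
float8		dist = consistent_distance(entry, query);	/* order-by strategy */

if (dist < 0)
	continue;			/* inconsistent: skip the subtree */
if (dist > threshold_from_query)
	continue;			/* nothing inside can satisfy the filter */
/* otherwise descend; dist still serves as the ordering key */

This would let each query carry its own similarity threshold instead of
relying on the session-global pg_trgm limit.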


With best regards,
Alexander Korotkov.



On Mon, Aug 2, 2010 at 8:14 PM, Robert Haas robertmh...@gmail.com wrote:

 2010/7/29 Alexander Korotkov aekorot...@gmail.com:
  But, in pg_trgm it makes it possible to combine different similarity
 levels
  in one query. For example:
  select * from test_trgm order by t <-> 'asdf' < 0.5 or t <-> 'qwer' < 0.4;
  Is there any chance to handle this syntax also?

 Maybe I'm missing something, but I don't think that ORDER BY clause
 makes much sense.  OR is going to reduce to a true or false value - and
 it's usually not that interesting to order by a column that can only
 take one of two values.

 Am I confused?

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise Postgres Company



Re: [HACKERS] knngist - 0.8

2010-08-10 Thread Alexander Korotkov
On Tue, Aug 10, 2010 at 1:35 AM, Euler Taveira de Oliveira 
eu...@timbira.com wrote:

 What do you mean by complex queries? You can always use the SET command.
 Sadly
 it doesn't work when you have different thresholds within distinct
 subqueries.
  (In pg_similarity I use this approach to set the function's thresholds).

 I mean exactly different thresholds in distinct subqueries.

What
 I am investigating is a way to build an index with some user-defined
 parameters. (We already have some infra-structure in reloptions for that
 but
 it needs some work to support my idea). I have some half-baked patch that
 I'm
 planning to submit to some of the CFs. Unfortunately, I don't have time for
 it
  ATM.

User-defined parameters for GiST would be a great feature. I'm performing
some experiments with GiST and I'm really feeling the need for it.


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-08-28 Thread Alexander Korotkov
Here is the patch which adds the levenshtein_less_equal function. I'm going to
add it to the current commitfest.


With best regards,
Alexander Korotkov.


On Tue, Aug 3, 2010 at 3:23 AM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov aekorot...@gmail.com
 wrote:
  Now I think patch is as good as can be. :)

 OK, committed.

  I'm going to prepare less-or-equal function in same manner as this patch.

 Sounds good.  Since we're now more than half-way through this
 CommitFest and this patch has undergone quite a bit of change from
 what we started out with, I'd like to ask that you submit the new
 patch for the next CommitFest, to begin September 15th.

 https://commitfest.postgresql.org/action/commitfest_view/open

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise Postgres Company

*** a/contrib/fuzzystrmatch/fuzzystrmatch.c
--- b/contrib/fuzzystrmatch/fuzzystrmatch.c
***************
*** 61,66 **** PG_MODULE_MAGIC;
--- 61,68 ----
   */
  extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
  extern Datum levenshtein(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
  extern Datum metaphone(PG_FUNCTION_ARGS);
  extern Datum soundex(PG_FUNCTION_ARGS);
  extern Datum difference(PG_FUNCTION_ARGS);
***************
*** 92,98 **** soundex_code(char letter)
  #define MAX_LEVENSHTEIN_STRLEN		255

  static int levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c);


  /*
--- 94,100 ----
  #define MAX_LEVENSHTEIN_STRLEN		255

  static int levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c, int max_d);


  /*
***************
*** 202,211 **** rest_of_char_same(const char *s1, const char *s2, int len)
   *		  between supplied strings. Generally
   *		  (1, 1, 1) penalty costs suffices common
   *		  cases, but your mileage may vary.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c)
  {
  	int			m,
  				n,
--- 204,217 ----
   *		  between supplied strings. Generally
   *		  (1, 1, 1) penalty costs suffices common
   *		  cases, but your mileage may vary.
+  *		  Returns an accurate value if max_d < 0 or
+  *		  the actual distance is less than or equal
+  *		  to max_d, otherwise returns a value greater
+  *		  than max_d.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c, int max_d)
  {
  	int			m,
  				n,
***************
*** 219,224 **** levenshtein_internal(text *s, text *t,
--- 225,233 ----
  	const char *s_data;
  	const char *t_data;
  	const char *y;
+ 	const char *prev_x = NULL;
+ 	int			curr_left = 0, curr_right = 0, prev_left, prev_right, d;
+ 	int			delta = 0, min_d = 0, theor_max_d;

  	/* Extract a pointer to the actual character data. */
  	s_data = VARDATA_ANY(s);
***************
*** 238,243 **** levenshtein_internal(text *s, text *t,
--- 247,266 ----
  		return n * ins_c;
  	if (!n)
  		return m * del_c;
+
+ 	/*
+ 	 * There is a theoretical maximum distance based on the string lengths.
+ 	 * It represents the case when no characters are matching. If max_d
+ 	 * reaches this value then we can use the accurate calculation of
+ 	 * distance, which is faster in this case.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		theor_max_d = Min(m * del_c + n * ins_c, (m > n) ?
+ 			(n * sub_c + (m - n) * del_c) : (m * sub_c + (n - m) * ins_c)) - 1;
+ 		if (max_d >= theor_max_d)
+ 			max_d = -1;
+ 	}

  	/*
  	 * For security concerns, restrict excessive CPU+RAM usage. (This
***************
*** 251,256 **** levenshtein_internal(text *s, text *t,
--- 274,296 ----
  			MAX_LEVENSHTEIN_STRLEN)));

  	/*
+ 	 * We can find the minimal distance from the difference of the string
+ 	 * lengths.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		delta = m - n;
+ 		if (delta > 0)
+ 			min_d = delta * del_c;
+ 		else if (delta < 0)
+ 			min_d = -delta * ins_c;
+ 		else
+ 			min_d = 0;
+
+ 		if (min_d > max_d)
+ 			return max_d + 1;
+ 	}
+
+ 	/*
  	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
  	 * we cache all the lengths before starting the main loop -- but if all the
  	 * characters in both strings are single byte, then we skip this and use
***************
*** 286,369 **** levenshtein_internal(text *s, text *t,
  	curr = prev + m;

  	/* Initialize the previous row to 0..cols */
! 	for (i = 0; i < m; i++)
! 		prev[i] = i * del_c;

  	/* Loop through rows of the notional array */
  	for (y = t_data, j = 1; j < n; j++)
  	{
  		int		   *temp;
- 		const char *x = s_data;
  		int			y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;

  		/*
! 		 * First cell must increment sequentially, as we're on the j'th row of
! 		 * the (m+1)x(n+1) array.
! 		 */
! 		curr[0] = j * ins_c;
!
! 		/*
! 		 * This inner loop is critical to performance, so we include a
  		 * fast-path to handle

Re: [HACKERS] multibyte character set in levenshtein function

2010-08-28 Thread Alexander Korotkov
 протоколируются сострадательный отрясший вывалить браманизм
 наполниться агафоновна испольный подтаскивающий запруживавшийся трофимович перетаскиваемый обручавший процентщица передов
 вихрастый отволочённый дискотека пришей распасовывавший полиция возделавший трехглавый битва загазованный
 обовьетесь перехитривший инулин стелить недоброжелательность изотрёте пятиалтынный жабовидный щелочно дефицитность
 сиротиночка хлорбензол вгрызаться прокрашивать никарагуанец валентный понесённый перестегивание воспретительный переименовываемый
 таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий
 слепнул зонт уластить удобрение тормозивший ушибся ошибся мкс сейсмологический лесомелиорация
 рабовладельцем неудачный самовоспламенение карбидный круглопильный кубинцем подлакированный наблюдение исцарапавшийся издёргавший
(10 rows)

test=# select * from phrases2 where levenshtein('таяй раскупорившийся
передислоцируется юлианович праздничный лачужка присыхать опппливший
ффехтовальный добряющий', a) <= 10;
                                                          a
----------------------------------------------------------------------------------------------------------------------
 таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий
(1 row)

Time: 1291,318 ms

test=# select * from phrases2 where levenshtein_less_equal('таяй
раскупорившийся передислоцируется юлианович праздничный лачужка присыхать
опппливший ффехтовальный добряющий', a, 10) <= 10;
                                                          a
----------------------------------------------------------------------------------------------------------------------
 таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий
(1 row)

Time: 55,405 ms

test=# select sum(levenshtein_less_equal('таяй раскупорившийся
передислоцируется юлианович праздничный лачужка присыхать опппливший
ффехтовальный добряющий', a, 200)) from phrases;
   sum
---------
 1091878
(1 row)

Time: 674,734 ms

test=# select sum(levenshtein('таяй раскупорившийся передислоцируется
юлианович праздничный лачужка присыхать опппливший ффехтовальный
добряющий', a)) from phrases;
   sum
---------
 1091878
(1 row)

Time: 673,515 ms


With best regards,
Alexander Korotkov.


Re: [HACKERS] Real-life range datasets

2011-12-23 Thread Alexander Korotkov
Hello,

On Thu, Dec 22, 2011 at 12:51 PM, Benedikt Grundmann 
bgrundm...@janestreet.com wrote:

 I should be able to give you a table with the same characteristics as
 the instruments table but bogus data by replacing all entries in the
 table with random strings of the same length or something like that.
 I can probably take a little bit of time during this or the next week
 to generate such fake real world data ;-)   Is there an ftp site to
 upload the gzipped pg_dump file to?


Thank you very much for your response! I'm going to send you the upload
details soon.

-
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-03 Thread Alexander Korotkov
Hi!

Thanks for your great work reviewing this patch. Now I'm trying to find the
memory corruption bug. Unfortunately it doesn't appear on my system. Can you
check whether this bug remains in the attached version of the patch? If so,
please provide me with information about the system you're running (processor,
OS, etc.).

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.8.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Collect frequency statistics for arrays

2012-01-03 Thread Alexander Korotkov
On Wed, Jan 4, 2012 at 12:33 AM, Noah Misch n...@leadboat.com wrote:

 On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote:
  Thanks for your great work on reviewing this patch. Now I'm trying to
 find
  memory corruption bug. Unfortunately it doesn't appears on my system. Can
  you check if this bug remains in attached version of patch. If so, please
  provide me information about system you're running (processor, OS etc.).

 I get the same diagnostic from this version.  Opteron processor, operating
 system is Ubuntu 8.04 (64-bit).  You're using --enable-cassert, right?

Oh, actually no. Thanks for pointing that out.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-07 Thread Alexander Korotkov
  +  * distribution is negligible there. In the column <@ const case the
  +  * summary frequency of elements is high (otherwise we have selectivity
  +  * close to 0).

 What does the term summary frequency denote?

I meant the sum of the frequencies of the const array elements.

  + /*
  +  * Rest is a average length of elements which aren't present in
mcelem.
  +  */
  + rest = avg_length;

 You define rest here as an array length ...

  +
  + default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2);
  +
  + mcelem_index = 0;
  +
  + /*
  +  * mult is the multiplier that presents estimate of probability that each
  +  * mcelem which is not present in constant doesn't occur.
  +  */
  + mult = 1.0f;
  +
  + for (i = 0; i < nitems; i++)
  + {
  +     bool    found = false;
  +
  +     /* Comparison with previous value in order to guarantee uniqueness */
  +     if (i > 0)
  +     {
  +         if (!element_compare(array_data[i - 1], array_data[i]))
  +             continue;
  +     }
  +
  +     /*
  +      * Iterate over mcelem until we find an mcelem that is greater or
  +      * equal to the element of the constant. Simultaneously take care of
  +      * rest and mult. If that mcelem is found then fill the corresponding
  +      * elem_selec.
  +      */
  +     while (mcelem_index < nmcelem)
  +     {
  +         int     cmp = element_compare(mcelem[mcelem_index], array_data[i]);
  +
  +         if (cmp < 0)
  +         {
  +             mult *= (1.0f - numbers[mcelem_index]);
  +             rest -= numbers[mcelem_index];

 ... But here, you're subtracting a frequency from an array length?


Yes, because average distinct element count is summ of frequencies of
elements. Substracting mcelem frequencies from avg_length we have summ of
frequencies of non-mcelem elements.

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.9.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Collect frequency statistics for arrays

2012-01-17 Thread Alexander Korotkov
-eq_opr)
  + return (Selectivity) 0.5;

 Punting on other operators this way creates a plan quality regression for
 operations like const < ANY (column).  Please do it some way that falls
 back on the somewhat-better existing scalararraysel() treatment for this.

I've made calc_scalararraysel return -1 in this case or in the case of no
comparison function. scalararraysel continues its work when
calc_scalararraysel returns a negative value.


  + /*
  +  * Calculate the first n distinct element count probabilities from the
  +  * histogram. We assume that any interval between a and b histogram values
  +  * gives 1 / ((b - a + 1) * (nhist - 1)) probability to values between a
  +  * and b, and half of that to a and b themselves. Returns the total
  +  * probability that the distinct element count is less or equal to n.
  +  */
  + static float
  + calc_hist(Datum *hist, int nhist, float *hist_part, int n)

 To test this function, I ran the following test case:

 set default_statistics_target = 4;
 create table t3 as select array(select * from generate_series(1, v)) as arr
 from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100);
 analyze t3; -- length_histogram_bounds = {2,2,5,5}
 select * from t3 where arr <@ array[6,7,8,9,10,11];

 Using gdb to observe calc_hist()'s result during the last command:

 (gdb) p calc_hist(hist, nhist, hist_part, unique_nitems)
 $23 = 0.66687
 (gdb) x/6f hist_part
 0xcd4bc8:   0   0   0.33343 0
 0xcd4bd8:   0   0.33343

 I expected an equal, nonzero probability in hist_part[3] and hist_part[4]
 and
 a total probability of 1.0.

  + {
  + 	int			k,
  + 				i = 0,
  + 				prev_interval = 0,
  + 				next_interval = 0;
  + 	float		frac,
  + 				total = 0.0f;
  +
  + 	/*
  + 	 * frac is a probability contribution by each interval between histogram
  + 	 * values. We have nhist - 1 intervals. Contribution of one will be 1 /
  + 	 * (nhist - 1).
  + 	 */
  + 	frac = 1.0f / ((float) (nhist - 1));
  + 	for (k = 0; k <= n; k++)
  + 	{
  + 		int			count = 0;
  +
  + 		/* Count occurrences of k distinct element counts in histogram. */
  + 		while (i < nhist && DatumGetInt32(hist[i]) <= k)
  + 		{
  + 			if (DatumGetInt32(hist[i]) == k)
  + 				count++;
  + 			i++;
  + 		}
  +
  + 		if (count > 0)
  + 		{
  + 			float		val;
  +
  + 			/* Find length between current histogram value and the next one */
  + 			if (i < nhist)
  + 				next_interval = DatumGetInt32(hist[i + 1]) -

 Doesn't this read past the array end when i == nhist - 1?

It was a bug. It also caused the wrong histogram calculation you noted above.
Fixed.


  + stats->extra_data = extra_data->std_extra_data;
  + old_context = CurrentMemoryContext;
  + extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows);
  + MemoryContextSwitchTo(old_context);

 Is the callee known to change CurrentMemoryContext and not restore it?
 Offhand, I'm not seeing how it could do so.

Right. Saving of memory context is not needed. Removed.

 *** a/src/include/catalog/pg_type.h
  --- b/src/include/catalog/pg_type.h

 This now updates all array types except record[].  I don't know offhand how
 to even make a non-empty value of type record[], let alone get it into a
 context where ANALYZE would see it.  However, is there a particular reason
 to
 make that one different?

Oh, I didn't update all array types in 2 tries :) Fixed.

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.11.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Alexander Korotkov
On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 The code badly needs comments. There is no explanation of how the trigram
 extraction code in trgm_regexp.c works.

Sure. I hoped to find time for comments before the commitfest started.
Unfortunately I didn't, sorry.


 Guessing from the variable names, it seems to be some sort of a coloring
 algorithm that works on a graph, but that all needs to be explained. Can
 this algorithm be found somewhere in literature, perhaps? A link to a paper
 would be nice.

I hope it's truly novel, at least in its application to regular expressions. I'm
going to write a paper about it.


 Apart from that, the multibyte issue seems like the big one. Any way
 around that?

Converting pg_wchar back to multibyte characters is the only way I found to
avoid serious hacking of the existing regexp code. Do you think an additional
function in pg_wchar_tbl which converts pg_wchar back to a multibyte
character is a possible solution?
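
A hypothetical shape for such a function, mirroring the existing forward
converter entries in pg_wchar_tbl (the typedef name is invented for
illustration):

/*
 * Inverse of mb2wchar_with_len: encode a pg_wchar string back into
 * multibyte bytes, returning the number of bytes written.
 */
typedef int (*wchar2mb_with_len_converter) (const pg_wchar *from,
											unsigned char *to, int len);

Encodings that cannot act as server encodings would simply leave the entry
NULL.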

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Alexander Korotkov
I also have a question about pg_wchar.

/*
 *---
 * encoding info table
 * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
 *---
 */
pg_wchar_tbl pg_wchar_table[] = {
{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen,
pg_ascii_verifier, 1}, /* PG_SQL_ASCII */
 {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JP */
 {pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen,
pg_euccn_verifier, 2}, /* PG_EUC_CN */
 {pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen,
pg_euckr_verifier, 3}, /* PG_EUC_KR */
 {pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen,
pg_euctw_verifier, 4}, /* PG_EUC_TW */
 {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */
 {pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /*
PG_UTF8 */
 {pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier,
4}, /* PG_MULE_INTERNAL */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN1 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN2 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN3 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN4 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN5 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN6 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN7 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN8 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN9 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN10 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1256 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1258 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN866 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN874 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8R */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1251 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1252 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-5 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-6 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-7 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-8 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1250 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1253 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1254 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1255 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1257 */
 {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8U */
 {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
 {0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */
{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */
 {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /*
PG_GB18030 */
{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */
 {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /*
PG_SHIFT_JIS_2004 */
};

What do the last 7 zeros in the first column mean? Is no conversion to pg_wchar
possible from these encodings?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-01-19 Thread Alexander Korotkov
On Fri, Jan 20, 2012 at 1:07 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 What does last 7 zeros in the first column means? No conversion to
 pg_wchar is possible from these encodings?

Uh, I see. These encodings are not supported as server encodings.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Alexander Korotkov
Hi!

Thank you for your feedback!

On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers e...@xs4all.nl wrote:

 The patch yields spectacular speedups with small, simple-enough regexen.
  But it does not do a
 good enough job when guessing where to use the index and where to fall back
 to Seq Scan.  This can
 lead to (also spectacular) slow-downs, compared to Seq Scan.

Could you give some examples of regexes where an index scan becomes slower
than a seq scan?


 I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in
 your patch), is that
 true?   I can understand you want that value to be low to limit the above
 risk, but now it reduces
 the usability of the feature a bit: one has to split up larger
 char-classes into several smaller
 ones to make a statement use the index: i.e.:

Yes, MAX_COLOR_CHARS is the maximum number of characters in an automaton color
for which that color is expanded into separate characters. And likely there
could be a better solution than just having this hard limit.


 Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I
 suppose this is
 normally necessary for performance reasons, but it would be useful to be
 able to compile a test
 version that allows it.  I don't know how hard that would be.

It seems that Ctrl-C was impossible because the trigram extraction procedure
runs for a long time and is not interruptible. It's not difficult to make this
procedure interruptible, but actually it just shouldn't take so long.


 There is also a minor bug, I think, when running with  'set
 enable_seqscan=off'  in combination
 with a too-large regex:

Thanks for pointing that out. It will be fixed.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Alexander Korotkov
On Fri, Jan 20, 2012 at 8:45 PM, Marti Raudsepp ma...@juffo.org wrote:

 On Fri, Jan 20, 2012 at 01:33, Erik Rijkers e...@xs4all.nl wrote:
  Btw, it seems impossible to Ctrl-C out of a search once it is submitted;
 I suppose this is
  normally necessary for perfomance reasons, but it would be useful te be
 able to compile a test
  version that allows it.

 I believe being interruptible is a requirement for the patch to be
 accepted.

 CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.

Sure, it's easy to fix. But it seems that in this case the GIN extract_query
method itself is what's slow (the index scan itself is interruptible). So, it
just shouldn't run so long.
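
For reference, the mechanical part of the fix is just a cancellation check in
the long loop (a sketch; process_next_state is a stand-in for the real unit of
work):

while (have_more_states)
{
	CHECK_FOR_INTERRUPTS();		/* make the extraction loop cancellable */
	process_next_state();
}

The real issue, though, is why extract_query runs long enough to need it.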

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-01-20 Thread Alexander Korotkov
On Fri, Jan 20, 2012 at 12:54 AM, Alexander Korotkov
aekorot...@gmail.com wrote:

 On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas 
 heikki.linnakan...@enterprisedb.com wrote:

 Apart from that, the multibyte issue seems like the big one. Any way
 around that?

 Conversion of pg_wchar to multibyte character is the only way I found to
 avoid serious hacking of existing regexp code. Do you think additional
 function in pg_wchar_tbl which converts pg_wchar back to multibyte
 character is possible solution?

Do you have any notes on it? I could make a patch which adds such a
function into core.
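
For concreteness, a sketch of what such an addition could look like (the new
typedef and field are illustrative only, not from any committed code):

    /* proposed: the reverse of the mb2wchar converter */
    typedef int (*wchar2mb_with_len_converter) (const pg_wchar *from,
                                                unsigned char *to, int len);

    typedef struct
    {
        mb2wchar_with_len_converter mb2wchar_with_len;
        wchar2mb_with_len_converter wchar2mb_with_len;  /* new field */
        mblen_converter mblen;
        dsplen_converter dsplen;
        mbverifier      mbverify;
        int             maxmblen;
    } pg_wchar_tbl;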

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-22 Thread Alexander Korotkov
Hi!

Updated patch is attached. I've updated the comment
of mcelem_array_contained_selec with a more detailed description of the
probability distribution assumption. Also, I found that the behaviour of
rest is better described by a Poisson distribution; relevant changes were
made.
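
(For reference, the standard Poisson probability of observing k occurrences
given an expected count lambda is

    P(k) = exp(-lambda) * lambda^k / k!

This is the textbook distribution referred to above, stated here for
convenience; it is not a quote from the patch.)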

On Tue, Jan 17, 2012 at 2:33 PM, Noah Misch n...@leadboat.com wrote:

 By summary frequency of elements, do you mean literally P_0 + P_1 ... +
 P_N?
 If so, I can follow the above argument for column && const and column <@
 const, but not for column @> const.  For column @> const, selectivity
 cannot exceed the smallest frequency among const elements.  A number of
 high-frequency elements will drive up the sum of the frequencies without
 changing the true selectivity much at all.

Referring to the summary frequency is not really correct; it would be more
correct to refer to the number of elements in const. When there are many
elements in const, column @> const selectivity tends to be close to 0,
and column <@ const selectivity tends to be close to 1. Surely, that's true
when the elements have middle-range frequencies (not very close to 0 and not
very close to 1). I've replaced the summary frequency of elements with the
number of elements.

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.12.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Collect frequency statistics for arrays

2012-01-23 Thread Alexander Korotkov
On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote:

  + /* Take care about events with low probabilities. */
  + if (rest > DEFAULT_CONTAIN_SEL)
  + {

 Why the change from rest > 0 to this in the latest version?

Earlier, addition of the rest distribution required O(m) time. Now there is
a more accurate and proven estimate, but it takes O(m^2) time. It doesn't
make the overall asymptotic time worse, but it is significant. That's why I
decided to skip it for low values of rest, which don't change the
distribution significantly.



  + /* emit some statistics for debug purposes */
  + elog(DEBUG3, "array: target # mces = %d, bucket width = %d, "
  +      "# elements = %llu, hashtable size = %d, usable entries = %d",
  +      num_mcelem, bucket_width, element_no, i, track_len);

 That should be UINT64_FMT.  (I introduced that error in v0.10.)


 I've attached a new version that includes the UINT64_FMT fix, some edits of
 your newest comments, and a rerun of pgindent on the new files.  I see no
 other issues precluding commit, so I am marking the patch Ready for
 Committer.

Great!


 If I made any of the comments worse, please post another update.

Changes look reasonable to me. Thanks!

--
With best regards,
Alexander Korotkov.


Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2012-01-24 Thread Alexander Korotkov
Hi!

New version of patch is attached.
On Thu, Dec 22, 2011 at 11:46 AM, Jeff Davis pg...@j-davis.com wrote:

 A few comments:

 * In range_gist_picksplit, it would be nice to have a little bit more
 intuitive description of what's going on with the nonEmptyCount and
 nonInfCount numbers. For instance, it appears to depend on the fact that
 a range must either be in nonEmptyCount or in nonInfCount. Also, can you
 describe the reason you're multiplying by two and taking the absolute
 value? It seems to work, but I am missing the intuition behind those
 operations.

total_count - 2*nonEmptyCount = (total_count - nonEmptyCount) -
nonEmptyCount = emptyCount - nonEmptyCount
So, it's really not evident. I've simplified it.



 * The penalty function is fairly hard to read still. At a high level, I
 think we're trying to accomplish a few things (in order from most to
 least important):
  - Keep normal ranges separate.
  - Avoid broadening the class of the original predicate (e.g. turning
 single-infinite into double-infinite).
  - Avoid broadening (as determined by subtype_diff) the original
 predicate.
  - Favor adding ranges to narrower original predicates.

 Do you agree? If so, perhaps we can attack those more directly and it
 might be a little more readable.

 Additionally, the arbitrary numbers might become a problem. Can we
 choose better constants there? They would still be arbitrary when
 compared with real numbers derived from subtype_diff, but maybe we can
 still do better than what's there.

I've changed some comments and added constants for the penalty values.


 * Regarding the leftover common entries that can go to either side:
 what is the delta measure trying to accomplish? When I work through
 some examples, it seems to favor putting larger common ranges on the
 left (small delta) and smaller common ranges on the right (smaller
 delta). Why is that good? Or did I misread the code?

 Intuitively, I would think that we'd want to push the ranges with lower
 upper bounds to the left and higher lower bounds to the right -- in
 other words, recurse. Obviously, we'd need to make sure it terminated at
 some point, but splitting the common entries does seem like a smaller
 version of the original problem. Thoughts?

That was a bug. Actually, no abs is needed. Indeed, it doesn't affect the
result significantly.

--
With best regards,
Alexander Korotkov.


rangetypegist-0.6.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2012-01-29 Thread Alexander Korotkov
On Mon, Jan 30, 2012 at 1:39 AM, Jeff Davis pg...@j-davis.com wrote:

 Thank you for the updates. I have a small patch attached.

 The only code change I made was very minor: I changed the constants used
 in the penalty function because your version used INFINITE_BOUND_PENALTY
 when adding an empty range, and that didn't quite make sense to me. If
 I'm mistaken you can leave it as-is.

 I also attached range-gist-test.sql, which I used for a performance
 test. I mix various types of ranges together in a larger table of 1.1M
 tuples. And then I create a smaller table that only contains normal
 ranges and empty ranges. There are two tests:
  1. Create an index on the big table
  2. Do a range join (using overlaps rather than equals) where the
 smaller table is on the outer side of a nested loop join and an index
 scan over the larger table on the inner.

 The index creation time reduces by a small amount with the patch, from
 around 16s without the patch to around 13s with the patch. The query
 time, however, dropped from around 26s to around 14s! Almost 2x speedup
 with the patch!

 Moreover, looking at the loop timing in the explain analyze output, it
 goes from about 7..24 ms per loop down to about 1.5..13 ms per loop.
 That seems to indicate that the index distribution is better, with more
 queries returning quickly.

 So, great work Alexander! Very convincing results.

Great! Thank you for reviewing this patch!


 Marking ready for committer, but please apply my comment fixes at your
 discretion.

Patch with your comment fixes is attached.

--
With best regards,
Alexander Korotkov.


rangetypegist-0.7.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] spgist text_ops and LIKE

2012-02-02 Thread Alexander Korotkov
On Thu, Feb 2, 2012 at 10:21 PM, Robert Haas robertmh...@gmail.com wrote:

 I'm having trouble figuring out under what set of circumstances spgist
 is expected to be the best available alternative.  It only supports a
 small subset of the data types that GiST does, so I suppose the point
 is that it should be faster for the cases that it does handle.  And,
 considering that this is all brand-new code, the fact that it's almost
 keeping pace with btree on both pattern-matching and equality
 comparisons is certainly respectable -- but I so far haven't found any
 cases where it's a clear win.  There's limited glory in being the
 almost-fastest way of indexing for a certain class of queries.


I think the current implementation of the suffix tree in spgist (in
particular the implementation of the consistent methods) is far from
revealing the full potential of a suffix tree. I see at least two additional
search types which can be evaluated efficiently using a suffix tree:
1) Search by prefix regular expression. For example, search for
/^a(bc|d)e[fgh]/ in a single index scan.
2) Search by Levenshtein distance, i.e. SELECT * FROM tbl WHERE
levenshtein(col, 'some constant') < k; using an index (surely the actual
syntax for such an index scan would be another).
These are only the two additional applications which come to mind first;
there could be others. Unfortunately, I don't have enough time for it now,
but I'd like to put my hands on this in the future.

Admittedly, I haven't tried the point-in-box stuff yet.


In short, I believe spgist for geometrical datatypes is more CPU-efficient
and less IO-efficient than gist. Gist has to call the consistent method for
all index tuples of every page it uses in a scan, while spgist scans smaller
nodes. Sure, comprehensive testing is required before drawing a final
conclusion about that.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Fast GiST index build - further improvements

2012-02-02 Thread Alexander Korotkov
On Thu, Sep 8, 2011 at 8:35 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Now that the main GiST index build patch has been committed, there's a few
 further improvements that could make it much faster still:

 Better management of the buffer pages on disk. At the moment, the
 temporary file is used as a heap of pages belonging to all the buffers in
 random order. I think we could speed up the algorithm considerably by
 reading/writing the buffer pages sequentially. For example, when an
 internal page is split, and all the tuples in its buffer are relocated,
 that would be a great chance to write the new pages of the new buffers in
 sequential order, instead of writing them back to the pages freed up by the
 original buffer, which can be spread all around the temp file. I wonder if
 we could use a separate file for each buffer? Or at least, a separate file
 for all buffers that are larger than, say 100 MB in size.

 Better management of in-memory buffer pages. When we start emptying a
 buffer, we currently flush all the buffer pages in memory to the temporary
 file, to make room for new buffer pages. But that's a waste of time, if
 some of the pages we had in memory belonged to the buffer we're about to
 empty next, or that we empty tuples to. Also, if while emptying a buffer,
 all the tuples go to just one or two lower level buffers, it would be
 beneficial to keep more than one page in-memory for those buffers.

 Buffering leaf pages. This I posted on a separate thread already:
 http://archives.postgresql.org/message-id/4e5350db.3060...@enterprisedb.com


 Also, at the moment there is one issue with the algorithm that we have
 glossed over this far: For each buffer, we keep some information in memory,
 in the hash table, and in the auxiliary lists. That means that the amount
 of memory needed for the build scales with the size of the index. If you're
 dealing with very large indexes, hopefully you also have a lot of RAM in
 your system, so I don't think this is a problem in practice. Still, it
 would be nice to do something about that. A straightforward idea would be
 to swap some of the information to disk. Another idea that, simpler to
 implement, would be to completely destroy a buffer, freeing all the memory
 it uses, when it becomes completely empty. Then, if you're about to run out
 of memory (as defined by maintenance_work_mem), you can empty some low
 level buffers to disk to free up some.

Unfortunately, I haven't had enough time to implement any of this before
the 9.2 release. Work on my PhD thesis and web projects takes too much time.

But I think there is one thing we should fix before the 9.2 release. We
assume that a gist index build has at least effective_cache_size/4 of cache.
This assumption could easily be false on highly concurrent systems. I don't
see a way to make a convincing estimate here, but we could document this
behaviour, so users could just tune effective_cache_size for gist index
builds under high concurrency.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-08 Thread Alexander Korotkov
On Tue, Feb 7, 2012 at 11:26 PM, Jay Levitt jay.lev...@gmail.com wrote:

 [*] psql:slowcube.sql:20: ERROR:  node buffer of page being split (121550)
 does not exist

This looks like a bug in the buffering GiST index build that I implemented
during my GSoC 2011 project. It looks especially strange with the following
setting:

 effective_cache_size = 3734MB

because the buffering GiST index build just shouldn't turn on in this case,
when the index fits in cache. I'm going to take a detailed look at this.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-14 Thread Alexander Korotkov
ISTM I found the problem. This piece of code is triggering the error. It
assumes each page has a corresponding initialized buffer. That should be
true because we're inserting index tuples from the top down, while splits
propagate from the bottom up.

if (!found)
{
/*
 * Node buffer should exist at this point. If it didn't exist before,
 * the insertion that caused the page to split should've created it.
 */
elog(ERROR, "node buffer of page being split (%u) does not exist",
     blocknum);
}

But this assumption becomes false when we turn buffering off for the root
page. So, the root page can produce pages without initialized buffers when
it splits.

/*
 * Does specified level have buffers? (Beware of multiple evaluation of
 * arguments.)
 */
#define LEVEL_HAS_BUFFERS(nlevel, gfbb) \
	((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && \
	 (nlevel) != (gfbb)->rootitem->level)

So, I think we should just return silently from the function instead of
triggering an error. Patch is attached.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/access/gist/gistbuildbuffers.c
--- b/src/backend/access/gist/gistbuildbuffers.c
***
*** 607,617  gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
  	if (!found)
  	{
  		/*
! 		 * Node buffer should exist at this point. If it didn't exist before,
! 		 * the insertion that caused the page to split should've created it.
  		 */
! 		elog(ERROR, "node buffer of page being split (%u) does not exist",
! 			 blocknum);
  	}
  
  	/*
--- 607,616 
  	if (!found)
  	{
  		/*
! 		 * A page without a buffer can be produced by a split of the root
! 		 * page. So we just have nothing to do here when there is no buffer.
  		 */
! 		return;
  	}
  
  	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-15 Thread Alexander Korotkov
On Wed, Feb 15, 2012 at 2:54 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  ISTM I found the problem. This piece of code is triggering the error. It
  assumes each page has a corresponding initialized buffer. That should
  be true because we're inserting index tuples from the top down, while
  splits propagate from the bottom up.
  But this assumption becomes false when we turn buffering off for the
  root page. So, the root page can produce pages without initialized
  buffers when it splits.

 Hmm ... can we tighten the error check rather than just remove it?  It
 feels less than safe to assume that a hash-entry-not-found condition
 *must* reflect a corner-case situation like that.  At the very least
 I'd like to see it verify that we'd turned off buffering before deciding
 this is OK.  Better, would it be practical to make dummy entries in the
 hash table even after turning buffers off, so that the logic here
 becomes

if (!found) error;
else if (entry is dummy) return without doing anything;
else proceed;

regards, tom lane


OK, here is another patch that fixes this problem. Instead of removing the
error trigger, it adds empty buffers on root page split when needed.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/access/gist/gistbuild.c
--- b/src/backend/access/gist/gistbuild.c
***
*** 668,677  gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
  	if (is_split && BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO)
  	{
  		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
! 		Page		page = BufferGetPage(buffer);
! 		ItemId		iid;
! 		IndexTuple	idxtuple;
! 		BlockNumber leftmostchild;
  
  		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
  			gfbb->context, sizeof(GISTBufferingInsertStack));
--- 668,678 
  	if (is_split && BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO)
  	{
  		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
! 		Page		 page = BufferGetPage(buffer);
! 		ItemId		 iid;
! 		IndexTuple	 idxtuple;
! 		BlockNumber  leftmostchild;
! 		OffsetNumber maxoff, i;
  
  		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
  			gfbb->context, sizeof(GISTBufferingInsertStack));
***
*** 694,699  gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
--- 695,719 
  		oldroot->parent = gfbb->rootitem;
  		oldroot->blkno = leftmostchild;
  		oldroot->downlinkoffnum = InvalidOffsetNumber;
+ 
+ 		/*
+ 		 * If the root page split produces new pages on a level which has
+ 		 * buffers, then initialize empty buffers there.
+ 		 */
+ 		if (LEVEL_HAS_BUFFERS(oldroot->level, gfbb))
+ 		{
+ 			maxoff = PageGetMaxOffsetNumber(page);
+ 			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+ 			{
+ 				iid = PageGetItemId(page, i);
+ 				idxtuple = (IndexTuple) PageGetItem(page, iid);
+ 				gistGetNodeBuffer(gfbb,
+ 								  buildstate->giststate,
+ 								  ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+ 								  i,
+ 								  gfbb->rootitem);
+ 			}
+ 		}
  	}
  
  	if (splitinfo)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-15 Thread Alexander Korotkov
On Wed, Feb 15, 2012 at 4:26 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Actually, I think it made sense to simply do nothing if the buffer doesn't
 exist. The algorithm doesn't require that all the buffers must exist at all
 times. It is quite accidental that whenever we call
 gistRelocateBuildBuffersOnSplit(), the page must already have its
 buffer created (and as we found out, the assumption doesn't hold after a
 root split, anyway). Also, we talked earlier that it would be good to
 destroy buffers that become completely empty, to save memory. If we do
 that, we'd have to remove that check anyway.

 So, I think we should go with your original fix and simply do nothing in
 gistRelocateBuildBuffersOnSplit() if the page doesn't have a buffer.
 Moreover, if the page has a buffer but it's empty,
 gistRelocateBuildBuffersOnSplit() doesn't need to create buffers for
 the new sibling pages. In the final emptying phase, that's a waste of time,
 the buffers we create will never be used, and even before that I think it's
 better to create the buffers lazily.


I agree.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Alexander Korotkov
On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt jay.lev...@gmail.com wrote:

 - But a dimension might be in any domain, not just floats
 - The distance along each dimension is a domain-specific function


What exact domains do you expect? Some domains could appear to be quite
hard for index-based similarity search using GiST (for example, sets,
strings etc.).

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Alexander Korotkov
On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt jay.lev...@gmail.com wrote:

 Tom Lane wrote:

 Jay Levittjay.lev...@gmail.com  writes:

 - Does KNN-GiST run into problems when-  returns values that don't
 make

 sense in the physical world?


 If the indexed entities are records, it would be
 entirely your own business how you handled individual fields being NULL.


 This turns out to be a bit challenging. Let's say I'm building a
 nullable_point type that allows the Y axis to be NULL (or any sentinel
 value for missing data), where the semantics are NULL is infinitely far
 from the query.   I'll need my GiST functions to return useful results
 with NULL - not just correct results, but results that help partition the
 tree nicely.

 At first I thought this posed a challenge for union; if I have these
 points:

 (1,2)
 (2,1)
 (1,NULL)

 what's the union? I think the answer is to treat NULL box coordinates like
 LL = -infinity, UR = infinity, or (equivalently, I think) to store a
 saw_nulls bit in addition to LL and UR.

 The real challenge is probably in picksplit and penalty - where in the
 tree should I stick (1,NULL)? - at which point you say Yes, algorithms for
 efficient indexes are hard work and computer-science-y and point me at
 surrogate splitters.

 Just thinking out loud, I guess; if other GiST types have addressed this
 problem, I'd love to hear about it.


A similar problem appears in GiST indexing of ranges, because a range can be
empty. There, an additional contain empty flag was introduced. This flag
indicates that an underlying value can be empty, so it is set on a union
with an empty range or with another range that has this flag set. It's
likely you need a similar flag for each dimension.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Alexander Korotkov
On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt jay.lev...@gmail.com wrote:

 Alexander Korotkov wrote:

 On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt jay.lev...@gmail.com
 mailto:jay.lev...@gmail.com wrote:

 At first I thought this posed a challenge for union; if I have these
 points:


(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates
like LL = -infinity, UR = infinity, or (equivalently, I think) to store
a saw_nulls bit in addition to LL and UR.

 A similar problem appears in GiST indexing of ranges, because a range can
 be empty. There, an additional contain empty flag was introduced. This flag
 indicates that an underlying value can be empty, so it is set on a union
 with an empty range or with another range that has this flag set. It's
 likely you need a similar flag for each dimension.


 Ah, yes, exactly the same problem. So what led you to add a flag instead
 of using the range NULL..NULL? I'm on the fence about choosing.


First, range bounds can't be NULL :) Second, if we have range (a;b) +
contain empty in an internal page, both facts:
1) all normal underlying ranges are contained in (a;b);
2) there can be empty underlying ranges;
are useful for search.
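
To make the idea concrete, here is a rough standalone sketch (my own
simplified 1-D illustration with plain doubles, not the actual rangetypes
code):

    typedef struct
    {
        double  lower;
        double  upper;
        int     is_empty;       /* this key covers no non-empty range yet */
        int     contain_empty;  /* some underlying range is empty */
    } SimpleRangeKey;

    static void
    simple_range_union(SimpleRangeKey *dst, const SimpleRangeKey *src)
    {
        if (src->is_empty || src->contain_empty)
            dst->contain_empty = 1;   /* remember empties, don't widen bounds */

        if (!src->is_empty)
        {
            if (dst->is_empty)
            {
                /* first non-empty input: adopt its bounds */
                dst->lower = src->lower;
                dst->upper = src->upper;
                dst->is_empty = 0;
            }
            else
            {
                /* widen by the non-empty part as usual */
                if (src->lower < dst->lower)
                    dst->lower = src->lower;
                if (src->upper > dst->upper)
                    dst->upper = src->upper;
            }
        }
    }

A search for non-empty matches then descends only by the bounds, while a
search that must also find empty values additionally descends into subtrees
with contain_empty set.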

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Incorrect behaviour when using a GiST index on points

2012-02-20 Thread Alexander Korotkov
On Thu, Feb 16, 2012 at 11:43 AM, Alexander Korotkov
aekorot...@gmail.com wrote:

  The described differences lead to incorrect behaviour of the GiST index.
  The question is: what is the correct way to fix it? Should on_pb also use
  FP*, or should the consistent method behave like on_pb?


Any comments on this? Current behaviour definitely indicates a bug, and I'm
ready to fix it. The only question: is this bug in on_pb or gist?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Google Summer of Code? Call for mentors.

2012-02-20 Thread Alexander Korotkov
On Thu, Feb 16, 2012 at 12:56 PM, Dave Page dp...@pgadmin.org wrote:

 On Wed, Feb 15, 2012 at 10:18 PM, Josh Berkus j...@agliodbs.com wrote:
  Hackers,
 
  The call is now open for Google Summer of Code.
 
  If you are interested in being a GSoC mentor this summer, please reply
  to this email.  I want to gauge whether or not we should participate
  this summer.

 I'll play, but I think I'm going to have to be very sure about any
 project before I agree to mentor it this year (not that there was
 anything wrong with my student last year - quite the opposite in fact
 - I just need to be sure I'm spending my time on the right things).


FYI, I found myself to be eligible this year. So, if PostgreSQL
participates this year, I'll make some proposals on indexing.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-20 Thread Alexander Korotkov
On Wed, Feb 15, 2012 at 7:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Wed, Feb 15, 2012 at 4:26 PM, Heikki Linnakangas 
  heikki.linnakan...@enterprisedb.com wrote:
  So, I think we should go with your original fix and simply do nothing in
  gistRelocateBuildBuffersOnSplit() if the page doesn't have a buffer.

  I agree.

 OK, I won't object.


So, I think we can just commit the first version of the fix now.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/access/gist/gistbuildbuffers.c
--- b/src/backend/access/gist/gistbuildbuffers.c
***
*** 607,617  gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
  	if (!found)
  	{
  		/*
! 		 * Node buffer should exist at this point. If it didn't exist before,
! 		 * the insertion that caused the page to split should've created it.
  		 */
! 		elog(ERROR, "node buffer of page being split (%u) does not exist",
! 			 blocknum);
  	}
  
  	/*
--- 607,616 
  	if (!found)
  	{
  		/*
! 		 * A page without a buffer can be produced by a split of the root
! 		 * page. So we just have nothing to do here when there is no buffer.
  		 */
! 		return;
  	}
  
  	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect behaviour when using a GiST index on points

2012-02-20 Thread Alexander Korotkov
On Mon, Feb 20, 2012 at 7:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Thu, Feb 16, 2012 at 11:43 AM, Alexander Korotkov
  aekorot...@gmail.com wrote:
  The described differences lead to incorrect behaviour of the GiST index.
  The question is: what is the correct way to fix it? Should on_pb also use
  FP*, or should the consistent method behave like on_pb?

  Any comments on this? Current behaviour definitely indicates a bug, and
 I'm
  ready to fix it. The only question: is this bug in on_pb or gist?

 I'm inclined to think the right answer is to make on_pb use the FP*
 macros, for consistency with other geometric operators.  But it's worth
 asking whether that will actually fix the problem.  I've thought for
 some time that we'd eventually find cases where geo_ops' use of fuzzy
 comparisons breaks index behavior entirely, because it destroys natural
 assumptions like the transitive law.  So that could eventually lead us
 to rip out the FP* macros everywhere.

 In any case, this doesn't seem like something we could back-patch;
 it'd be a behavioral change in HEAD only.


Analyzing the GiST interface methods in more detail, I found one other place
where fuzzy comparison is used: the gist_box_same function. And it's a
really scary thing. It means that an internal page entry is not extended
when an inserted index tuple extends it by less than EPSILON.
Correspondingly, the current GiST search behaviour is neither exact search
nor fuzzy search with the given EPSILON; it is something in between. See the
following example for proof.

test=# create table test(a box);
CREATE TABLE
test=# insert into test (select (case when i % 2 = 0 then
box(point(1,1),point(1,1)) else box(point(2,2),point(2,2)) end) from
generate_series(1,300) i);
INSERT 0 300
test=# create index test_idx on test using gist(a);
CREATE INDEX
test=#
test=# insert into test values (box(point(1.009,1.009),
point(1.009,1.009)));
INSERT 0 1
test=# select * from test where a && box(point(1.018,1.018),
point(1.018,1.018));
  a
-
 (1.009,1.009),(1.009,1.009)
(1 row)

test=# set enable_seqscan = off;
SET
test=# select * from test where a && box(point(1.018,1.018),
point(1.018,1.018));
 a
---
(0 rows)

But I believe we still can back-patch it in one of the following ways:
1) Fix the consistent and same functions. Add a notice to the release notes
that users should rebuild their indexes if they want correct behaviour.
2) Leave the same function as is. Add kluges to the consistent functions
which provide correct search on the current, not truly correct, trees.

But anyway, +1 for ripping out the FP* macros everywhere in HEAD, because I
really dislike the way FP* are currently implemented. Why is EPSILON
1.0E-06? We don't know anything about the nature of the data that users
store in geometric datatypes. The smallest double precision value is 5E-324.
For some datasets EPSILON can easily cover the whole range.
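
For reference, the macros in question are defined along these lines in
src/include/utils/geo_decls.h:

    #define EPSILON    1.0E-06
    #define FPeq(A,B)  (fabs((A) - (B)) <= EPSILON)

So for a dataset whose coordinates all lie in, say, [0, 1e-6], every pair of
values compares as equal.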

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Incorrect behaviour when using a GiST index on points

2012-02-22 Thread Alexander Korotkov
The attached patch fixes the GiST behaviour without altering the operators'
behaviour.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/access/gist/gistproc.c
--- b/src/backend/access/gist/gistproc.c
***
*** 836,842  gist_box_picksplit(PG_FUNCTION_ARGS)
  }
  
  /*
!  * Equality method
   *
   * This is used for both boxes and points.
   */
--- 836,843 
  }
  
  /*
!  * Equality method. Returns true only when the boxes are exactly the same.
!  * We can't ignore small extents because of index consistency.
   *
   * This is used for both boxes and points.
   */
***
*** 848,856  gist_box_same(PG_FUNCTION_ARGS)
  	bool	   *result = (bool *) PG_GETARG_POINTER(2);
  
  	if (b1  b2)
! 		*result = DatumGetBool(DirectFunctionCall2(box_same,
!    PointerGetDatum(b1),
!    PointerGetDatum(b2)));
  	else
  		*result = (b1 == NULL  b2 == NULL) ? TRUE : FALSE;
  	PG_RETURN_POINTER(result);
--- 849,857 
  	bool	   *result = (bool *) PG_GETARG_POINTER(2);
  
  	if (b1 && b2)
! 		*result = (b1->low.x == b2->low.x && b1->low.y == b2->low.y &&
! 				   b1->high.x == b2->high.x && b1->high.y == b2->high.y)
! 				  ? TRUE : FALSE;
  	else
  		*result = (b1 == NULL && b2 == NULL) ? TRUE : FALSE;
  	PG_RETURN_POINTER(result);
***
*** 1326,1331  gist_point_consistent(PG_FUNCTION_ARGS)
--- 1327,1333 
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
  	bool		result;
  	StrategyNumber strategyGroup = strategy / GeoStrategyNumberOffset;
+ 	BOX		   *query, *key;
  
  	switch (strategyGroup)
  	{
***
*** 1337,1348  gist_point_consistent(PG_FUNCTION_ARGS)
  			*recheck = false;
  			break;
  		case BoxStrategyNumberGroup:
! 			result = DatumGetBool(DirectFunctionCall5(
! 													  gist_box_consistent,
! 													  PointerGetDatum(entry),
! 													  PG_GETARG_DATUM(1),
! 									  Int16GetDatum(RTOverlapStrategyNumber),
! 											   0, PointerGetDatum(recheck)));
  			break;
  		case PolygonStrategyNumberGroup:
  			{
--- 1339,1356 
  			*recheck = false;
  			break;
  		case BoxStrategyNumberGroup:
! 			/*
! 			 * This code repeats the logic of on_pb, which uses simple
! 			 * comparisons rather than the FP* functions.
! 			 */
! 			query = PG_GETARG_BOX_P(1);
! 			key = DatumGetBoxP(entry->key);
! 
! 			*recheck = false;
! 			result = key->high.x >= query->low.x &&
! 					 key->low.x <= query->high.x &&
! 					 key->high.y >= query->low.y &&
! 					 key->low.y <= query->high.y;
  			break;
  		case PolygonStrategyNumberGroup:
  			{

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Alexander Korotkov
On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I am starting to look at this patch now.  I'm wondering exactly why the
 decision was made to continue storing btree-style statistics for arrays,
 in addition to the new stuff.  The pg_statistic rows for array columns
 tend to be unreasonably wide already, and as-is this patch will make
 them quite a lot wider.  I think it requires more than a little bit of
 evidence to continue storing stats that seem to have only small
 probability of usefulness.

 In particular, if we didn't store that stuff, we'd not need to widen the
 number of columns in pg_statistic, which would noticeably reduce both
 the footprint of the patch and the probability of breaking external
 code.


Initially, I used existing slots for new statistics, but I've changed this
after the first review:
http://archives.postgresql.org/pgsql-hackers/2011-07/msg00780.php

Probably btree statistics really do matter for some kinds of arrays? For
example, arrays representing paths in a tree: we could request a subtree
with a range query on such arrays (e.g. path >= '{1,2}' AND path < '{1,3}').

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Alexander Korotkov
On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I am starting to look at this patch now.  I'm wondering exactly why the
  decision was made to continue storing btree-style statistics for arrays,

  Probably btree statistics really do matter for some kinds of arrays? For
  example, arrays representing paths in a tree. We could request a subtree
  with a range query on such arrays.

 That seems like a pretty narrow, uncommon use-case.  Also, to get
 accurate stats for such queries that way, you'd need really enormous
 histograms.  I doubt that the existing parameters for histogram size
 will permit meaningful estimation of more than the first array entry
 (since we don't make the histogram any larger than we do for a scalar
 column).

 The real point here is that the fact that we're storing btree-style
 stats for arrays is an accident, backed into by having added btree
 comparators for arrays plus analyze.c's habit of applying default
 scalar-oriented analysis functions to any type without an explicit
 typanalyze entry.  I don't recall that we ever thought hard about
 it or showed that those stats were worth anything.


OK. I don't object to removing btree stats from arrays.
What do you think about the pg_stats view in this case? Should it combine
the values histogram and the array length histogram in a single column, as
is done for MCV and MCELEM?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Alexander Korotkov
On Thu, Mar 1, 2012 at 1:19 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 That seems like a pretty narrow, uncommon use-case.  Also, to get
 accurate stats for such queries that way, you'd need really enormous
 histograms.  I doubt that the existing parameters for histogram size
 will permit meaningful estimation of more than the first array entry
 (since we don't make the histogram any larger than we do for a scalar
 column).

 The real point here is that the fact that we're storing btree-style
 stats for arrays is an accident, backed into by having added btree
 comparators for arrays plus analyze.c's habit of applying default
 scalar-oriented analysis functions to any type without an explicit
 typanalyze entry.  I don't recall that we ever thought hard about
 it or showed that those stats were worth anything.


 OK. I don't object to removing btree stats from arrays.
 What do you thinks about pg_stats view in this case? Should it combine
 values histogram and array length histogram in single column like do for
 MCV and MCELEM?


Btree statistics for arrays and the additional statistics slot are removed
in the attached version of the patch. The pg_stats view is untouched for
now.

--
With best regards,
Alexander Korotkov.


arrayanalyze-0.13.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-04 Thread Alexander Korotkov
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 1. I'm still unhappy about the loop that fills the count histogram,
 as I noted earlier today.  It at least needs a decent comment and some
 overflow protection, and I'm not entirely convinced that it doesn't have
 more bugs than the overflow issue.


The attached patch is focused on fixing this. The frac variable overflow is
avoided by making it int64. I hope the comments clarify things. In general,
this loop copies the behaviour of the histogram-constructing loop of the
compute_scalar_stats function, but instead of an array of values we have an
array of unique DECs (distinct element counts) with their frequencies.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/utils/adt/array_typanalyze.c
--- b/src/backend/utils/adt/array_typanalyze.c
***
*** 581,587  compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  			DECountItem **sorted_count_items;
  			int			count_item_index;
  			int			delta;
! 			int			frac;
  			float4	   *hist;
  
  			/* num_hist must be at least 2 for the loop below to work */
--- 581,587 
  			DECountItem **sorted_count_items;
  			int			count_item_index;
  			int			delta;
! 			int64		frac;
  			float4	   *hist;
  
  			/* num_hist must be at least 2 for the loop below to work */
***
*** 612,633  compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  			hist[num_hist] = (double) element_no / (double) nonnull_cnt;
  
  			/*
! 			 * Construct the histogram.
! 			 *
! 			 * XXX this needs work: frac could overflow, and it's not clear
! 			 * how or why the code works.  Even if it does work, it needs
! 			 * documented.
  			 */
  			delta = analyzed_rows - 1;
  			count_item_index = 0;
! 			frac = sorted_count_items[0]->frequency * (num_hist - 1);
  			for (i = 0; i < num_hist; i++)
  			{
  				while (frac <= 0)
  				{
  					count_item_index++;
  					Assert(count_item_index < count_items_count);
! 					frac += sorted_count_items[count_item_index]->frequency * (num_hist - 1);
  				}
  				hist[i] = sorted_count_items[count_item_index]->count;
  				frac -= delta;
--- 612,642 
  			hist[num_hist] = (double) element_no / (double) nonnull_cnt;
  
  			/*
! 			 * Construct the histogram of DECs. The object of this loop is to
! 			 * copy the max and min DECs and evenly-spaced DECs in between
! 			 * (space here is the number of arrays corresponding to a DEC). If
! 			 * we imagine an ordered array of DECs where each input array has
! 			 * a corresponding DEC item, the i'th value of the histogram will
! 			 * be DECs[i * (analyzed_rows - 1) / (num_hist - 1)]. But instead
! 			 * of such an array we have sorted_count_items, which holds the
! 			 * unique DEC values with their frequencies. The frac variable can
! 			 * be thought of as (index in DECs corresponding to the next
! 			 * sorted_count_items element - index in DECs corresponding to the
! 			 * last histogram value) * (num_hist - 1). A negative fraction
! 			 * makes us advance over sorted_count_items.
  			 */
  			delta = analyzed_rows - 1;
  			count_item_index = 0;
! 			frac = (int64) sorted_count_items[0]->frequency *
! 				   (int64) (num_hist - 1);
  			for (i = 0; i < num_hist; i++)
  			{
  				while (frac <= 0)
  				{
  					count_item_index++;
  					Assert(count_item_index < count_items_count);
! 					frac += (int64) sorted_count_items[count_item_index]->frequency *
! 							(int64) (num_hist - 1);
  				}
  				hist[i] = sorted_count_items[count_item_index]->count;
  				frac -= delta;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-04 Thread Alexander Korotkov
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 2. The tests in the above-mentioned message show that in most cases
 where mcelem_array_contained_selec falls through to the rough
 estimate, the resulting rowcount estimate is just 1, ie we are coming
 out with very small selectivities.  Although that path will now only be
 taken when there are no stats, it seems like we'd be better off to
 return DEFAULT_CONTAIN_SEL instead of what it's doing.  I think there
 must be something wrong with the rough estimate logic.  Could you
 recheck that?


I think the problem with the rough estimate is that the assumption of
independent occurrences of items is very unsuitable, even for a rough
estimate. The following example shows that the rough estimate really works
in the case of independent item occurrences.

Generate test table where item occurrences are really independent.

test=# create table test as select ('{'||(select string_agg(s,',') from
(select case when (t*0 + random()) < 0.1 then i::text else null end from
generate_series(1,100) i) as x(s))||'}')::int[] AS val  from
generate_series(1,10000) t;

SELECT 10000

test=# analyze test;
ANALYZE

Do some test.

test=# explain analyze select * from test where val <@
array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];

  QUERY PLAN


--
 Seq Scan on test  (cost=0.00..239.00 rows=151 width=61) (actual
time=0.325..32.556 rows=163 loops=1
)
   Filter: (val <@
'{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
   Rows Removed by Filter: 9837
 Total runtime: 32.806 ms
(4 rows)

Delete DECHIST statistics.

test=# update pg_statistic set stakind1 = 0, staop1 = 0, stanumbers1 =
null, stavalues1 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind1 = 5;
UPDATE 0
test=# update pg_statistic set stakind2 = 0, staop2 = 0, stanumbers2 =
null, stavalues2 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind2 = 5;
UPDATE 0
test=# update pg_statistic set stakind3 = 0, staop3 = 0, stanumbers3 =
null, stavalues3 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind3 = 5;
UPDATE 0
test=# update pg_statistic set stakind4 = 0, staop4 = 0, stanumbers4 =
null, stavalues4 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind4 = 5;
UPDATE 1
test=# update pg_statistic set stakind5 = 0, staop5 = 0, stanumbers5 =
null, stavalues5 = null where starelid = (select oid from pg_class where
relname = 'test') and stakind5 = 5;
UPDATE 0

Do another test.

test=# explain analyze select * from test where val <@
array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60];

  QUERY PLAN


--
 Seq Scan on test  (cost=0.00..239.00 rows=148 width=61) (actual
time=0.332..32.952 rows=163 loops=1)
   Filter: (val <@
'{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[])
   Rows Removed by Filter: 9837
 Total runtime: 33.225 ms
(4 rows)

In this particular case the rough estimate is quite accurate, but in most
cases it behaves really badly. That is why I started to invent calc_distr
etc. So, I think returning DEFAULT_CONTAIN_SEL is OK unless we have some
better ideas.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-05 Thread Alexander Korotkov
On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 BTW, one other thing about the count histogram: seems like we are
 frequently generating uselessly large ones.  For instance, do ANALYZE
 in the regression database and then run

 select tablename,attname,elem_count_histogram from pg_stats
  where elem_count_histogram is not null;

 You get lots of entries that look like this:

  pg_proc | proallargtypes |
 {1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556}
  pg_proc | proargmodes|
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.6}
  pg_proc | proargnames|
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806}
  pg_proc | proconfig  |
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
  pg_class| reloptions |
 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}

 which seems to me to be a rather useless expenditure of space.
 Couldn't we reduce the histogram size when there aren't many
 different counts?

 It seems fairly obvious to me that we could bound the histogram
 size with (max count - min count + 1), but maybe something even
 tighter would work; or maybe I'm missing something and this would
 sacrifice accuracy.


True. If (max count - min count + 1) is small, enumerating the frequencies
is both a more compact and a more precise representation. At the same time,
if (max count - min count + 1) is large, we can run out of statistics_target
with such a representation. We could use the same representation of the
count distribution as for scalar column values, MCV and HISTOGRAM, but it
would require an additional statkind and statistics slot. Probably you have
better ideas?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Collect frequency statistics for arrays

2012-03-12 Thread Alexander Korotkov
On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  True. If (max count - min count + 1) is small, enumerating the frequencies
  is both a more compact and a more precise representation. At the same
  time, if (max count - min count + 1) is large, we can run out of
  statistics_target with such a representation. We could use the same
  representation of the count distribution as for scalar column values, MCV
  and HISTOGRAM, but it would require an additional statkind and statistics
  slot. Probably you have better ideas?

 I wasn't thinking of introducing two different representations,
 but just trimming the histogram length when it's larger than necessary.

 On reflection my idea above is wrong; for example assume that we have a
 column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
 what I said, we'd reduce the histogram to {1,2}, which might accurately
 capture the set of lengths present but fails to show that 1 is much more
 common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
 would capture the situation perfectly in one-tenth the space that the
 current logic does.

 More generally, by limiting the histogram to statistics_target entries,
 we are already accepting errors of up to 1/(2*statistics_target) in the
 accuracy of the bin-boundary representation.  What the above example
 shows is that sometimes we could meet the same accuracy requirement with
 fewer entries.  I'm not sure how this could be mechanized but it seems
 worth thinking about.


I can propose the following representation of the histogram.

If (max_count - min_count + 1) <= statistics_target then:
1) store max_count and min_count in stavalues;
2) store the frequencies from min_count to max_count in stanumbers.

If (max_count - min_count + 1) > statistics_target then store the histogram
in the current manner. I think in this case it's unlikely that there will be
many repeating values.
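
To illustrate with the 900-arrays-of-length-1 / 100-arrays-of-length-2
example upthread, the compact form would be (my own restatement):

    min_count = 1, max_count = 2, so max_count - min_count + 1 = 2
    stavalues  = {1, 2}      -- min_count and max_count
    stanumbers = {0.9, 0.1}  -- frequency of each length

instead of a statistics_target-sized histogram like {1,1,1,1,1,1,1,1,1,2}.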

I can propose a patch which changes the histogram representation to this.

Comments?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Incorrect behaviour when using a GiST index on points

2012-03-12 Thread Alexander Korotkov
I believe the attached version of the patch can be back-patched. It fixes
this problem without altering the index building procedure. It just makes
the checks in internal pages soft enough to compensate for the effect of the
gist_box_same implementation.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/access/gist/gistproc.c
--- b/src/backend/access/gist/gistproc.c
***
*** 70,76  gist_box_consistent(PG_FUNCTION_ARGS)
  
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
! 
  	/*
  	 * if entry is not leaf, use rtree_internal_consistent, else use
  	 * gist_box_leaf_consistent
--- 70,76 
  
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
! 	
  	/*
  	 * if entry is not leaf, use rtree_internal_consistent, else use
  	 * gist_box_leaf_consistent
***
*** 80,88  gist_box_consistent(PG_FUNCTION_ARGS)
  												 query,
  												 strategy));
  	else
! 		PG_RETURN_BOOL(rtree_internal_consistent(DatumGetBoxP(entry->key),
  												 query,
  												 strategy));
  }
  
  static void
--- 80,102 
  												 query,
  												 strategy));
  	else
! 	{
! 		/*
! 		 * A box in an internal page can be narrower than a box in a leaf
! 		 * page by up to EPSILON in each boundary. Apply the corresponding
! 		 * correction.
! 		 */
! 		BOX			key, *entrykey;
! 
! 		entrykey = DatumGetBoxP(entry->key);
! 		key.low.x = entrykey->low.x - EPSILON;
! 		key.low.y = entrykey->low.y - EPSILON;
! 		key.high.x = entrykey->high.x + EPSILON;
! 		key.high.y = entrykey->high.y + EPSILON;
! 
! 		PG_RETURN_BOOL(rtree_internal_consistent(&key,
  												 query,
  												 strategy));
+ 	}
  }
  
  static void
***
*** 847,852  gist_box_same(PG_FUNCTION_ARGS)
--- 861,871 
  	BOX		   *b2 = PG_GETARG_BOX_P(1);
  	bool	   *result = (bool *) PG_GETARG_POINTER(2);
  
+ 	/*
+ 	 * The box_same function allows a difference between boxes of up to
+ 	 * EPSILON. Thus a box in an internal page can be narrower than a box
+ 	 * in a leaf page by up to EPSILON in each boundary.
+ 	 */
  	if (b1 && b2)
  		*result = DatumGetBool(DirectFunctionCall2(box_same,
  												   PointerGetDatum(b1),
***
*** 1072,1077  gist_poly_consistent(PG_FUNCTION_ARGS)
--- 1091,1097 
  	GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
  	POLYGON    *query = PG_GETARG_POLYGON_P(1);
  	StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
+ 	BOX			key, *entrykey;
  
  	/* Oid		subtype = PG_GETARG_OID(3); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
***
*** 1083,1094  gist_poly_consistent(PG_FUNCTION_ARGS)
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
  
  	/*
  	 * Since the operators require recheck anyway, we can just use
  	 * rtree_internal_consistent even at leaf nodes.  (This works in part
  	 * because the index entries are bounding boxes not polygons.)
  	 */
! 	result = rtree_internal_consistent(DatumGetBoxP(entry->key),
  									   &(query->boundbox), strategy);
  
  	/* Avoid memory leak if supplied poly is toasted */
--- 1103,1128 
  	if (DatumGetBoxP(entry->key) == NULL || query == NULL)
  		PG_RETURN_BOOL(FALSE);
  
+ 	entrykey = DatumGetBoxP(entry->key);
+ 	if (!GIST_LEAF(entry))
+ 	{
+ 		/*
+ 		 * A box in an internal page can be narrower than a box in a leaf
+ 		 * page by up to EPSILON in each boundary. Apply the corresponding
+ 		 * correction.
+ 		 */
+ 		key.low.x = entrykey->low.x - EPSILON;
+ 		key.low.y = entrykey->low.y - EPSILON;
+ 		key.high.x = entrykey->high.x + EPSILON;
+ 		key.high.y = entrykey->high.y + EPSILON;
+ 		entrykey = &key;
+ 	}
+ 	
  	/*
  	 * Since the operators require recheck anyway, we can just use
  	 * rtree_internal_consistent even at leaf nodes.  (This works in part
  	 * because the index entries are bounding boxes not polygons.)
  	 */
! 	result = rtree_internal_consistent(entrykey,
  									   &(query->boundbox), strategy);
  
  	/* Avoid memory leak if supplied poly is toasted */
***
*** 1152,1158  gist_circle_consistent(PG_FUNCTION_ARGS)
  
  	/* Oid		subtype = PG_GETARG_OID(3); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
! 	BOX			bbox;
  	bool		result;
  
  	/* All cases served by this function are inexact */
--- 1186,1192 
  
  	/* Oid		subtype = PG_GETARG_OID(3); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
! 	BOX			bbox, *entrykey, key;
  	bool		result;
  
  	/* All cases served by this function are inexact */
***
*** 1170,1177  gist_circle_consistent(PG_FUNCTION_ARGS)
  	bbox.low.x = query->center.x - query->radius;
  	bbox.high.y = query->center.y + query->radius;
  	bbox.low.y = query->center.y - query->radius;
  
! 	result = rtree_internal_consistent(DatumGetBoxP(entry->key),
  									   &bbox, strategy);
  
  	PG_RETURN_BOOL(result);
--- 1204,1225 
  	bbox.low.x = query->center.x - query->radius;
  	bbox.high.y = query->center.y + query->radius;
  	bbox.low.y

[HACKERS] Wildcard search support for pg_trgm

2010-12-11 Thread Alexander Korotkov
Hackers,

Here is the first version of a patch which enables index support for
wildcard search in the pg_trgm contrib module. The idea of the patch is to
extract from the wildcard the trigrams which must occur in any matching
string. For example, for the '%sector%' wildcard such trigrams would be:
'sec', 'ect', 'tor'.

create table words (word text);
copy words from '/usr/share/dict/american-english';

test=# explain analyze select * from words where word ilike '%independ%';
  QUERY PLAN

--
 Seq Scan on words  (cost=0.00..1703.11 rows=10 width=9) (actual
time=18.818..174.146 rows=7 loops=1)
   Filter: (word ~~* '%independ%'::text)
 Total runtime: 174.200 ms
(3 rows)

CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
QUERY PLAN


--
 Bitmap Heap Scan on words  (cost=4.36..40.11 rows=10 width=9) (actual
time=2.445..2.529 rows=7 loops=1)
   Recheck Cond: (word ~~* '%independ%'::text)
   -  Bitmap Index Scan on trgm_idx  (cost=0.00..4.35 rows=10 width=0)
(actual time=2.406..2.406 rows=7 loops=1)
 Index Cond: (word ~~* '%independ%'::text)
 Total runtime: 2.612 ms
(5 rows)

CREATE INDEX trgm_idx ON words USING gin (word gin_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
QUERY PLAN


---
 Bitmap Heap Scan on words  (cost=76.08..111.83 rows=10 width=9) (actual
time=2.675..2.755 rows=7 loops=1)
   Recheck Cond: (word ~~* '%independ%'::text)
   -  Bitmap Index Scan on trgm_idx  (cost=0.00..76.07 rows=10 width=0)
(actual time=2.642..2.642 rows=7 loops=1)
 Index Cond: (word ~~* '%independ%'::text)
 Total runtime: 2.839 ms
(5 rows)

I've encountered the following problems:
1) Indexing support for ilike is possible only with case-insensitive
wildcards, i.e. when the IGNORECASE macro is enabled. But I can't use this
macro in pg_trgm.sql.in, where the list of operators is defined. Is it
enough to put a comment near IGNORECASE which tells that anyone who disables
this macro should also remove the operators from pg_trgm.sql.in?
2) I found the gist index not very useful with the default SIGLENINT = 3.
I've changed this value to 15, and with it the gist index performs very well
on a dictionary. But on longer strings greater values of SIGLENINT may be
required (probably even SIGLENINT > 122 will give benefit in some cases, in
spite of TOAST).
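
(For scale, from the macros below and assuming a 4-byte int:

    SIGLENINT =  3  =>  SIGLEN = 12 bytes,  SIGLENBIT =  95
    SIGLENINT = 15  =>  SIGLEN = 60 bytes,  SIGLENBIT = 479

so the change quadruples the signature size.)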


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  	OPERATOR	1	% (text, text),
  	OPERATOR	2	<-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	gtrgm_consistent (internal, text, int, oid, internal),
  	FUNCTION	2	gtrgm_union (bytea, internal),
  	FUNCTION	3	gtrgm_compress (internal),
***
*** 144,149  CREATE OPERATOR CLASS gin_trgm_ops
--- 146,153 
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
  	FUNCTION	3	gin_extract_trgm (text, internal, int2, internal, internal),
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 19,24 
--- 19,26 
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 55,61 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 91,101 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml

Re: [HACKERS] Wildcard search support for pg_trgm

2010-12-12 Thread Alexander Korotkov
I found another problem: GIN index search fails with "GIN indexes do not
support whole-index scans" when no trigram can be extracted from the pattern.


With best regards,
Alexander Korotkov.


Re: [HACKERS] Wildcard search support for pg_trgm

2010-12-12 Thread Alexander Korotkov
On Mon, Dec 13, 2010 at 12:14 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:

 How different (and better) is it from wildspeed?


The general advantage is the possibility of using wildcard search and trigram
similarity search with the same index.
I expect that a GIN trigram index is slightly less space demanding, but
slightly slower on search, than wildspeed. Also, I expect a GiST trigram
index to be slower on search, but faster on updates. However, I didn't check
these assumptions in detail.
I lack test datasets for sufficient testing, and I would like to ask the
community to help me with that, because testing on dictionaries is good but
obviously not enough.


With best regards,
Alexander Korotkov.


Re: [HACKERS] Wildcard search support for pg_trgm

2011-01-08 Thread Alexander Korotkov
=0x9a25ee8, altdest=0x9a25ee8, completionTag=0xbfcd9d9c ) at
pquery.c:791
#22 0x0825ec78 in exec_simple_query (query_string=value optimized out) at
postgres.c:1059
#23 0x0825fe01 in PostgresMain (argc=2, argv=0x99aeb68, username=0x99aead8
smagen)
at postgres.c:3939
#24 0x08229250 in BackendRun () at postmaster.c:3577
#25 BackendStartup () at postmaster.c:3262
#26 ServerLoop () at postmaster.c:1448
#27 0x0822be12 in PostmasterMain (argc=3, argv=0x99acc58) at
postmaster.c:1109
#28 0x081ce3b7 in main (argc=3, argv=0x99acc58) at main.c:199


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  	OPERATOR	1	% (text, text),
  	OPERATOR	2	<-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	gtrgm_consistent (internal, text, int, oid, internal),
  	FUNCTION	2	gtrgm_union (bytea, internal),
  	FUNCTION	3	gtrgm_compress (internal),
***
*** 129,140  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
--- 131,142 
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
***
*** 144,151  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_trgm (text, internal, int2, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal),
  	STORAGE		int4;
--- 146,155 
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
  	STORAGE		int4;
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 19,24 
--- 19,26 
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 55,61 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 91,101 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml(TRGM *trg1, TRGM *trg2);
+ bool		trgm_contain(TRGM *trg1, TRGM *trg2);
+ 
+ #define ISESCAPECHAR(x) (*(x) == '\\')
+ #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%')
  
  #endif /* __TRGM_H__ */
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 6,11 
--- 6,12 
  #include "trgm.h"
  
  #include "access/gin.h"
+ #include "access/skey.h"
  #include "access/itup.h"
  #include "access/tuptoaster.h"
  #include "storage/bufpage.h"
***
*** 16,21 
--- 17,25 
  PG_FUNCTION_INFO_V1(gin_extract_trgm);
  Datum		gin_extract_trgm(PG_FUNCTION_ARGS);
  
+ PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
+ Datum		gin_extract_query_trgm(PG_FUNCTION_ARGS);
+ 
  PG_FUNCTION_INFO_V1(gin_trgm_consistent);
  Datum		gin_trgm_consistent

Re: [HACKERS] Wildcard search support for pg_trgm

2011-01-17 Thread Alexander Korotkov
Hi,

Here is updated version of this patch.


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  	OPERATOR	1	% (text, text),
  	OPERATOR	2	<-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	gtrgm_consistent (internal, text, int, oid, internal),
  	FUNCTION	2	gtrgm_union (bytea, internal),
  	FUNCTION	3	gtrgm_compress (internal),
***
*** 129,140  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
--- 131,142 
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
***
*** 144,151  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_trgm (text, internal, int2, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal),
  	STORAGE		int4;
--- 146,155 
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
  	STORAGE		int4;
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 19,24 
--- 19,26 
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 55,61 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 91,101 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml(TRGM *trg1, TRGM *trg2);
+ bool		trgm_contain(TRGM *trg1, TRGM *trg2);
+ 
+ #define ISESCAPECHAR(x) (*(x) == '\\')
+ #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%')
  
  #endif /* __TRGM_H__ */
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 6,11 
--- 6,12 
  #include "trgm.h"
  
  #include "access/gin.h"
+ #include "access/skey.h"
  #include "access/itup.h"
  #include "access/tuptoaster.h"
  #include "storage/bufpage.h"
***
*** 16,21 
--- 17,25 
  PG_FUNCTION_INFO_V1(gin_extract_trgm);
  Datum		gin_extract_trgm(PG_FUNCTION_ARGS);
  
+ PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
+ Datum		gin_extract_query_trgm(PG_FUNCTION_ARGS);
+ 
  PG_FUNCTION_INFO_V1(gin_trgm_consistent);
  Datum		gin_trgm_consistent(PG_FUNCTION_ARGS);
  
***
*** 58,90  gin_extract_trgm(PG_FUNCTION_ARGS)
  }
  
  Datum
  gin_trgm_consistent(PG_FUNCTION_ARGS)
  {
  	bool	   *check = (bool *) PG_GETARG_POINTER(0);
! 	/* StrategyNumber strategy = PG_GETARG_UINT16(1); */
  	/* text	   *query = PG_GETARG_TEXT_P(2); */
! 	int32		nkeys = PG_GETARG_INT32(3);
! 	/* Pointer	   *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(5);
  	bool		res = FALSE;
  	int32		i

Re: [HACKERS] wildcard search support for pg_trgm

2011-01-24 Thread Alexander Korotkov
Hi!

On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański wulc...@wulczer.org wrote:

 I see two issues with this patch. First of them is the resulting index
 size. I created a table with 5 copies of
 /usr/share/dict/american-english in it and a gin index on it, using
 gin_trgm_ops. The results were:

  * relation size: 18MB
  * index size: 109 MB

 while without the patch the GIN index was 43 MB. I'm not really sure
 *why* this happens, as it's not obvious from reading the patch what
 exactly is this extra data that gets stored in the index, making it more
 than double its size.

Are you sure that you did the comparison correctly? The sequence of index
building and data insertion matters. I tried to build a gin index on 5
copies of /usr/share/dict/american-english with the patch and got a 43 MB
index size.


 That leads me to the second issue. The pg_trgm code is already woefully
 uncommented, and after spending quite some time reading it back and
 forth I have to admit that I don't really understand what the code does
 in the first place, and so I don't understand what does that patch
 change. I read all the changes in detail and I could't find any obvious
 mistakes like reading over array boundaries or dereferencing
 uninitialized pointers, but I can't tell if the patch is correct
 semantically. All test cases I threw at it work, though.

I'll try to write sufficient comments and send a new revision of the patch.


 This patch changes the names and signatures of some support functions
 for GIN, and I'm not sure how that affects binary compatibility and
 pg_upgrade. I tried to create an index with the vanilla source, and then
 recompile pg_trgm and reindex the table, but it still was not using the
 index. I think it's because it's missing entries in the catalogs about
 the index supporting the like strategy. How should this be handled?

This patch doesn't alter the structure of the index. It only adds strategies
for index scans. In order to update an existing index one should recreate the
operator class (which requires dropping the index). That can be done by
running uninstall_pg_trgm.sql and then pg_trgm.sql. After that a new index
can be created and it will support the like strategy. Although there is
actually no need to rebuild the index data itself, I don't see an easier way
to do this.


With best regards,
Alexander Korotkov.


Re: [HACKERS] wildcard search support for pg_trgm

2011-01-29 Thread Alexander Korotkov
Hello!

A new version of the patch is in the attachment. Some comments were added in
this version. Likely these comments need significant correction because of my
English.
Some notes about the gin interface functions. The extract-query and
extract-value functions were separated, because with wildcard search these
functions no longer do the same thing. New arguments were added to the SQL
declarations of the gin interface functions in order to make them conform to
the new gin interface. See the docs of the development version:
http://developer.postgresql.org/pgdocs/postgres/gin-extensibility.html.


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  	OPERATOR	1	% (text, text),
  	OPERATOR	2	<-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	gtrgm_consistent (internal, text, int, oid, internal),
  	FUNCTION	2	gtrgm_union (bytea, internal),
  	FUNCTION	3	gtrgm_compress (internal),
***
*** 129,140  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
--- 131,142 
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
***
*** 144,151  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_trgm (text, internal, int2, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal),
  	STORAGE		int4;
--- 146,155 
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
  	STORAGE		int4;
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 13,24 
--- 13,32 
  #define LPADDING		2
  #define RPADDING		1
  #define KEEPONLYALNUM
+ /*
+  * The IGNORECASE macro means that trigrams are case-insensitive. If this
+  * macro is disabled, then the ~~* operator should be excluded from the
+  * operator class, because we can't handle case-insensitive wildcard search
+  * with case-sensitive trigrams.
+  */
  #define IGNORECASE
  #define DIVUNION
  
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 61,67 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 97,107 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml(TRGM *trg1, TRGM *trg2);
+ bool		trgm_contain(TRGM *trg1, TRGM *trg2);
+ 
+ #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */
+ #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%')  /* Wildcard meta-character */
  
  #endif /* __TRGM_H__ */
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 6,11 
--- 6,12 
  #include trgm.h

Re: [HACKERS] wildcard search support for pg_trgm

2011-01-30 Thread Alexander Korotkov
Hi!

On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org wrote:

 I saw that the code tries to handle ILIKE searches, but apparently it's
 failing somewhere.

It was just a typo. Corrected version attached.


With best regards,
Alexander Korotkov.
*** a/contrib/pg_trgm/pg_trgm.sql.in
--- b/contrib/pg_trgm/pg_trgm.sql.in
***
*** 113,118  FOR TYPE text USING gist
--- 113,120 
  AS
  	OPERATOR	1	% (text, text),
  	OPERATOR	2	<-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	gtrgm_consistent (internal, text, int, oid, internal),
  	FUNCTION	2	gtrgm_union (bytea, internal),
  	FUNCTION	3	gtrgm_compress (internal),
***
*** 129,140  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
--- 131,142 
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
  RETURNS internal
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
  RETURNS bool
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
***
*** 144,151  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_trgm (text, internal, int2, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal),
  	STORAGE		int4;
--- 146,155 
  FOR TYPE text USING gin
  AS
  	OPERATOR	1	% (text, text),
+ 	OPERATOR	3	~~ (text, text),
+ 	OPERATOR	4	~~* (text, text),
  	FUNCTION	1	btint4cmp (int4, int4),
  	FUNCTION	2	gin_extract_trgm (text, internal),
! 	FUNCTION	3	gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
! 	FUNCTION	4	gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
  	STORAGE		int4;
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 13,24 
--- 13,32 
  #define LPADDING		2
  #define RPADDING		1
  #define KEEPONLYALNUM
+ /*
+  * The IGNORECASE macro means that trigrams are case-insensitive. If this
+  * macro is disabled, then the ~~* operator should be excluded from the
+  * operator class, because we can't handle case-insensitive wildcard search
+  * with case-sensitive trigrams.
+  */
  #define IGNORECASE
  #define DIVUNION
  
  /* operator strategy numbers */
  #define	SimilarityStrategyNumber	1
  #define	DistanceStrategyNumber		2
+ #define LikeStrategyNumber			3
+ #define ILikeStrategyNumber			4
  
  
  typedef char trgm[3];
***
*** 53,59  typedef struct
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  3			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
--- 61,67 
  
  /* gist */
  #define BITBYTE 8
! #define SIGLENINT  15			/* >122 => key will toast, so very slow!!! */
  #define SIGLEN	( sizeof(int)*SIGLENINT )
  
  #define SIGLENBIT (SIGLEN*BITBYTE - 1)	/* see makesign */
***
*** 89,94  typedef char *BITVECP;
--- 97,107 
  extern float4 trgm_limit;
  
  TRGM	   *generate_trgm(char *str, int slen);
+ TRGM	   *generate_wildcard_trgm(char *str, int slen);
  float4		cnt_sml(TRGM *trg1, TRGM *trg2);
+ bool		trgm_contain(TRGM *trg1, TRGM *trg2);
+ 
+ #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */
+ #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%')  /* Wildcard meta-character */
  
  #endif /* __TRGM_H__ */
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 6,11 
--- 6,12 
  #include "trgm.h"
  
  #include "access/gin.h"
+ #include "access/skey.h"
  #include "access/itup.h"
  #include "access/tuptoaster.h"
  #include "storage/bufpage.h"
***
*** 16,21 
--- 17,25 
  PG_FUNCTION_INFO_V1(gin_extract_trgm);
  Datum		gin_extract_trgm(PG_FUNCTION_ARGS);
  
+ PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
+ Datum

Re: [HACKERS] wildcard search support for pg_trgm

2011-02-01 Thread Alexander Korotkov
On Tue, Feb 1, 2011 at 5:40 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 AFAICT that would break on-disk compatibility of pg_trgm GIST indexes.
 I don't believe we have adequate evidence to justify doing that, and
 in any case it ought to be a separate patch rather than buried inside a
 mostly unrelated feature patch.

OK. Actually, I don't see a simple increase of SIGLENINT as the solution. I
believe we need to have it as an index parameter. I'll try to provide more
tests in order to motivate this.


With best regards,
Alexander Korotkov.


[HACKERS] WIP: store additional info in GIN index

2012-11-18 Thread Alexander Korotkov
Hackers,

The attached patch enables GIN to store additional information with the item
pointers in posting lists and trees.
Such additional information could be positions of words, positions of
trigrams, lengths of arrays and so on.
This is the first and largest patch of a series of GIN improvements which
were presented at PGConf.EU:
http://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf

The patch modifies the GIN interface as follows:
1) Two arguments are added to extractValue:
   Datum **addInfo, bool **addInfoIsNull
2) Two arguments are added to consistent:
   Datum addInfo[], bool addInfoIsNull[]
3) A new method, config, is introduced which returns the datatype oid of the
additional information (by analogy with the SP-GiST config method).

The patch completely changes the storage in posting lists and leaf pages of
posting trees. It uses varbyte encoding for BlockNumber and OffsetNumber.
BlockNumbers are stored incrementally within a page. Additionally, one bit of
OffsetNumber is reserved for the additional-information NULL flag. To be able
to find a position in a leaf data page quickly, the patch introduces a small
index at the end of the page.
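
To illustrate just the varbyte idea (a minimal sketch under assumed names --
the actual on-page layout, OffsetNumber handling and NULL-flag bit in the
patch are more involved), each delta between successive block numbers is
written as 7-bit groups, with the high bit of every byte flagging
continuation, so the common small deltas cost a single byte:

#include <stdint.h>
#include <stdio.h>

/* Encode v as varbyte: 7 data bits per byte, high bit = "more follows". */
static int
varbyte_encode(uint32_t v, unsigned char *out)
{
	int	n = 0;

	while (v > 0x7F)
	{
		out[n++] = (v & 0x7F) | 0x80;
		v >>= 7;
	}
	out[n++] = v;
	return n;
}

/* Decode one varbyte value, advancing *pos. */
static uint32_t
varbyte_decode(const unsigned char *in, int *pos)
{
	uint32_t	v = 0;
	int			shift = 0;
	unsigned char b;

	do
	{
		b = in[(*pos)++];
		v |= (uint32_t) (b & 0x7F) << shift;
		shift += 7;
	} while (b & 0x80);
	return v;
}

int
main(void)
{
	/* Store block numbers as deltas from the previous one. */
	uint32_t	blocks[] = {100, 101, 105, 4242};
	unsigned char buf[32];
	int			len = 0, pos = 0;
	uint32_t	prev = 0, cur = 0;

	for (int i = 0; i < 4; i++)
	{
		len += varbyte_encode(blocks[i] - prev, buf + len);
		prev = blocks[i];
	}
	printf("4 block numbers encoded in %d bytes\n", len);	/* 5 bytes */

	while (pos < len)
	{
		cur += varbyte_decode(buf, &pos);	/* undo the delta encoding */
		printf("%u\n", cur);
	}
	return 0;
}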

--
With best regards,
Alexander Korotkov.


ginaddinfo.1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_trgm partial-match

2012-11-18 Thread Alexander Korotkov
Hi!

On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao masao.fu...@gmail.com wrote:

 Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
 i.e., if query key contains multibyte characters. In this case, byte
 length of
 the trigram string might be larger than three, and its CRC is used as a
 trigram key instead of the trigram string itself. Because of using CRC, we
 cannot do a partial-match. Attached patch extends pg_trgm so that it
 compares a partial-match query key only when KEEPONLYALNUM is
 enabled.


Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
character is single-byte?

CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
test=# SELECT * FROM test WHERE val LIKE '%aa%';
 val
--
 aa
 aaa
 шaaш
(3 rows)
test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
 val
-
 aa
 aaa
(2 rows)

I think we can use partial match only for single-byte encodings. Or, at
most, in cases when all alphanumeric characters are single-byte (I have no
idea how to check this).

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] pg_trgm partial-match

2012-11-19 Thread Alexander Korotkov
On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov
aekorot...@gmail.com wrote:

 On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao masao.fu...@gmail.com wrote:

 Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
 i.e., if query key contains multibyte characters. In this case, byte
 length of
 the trigram string might be larger than three, and its CRC is used as a
 trigram key instead of the trigram string itself. Because of using CRC, we
 cannot do a partial-match. Attached patch extends pg_trgm so that it
 compares a partial-match query key only when KEEPONLYALNUM is
 enabled.


 Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
 character is single-byte?

 CREATE TABLE test (val TEXT);
 INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
 CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
 ANALYZE test;
 test=# SELECT * FROM test WHERE val LIKE '%aa%';
  val
 --
  aa
   aaa
  шaaш
 (3 rows)
 test=# set enable_seqscan = off;
 SET
 test=# SELECT * FROM test WHERE val LIKE '%aa%';
  val
 -
  aa
  aaa
 (2 rows)

 I think we can use partial match only for single-byte encodings. Or, at
 most, in cases when all alphanumeric characters are single-byte (I have no
 idea how to check this).


Actually, I was also fiddling with the idea of partial match on trigrams when
I was working on the initial LIKE patch. But I concluded that we would need a
separate opclass which always keeps the full trigram in the entry.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-11-19 Thread Alexander Korotkov
Hi!

A new version of the patch is attached. The changes are the following:
1) The right way to convert from pg_wchar to multibyte.
2) Optimization of producing the CNFA-like graph on trigrams (produce
smaller, but equivalent, graphs in less time).
3) Comments and refactoring.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-11-19 Thread Alexander Korotkov
Some quick comments.

On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

 6) It does not compile - I do get a bunch of errors like this

Fixed.

7) Once fixed, it seems to work

 CREATE EXTENSION pg_trgm ;
 CREATE TABLE TEST (val TEXT);
 INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,100) s(i);
 CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
 ANALYZE test;

 EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';


QUERY PLAN
 -
  Bitmap Heap Scan on test  (cost=16.77..385.16 rows=100 width=33)
Recheck Cond: (val ~ '.*qqq.*'::text)
-  Bitmap Index Scan on trgm_idx  (cost=0.00..16.75 rows=100
width=0)
  Index Cond: (val ~ '.*qqq.*'::text)
 (4 rows)

 but I do get a bunch of NOTICE messages with debugging info (no matter
 if the GIN index is used or not, so it's somewhere in the common regexp
 code). But I guess that's due to WIP status.

It's due to the TRGM_REGEXP_DEBUG macro. I disabled it by default. But I
think the pieces of code hidden by that macro could be useful for debugging
even after WIP status.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.3.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-11-20 Thread Alexander Korotkov
On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

 2) It's common to use upper-case names for macros, but trgm.h defines
macro iswordchr - I see it's moved from trgm_op.c but maybe we
could make it a bit more correct?

 3) I see there are two '#ifdef KEEPONLYALNUM' blocks right next to each
other in trgm.h - maybe it'd be a good idea to join them?

 4) The two new method prototypes at the end of trgm.h use different
indendation than the rest (spaces only instead of tabs).

These issues are fixed in the attached patch.
Additionally, 3 new macros are introduced: MAX_RESULT_STATES,
MAX_RESULT_ARCS, MAX_RESULT_PATHS. They limit resource usage during regex
processing.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.4.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-11-25 Thread Alexander Korotkov
On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 Glad to see this patch hasn't been totally forgotten. Being able to use
 indexes for regular expressions would be really cool!

 Back in January, I asked for some high-level description of how the
 algorithm works (http://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com).
 That's still sorely needed. Googling around, I found the slides for your
 presentation on this from PGConf.EU - it would be great to have the
 information from that presentation included in the patch.


A new version of the patch is attached. The changes are the following:
1) A big comment with a high-level description of what is going on.
2) Regression tests.
3) Documentation update.
4) Some more refactoring.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.5.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-11-25 Thread Alexander Korotkov
Hi!

On Wed, Nov 21, 2012 at 12:51 AM, Pavel Stehule pavel.steh...@gmail.com wrote:

 do you plan to support GiST?


At first, I would note that the pg_trgm GiST opclass is quite ridiculous for
supporting regex search (and, actually, for LIKE/ILIKE search, which is
already implemented too), because in the GiST opclass we store the set of
trigrams in leaf pages. It was designed for trigram similarity search and
makes sense there because it avoids recomputing the trigram set. But for
regex or LIKE/ILIKE search this representation is both lossy and bigger than
the original string itself. Perhaps we could think about another GiST opclass
focusing on regex and LIKE/ILIKE search?

However, anyway, I can create an additional patch for the current GiST
opclass.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-11-26 Thread Alexander Korotkov
On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 Great, that top-level comment helped tremendously! I feel enlightened.

 I fixed some spelling, formatting etc. trivial stuff while reading through
 the patch, see attached. Below is some feedback on the details:

 * I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an
 error, without propagating it further or rolling back the whole
 (sub)transaction. It might work in this case, as you're only suppressing
 errors with the special sqlcode that are used in the same file, but it
 nevertheless feels naughty. I believe none of the limits that are being
 checked are strict; it's OK to exceed the limits somewhat, as long as you
 terminate the processing in a reasonable time, in case of pathological
 input. I'd suggest putting an explicit check for the limits somewhere, and
 not rely on ereport(). Something like this, in the code that recurses:

 if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
     hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
 {
     trgmCNFA->overflowed = true;
     return;
 }


The PG_TRY/CATCH trick is replaced with a number of if/return checks. The
code becomes a bit more complicated, but your notes do matter.

And then check for the overflowed flag at the top level.

 * This part of the high-level comment was not clear to me:

   * States of the graph produced in the first stage are marked with
 keys. Key is a pair
  * of a prefix and a state of the original automaton. Prefix is a last
  * characters. So, knowing the prefix is enough to know what is a trigram
 when we read some new
  * character. However, we can know single character of prefix or don't
 know any
  * characters of it. Each state of resulting graph have an enter key
 (with that
  * key we've entered this state) and a set of keys which are reachable
 without
  * reading any predictable trigram. The algorithm of processing each state
  * of resulting graph are so:
  * 1) Add all keys which achievable without reading of any predictable
 trigram.
  * 2) Add outgoing arcs labeled with trigrams.
  * Step 2 leads to creation of new states and recursively processing
 them. So,
  * we use depth-first algorithm.


 I didn't understand that. Can you elaborate? It might help to work through
 an example, with some ascii art depicting the graph.


This comment is extended with an example.


 * It would be nice to add some comments to TrgmCNFA struct, explaining
 which fields are valid at which stages. For example, it seems that 'trgms'
 array is calculated only after building the CNFA, by getTrgmVector()
 function, while arcsCount is updated on the fly, while recursing in the
 getState() function.


TrgmCNFA comment is extended with this.


 * What is the representation used for the path matrix? Needs a comment.


Comments are added to PackedTrgmPaths and TrgmStatePath.


 * What do the getColorinfo()


The getColorInfo comment now references the ColorInfo struct, which has its
own comment.

and scanColorMap() functions do?


 The scanColorMap comment now has a description of the colormap structure.


 What exactly does a color represent?


This is added to the top comment.


 What's the tradeoff in choosing MAX_COLOR_CHARS?


Comment is added to the macro.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.6.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-11-30 Thread Alexander Korotkov
On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 One thing that bothers me with this algoritm is that the overflow
 mechanism is all-or-nothing. In many cases, even when there is a huge
 number of states in the diagram, you could still extract at least a few
 trigrams that must be present in any matching string, with little effort.
 At least, it seems like that to a human :-).

 For example, consider this:

 explain analyze select count(*) from azjunk4 where txt ~ ('^
 aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataau
 aavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnabo
 abpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaci
 acjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadc
 addadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadw
 adxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeq
 aeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafk
 aflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdage
 agfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagy
 agzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');

 you get a query plan like this (the long regexp string edited out):

  Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual
 time=131.100..131
 .101 rows=1 loops=1)
-  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1
 width=0) (
 actual time=131.096..131.096 rows=0 loops=1)
  Recheck Cond: (txt ~ ridiculously long regexp)
  Rows Removed by Index Recheck: 1
  -  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
 (cost=0.00..228144
 .01 rows=1 width=0) (actual time=82.914..82.914 rows=1 loops=1)
Index Cond: (txt ~ ridiculously long regexp)
  Total runtime: 131.230 ms
 (7 rows)

 That ridiculously long string exceeds the number of states (I think, could
 be number of paths or arcs too), and the algorithm gives up, resorting to
 scanning the whole index as can be seen by the Rows Removed by Index
 Recheck line. However, it's easy to see that any matching string must
 contain *any* of the possible trigrams the algorithm extracts. If it could
 safely return just a few of them, say aab and abz, and discard the
 rest, that would already be much better than a full index scan.

 Would it be safe to simply stop short the depth-first search on overflow,
 and proceed with the graph that was constructed up to that point?


For depth-first search it's not. But your proposal naturally makes sense.
I've changed it to breadth-first search, and then it's safe to mark all
unprocessed states as final on overflow. It means that we assume every
unprocessed branch immediately finishes with a match (this can give us
more false positives but no false negatives).
For overflow of the matrix collection, it's safe to just OR together all the
trigrams.
A new version of the patch is attached; a toy sketch of the overflow rule
follows below.
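
A toy model of that rule (purely illustrative -- the names and the fixed
example graph here are made up, not the patch's): states are expanded
breadth-first against a resource budget, and when the budget runs out every
state still waiting in the queue is simply marked final, trading extra false
positives for guaranteed absence of false negatives:

#include <stdbool.h>
#include <stdio.h>

#define NSTATES		 8
#define MAX_EXPANDED 4			/* artificial resource budget */

/* adjacency matrix of a small example graph, state 0 is initial */
static const bool arc[NSTATES][NSTATES] = {
	{0, 1, 1, 0, 0, 0, 0, 0},
	{0, 0, 0, 1, 1, 0, 0, 0},
	{0, 0, 0, 0, 1, 1, 0, 0},
	{0, 0, 0, 0, 0, 0, 1, 0},
	{0, 0, 0, 0, 0, 0, 1, 1},
	{0, 0, 0, 0, 0, 0, 0, 1},
	{0, 0, 0, 0, 0, 0, 0, 0},
	{0, 0, 0, 0, 0, 0, 0, 0},
};

int
main(void)
{
	int		queue[NSTATES], head = 0, tail = 0;
	bool	seen[NSTATES] = {false};
	bool	final[NSTATES] = {false};
	int		expanded = 0;

	queue[tail++] = 0;			/* start from the initial state */
	seen[0] = true;

	while (head < tail)
	{
		if (expanded >= MAX_EXPANDED)
		{
			/*
			 * Overflow: give up on the remaining queued states.  Marking
			 * them final assumes their branches match immediately -- more
			 * false positives, but never a false negative.
			 */
			while (head < tail)
				final[queue[head++]] = true;
			break;
		}

		int	s = queue[head++];

		expanded++;
		for (int t = 0; t < NSTATES; t++)
			if (arc[s][t] && !seen[t])
			{
				seen[t] = true;
				queue[tail++] = t;
			}
	}

	for (int s = 0; s < NSTATES; s++)
		if (final[s])
			printf("state %d forced final due to overflow\n", s);
	return 0;
}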

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.7.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-11-30 Thread Alexander Korotkov
Hi!

On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote:

 On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:

  trgm-regexp-0.6.patch.gz

 I ran the simple-minded tests against generated data (similar to the ones
 I did in January 2012).
 The problems of that older version seem pretty much all removed. (although
 I didn't do much work
 on it -- just reran these tests).


Thanks a lot for testing! Could you repeat it for the 0.7 version of the
patch, which has the new overflow handling?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-11-30 Thread Alexander Korotkov
On Fri, Nov 30, 2012 at 3:20 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 For depth-first search it's not.


Oh, I didn't explain it.
In order to stop graph processing we need to be sure that we have added all
outgoing arcs from a state, or else assume that state to be final. With DFS
we can be in the final part of producing the graph but still not have added
some arc (with a new trigram) from the initial state directly to the final
state. That obviously leads to false negatives.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-12-02 Thread Alexander Korotkov
On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 30.11.2012 13:20, Alexander Korotkov wrote:

 On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:


  Would it be safe to simply stop short the depth-first search on overflow,
 and proceed with the graph that was constructed up to that point?


 For depth-first search it's not. But your proposal naturally makes sense.
 I've changed it to breadth-first search, and then it's safe to mark all
 unprocessed states as final on overflow. It means that we assume every
 unprocessed branch immediately finishes with a match (this can give us
 more false positives but no false negatives).
 For overflow of the matrix collection, it's safe to just OR together all the
 trigrams.
 A new version of the patch is attached.


 Thanks, sounds good.

 I've spent quite a long time trying to understand the transformation the
 getState/addKeys/addAcrs functions do to the original CNFA graph. I think
 that still needs more comments to explain the steps involved in it.

 One thing that occurs to me is that it might be better to delay expanding
 colors to characters. You could build and transform the graph, and even
 create the paths, all while operating on colors. You would end up with
 lists of color trigrams, consisting of sequences of three colors that
 must appear in the source string. Only at the last step you would expand
 the color trigrams to real character trigrams. I think that would save a
 lot of processing while building the graph, if you have colors that contain
 many characters. It would allow us to do better than the fixed small
 MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current
 patch, it's a shame that we can't do anything with a fairly simple regexp
 like ^a[b-g]h$


Nice idea to delay expanding colors to characters! Obviously, we should delay
expanding only alphanumeric characters, because non-alphanumeric characters
influence the graph structure. Trying to implement it... (a sketch of the
final expansion step is below)

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-12-02 Thread Alexander Korotkov
On Sat, Dec 1, 2012 at 3:22 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Fri, November 30, 2012 12:22, Alexander Korotkov wrote:
  Hi!
 
  On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote:
 
  On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:
 
 
  I ran the simple-minded tests against generated data (similar to the
 ones
  I did in January 2012).
  The problems of that older version seem pretty much all removed.
 (although
  I didn't do much work
  on it -- just reran these tests).
 
 
  Thanks a lot for testing! Could you repeat it for the 0.7 version of the
  patch, which has the new overflow handling?
 

 I've attached a similar test re-run that compares HEAD with patch versions
 0.6, and 0.7.


Thanks! Did you write scripts for automated testing? It would be nice if you
shared them.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-12-03 Thread Alexander Korotkov
On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 On 02.12.2012 20:19, Tom Lane wrote:

  Alexander Korotkov aekorot...@gmail.com writes:

  Nice idea to delay expanding colors to characters! Obviously, we should
  delay expanding only alphanumeric characters, because non-alphanumeric
  characters influence the graph structure. Trying to implement it...


 Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
 care whether they represent alphanumerics or not.  (Or to be more
 precise, if there is a situation where it makes a difference, separate
 colors will have been created for each set of characters that need to be
 distinguished.)


 The regexp machinery doesn't care, but the trigrams that pg_trgm extracts
 only contain alphanumeric characters. So if by looking at the CNFA graph
 produced by the regexp machinery you conclude that any matching strings
 must contain three-letter sequences "%oo" and "#oo", you can just club them
 together into a " oo" trigram.

 I think you can run a pre-processing step to the colors, and merge colors
 that are equivalent as far as trigrams are considered. For example, if you
 have a color that contains only character '%', and another that contains
 character '#', you can treat them as the same color. You might then be
 able to simplify the CNFA. Actually, it would be even better if you could
 apply the pre-processing to the regexp before the regexp machinery turns it
 into a CNFA. Not sure how easy it would be to do such pre-processing.


Treating colors as the same should be possible only for colors which have no
alphanumeric characters, because colors are non-overlapping. However, this
optimization could be significant in some cases.

BTW, why create the path matrix? You could check the check array of
 trigrams in the consistent function directly against the graph. Consistent
 should return true, iff there is a path through the graph following only
 arcs that contain trigrams present in the check array. Finding a path
 through a complex graph could be expensive, O(|E|), but if the path is
 complex, the path matrix would be large as well, and checking against a
 large matrix isn't exactly free either. It would allow you to avoid
 overflows caused by having too many paths through the graph.


Actually, I generally dislike the path matrix for the same reasons. But:
1) Output graphs could contain trigrams which are completely useless for the
search. For example, for the regex /(abcdefgh)*ijk/ we need only the 'ijk'
trigram, while the graph would contain many more. The path matrix is a method
to get rid of all of them.
2) If we use color trigrams then we need some criterion for which color
trigrams to expand into character trigrams. Simultaneously, we shouldn't
allow a path from the initial state to the final one via unexpanded trigrams.
That seems much harder to do with the graph than with the matrix.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: store additional info in GIN index

2012-12-04 Thread Alexander Korotkov
Hi!

On Sun, Dec 2, 2012 at 5:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

 I've tried to apply the patch with the current HEAD, but I'm getting
 segfaults whenever VACUUM runs (either called directly or from autovac
 workers).

 The patch applied cleanly against 9b3ac49e and needed a minor fix when
 applied on HEAD (because of an assert added to ginRedoCreatePTree), but
 that shouldn't be a problem.


Thanks for testing! The patch is rebased against HEAD. The bug you reported
is fixed.

--
With best regards,
Alexander Korotkov.


ginaddinfo.2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for removng unused targets

2012-12-04 Thread Alexander Korotkov
On Mon, Dec 3, 2012 at 8:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes:
  Sorry for the delay.  I've reviewed the patch.  It was applied
  successfully, and it worked well for tests I did including the example
  you showed.  I think it's worth the work, but I'm not sure you go
  about it in the right way.  (I feel the patch decreases code
  readability more than it gives an advantage.)

 One thought here is that I don't particularly like adding a field like
 resorderbyonly to TargetEntry in the first place.  That makes this
 optimization the business of the parser, which it should not be; and
 furthermore makes it incumbent on the rewriter, as well as anything else
 that manipulates parsetrees, to maintain the flag correctly while
 rearranging queries.  It would be better if this were strictly the
 business of the planner.

 But having said that, I'm wondering (without having read the patch)
 why you need anything more than the existing resjunk field.


Actually, I don't know all the cases in which the resjunk flag is set. Is it
reliable to decide that a target is used only for ORDER BY if it's resjunk,
not a system column, and not used in grouping? If so, or if there are other
cases which are easy to determine, then I'll remove the resorderbyonly flag.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: store additional info in GIN index

2012-12-04 Thread Alexander Korotkov
On Tue, Dec 4, 2012 at 9:34 PM, Robert Haas robertmh...@gmail.com wrote:

 On Sun, Nov 18, 2012 at 4:54 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  Patch completely changes storage in posting lists and leaf pages of
 posting
  trees. It uses varbyte encoding for BlockNumber and OffsetNumber.
  BlockNumber are stored incremental in page. Additionally one bit of
  OffsetNumber is reserved for additional information NULL flag. To be
 able to
  find position in leaf data page quickly patch introduces small index in
 the
  end of page.

 This sounds like it means that this would break pg_upgrade, about
 which I'm not too keen.  Ideally, we'd like to have a situation where
 new indexes have additional capabilities, but old indexes are still
 usable for things that they could do before.  I am not sure whether
 that's a realistic goal.


This would mean having two versions of the code dealing with posting trees
and lists. It seems unlikely to me that we have the resources to maintain
that.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Patch for removng unused targets

2012-12-04 Thread Alexander Korotkov
On Tue, Dec 4, 2012 at 11:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alexander Korotkov aekorot...@gmail.com writes:
  On Mon, Dec 3, 2012 at 8:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  But having said that, I'm wondering (without having read the patch)
  why you need anything more than the existing resjunk field.

  Actually, I don't know all the cases when resjunk flag is set. Is it
  reliable to decide target to be used only for ORDER BY if it's
 resjunk
  and neither system or used in grouping? If it's so or there are some
 other
  cases which are easy to determine then I'll remove resorderbyonly flag.

 resjunk means that the target is not supposed to be output by the query.
 Since it's there at all, it's presumably referenced by ORDER BY or GROUP
 BY or DISTINCT ON, but the meaning of the flag doesn't depend on that.

 What you would need to do is verify that the target is resjunk and not
 used in any clause besides ORDER BY.  I have not read your patch, but
 I rather imagine that what you've got now is that the parser checks this
 and sets the new flag for consumption far downstream.  Why not just make
 the same check in the planner?

 A more invasive, but possibly cleaner in the long run, approach is to
 strip all resjunk targets from the query's tlist at the start of
 planning and only put them back if needed.

 BTW, when I looked at this a couple years ago, it seemed like the major
 problem was that the planner assumes that all plans for the query should
 emit the same tlist, and thus that tlist eval cost isn't a
 distinguishing factor.  Breaking that assumption seemed to require
 rather significant refactoring.  I never found the time to try to
 actually do it.


Maybe there is some way to not remove items from the tlist, but to avoid the
actual calculation?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: store additional info in GIN index

2012-12-05 Thread Alexander Korotkov
On Wed, Dec 5, 2012 at 1:56 AM, Tomas Vondra t...@fuzzy.cz wrote:

 On 4.12.2012 20:12, Alexander Korotkov wrote:
  Hi!
 
  On Sun, Dec 2, 2012 at 5:02 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
  I've tried to apply the patch with the current HEAD, but I'm getting
  segfaults whenever VACUUM runs (either called directly or from
 autovac
  workers).
 
  The patch applied cleanly against 9b3ac49e and needed a minor fix
 when
  applied on HEAD (because of an assert added to ginRedoCreatePTree),
 but
  that shouldn't be a problem.
 
 
  Thanks for testing! The patch is rebased against HEAD. The bug you reported
  is fixed.

 Applies fine, but I get a segfault in dataPlaceToPage at gindatapage.c.
 The whole backtrace is here: http://pastebin.com/YEPuWeuV

 The messages written into PostgreSQL log are quite variable - usually it
 looks like this:

 2012-12-04 22:31:08 CET 31839 LOG:  database system was not properly
 shut down; automatic recovery in progress
 2012-12-04 22:31:08 CET 31839 LOG:  redo starts at 0/68A76E48
 2012-12-04 22:31:08 CET 31839 LOG:  unexpected pageaddr 0/1BE64000 in
 log segment 00010069, offset 15089664
 2012-12-04 22:31:08 CET 31839 LOG:  redo done at 0/69E63638

 but I've seen this message too

 2012-12-04 22:20:29 CET 31709 LOG:  database system was not properly
 shut down; automatic recovery in progress
 2012-12-04 22:20:29 CET 31709 LOG:  redo starts at 0/AEAFAF8
 2012-12-04 22:20:29 CET 31709 LOG:  record with zero length at 0/C7D5698
 2012-12-04 22:20:29 CET 31709 LOG:  redo done at 0/C7D55E


 I wasn't able to prepare a simple testcase to reproduce this, so I've
 attached two files from my fun project where I noticed it. It's a
 simple DB + a bit of Python for indexing mbox archives inside Pg.

 - create.sql - a database structure with a bunch of GIN indexes on
tsvector columns on messages table

 - load.py - script for parsing mbox archives / loading them into the
 messages table (warning: it's a bit messy)


 Usage:

 1) create the DB structure
 $ createdb archives
 $ psql archives  create.sql

 2) fetch some archives (I consistently get SIGSEGV after first three)
 $ wget
 http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-01.gz
 $ wget
 http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-02.gz
 $ wget
 http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-03.gz

 3) gunzip and load them using the python script
 $ gunzip pgsql-hackers.*.gz
 $ ./load.py --db archives pgsql-hackers.*

 4) et voila - a SIGSEGV :-(


 I suspect this might be related to the fact that the load.py script uses
 savepoints quite heavily to handle UNIQUE_VIOLATION (duplicate messages).


Thanks for the bug report. It is fixed in the attached patch.

--
With best regards,
Alexander Korotkov.


ginaddinfo.3.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Statistics and selectivity estimation for ranges

2012-12-13 Thread Alexander Korotkov
Hi, Jeff!

Thanks a lot for review!

On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis pg...@j-davis.com wrote:

 It looks like there are still some problems with this patch.

   CREATE TABLE foo(ir int4range);
   insert into foo select 'empty' from generate_series(1,1);
   insert into foo select int4range(NULL, g, '(]')
 from generate_series(1,100) g;
   insert into foo select int4range(g, NULL, '[)')
 from generate_series(1,100) g;
   insert into foo select int4range(g, ((g*1.01)+10)::int4, '[]')
 from generate_series(1,100) g;
   CREATE TABLE bar(ir) AS select * from foo order by random();
   ANALYZE bar;

 Now:
   EXPLAIN ANALYZE SELECT * FROM bar
 WHERE ir @ int4range(1,2);

 The estimates are -nan. Similar for many other queries.


Oh, yeah! It appears that infinities require much more cautious handling than
I supposed. That should be fixed in the attached version of the patch.
However, it requires significant rethinking of the comments. I will update
the comments and address your questions in a couple of days. Could you
recheck whether the attached patch really fixes the problem you reported?

--
With best regards,
Alexander Korotkov.


range_stat-0.9.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gistchoose vs. bloat

2012-12-13 Thread Alexander Korotkov
Hi!

On Sat, Dec 8, 2012 at 7:05 PM, Andres Freund and...@2ndquadrant.com wrote:

 I notice there's no documentation about the new reloption at all?


Thanks for the notice! I've added a small description to the docs in the
attached patch.

--
With best regards,
Alexander Korotkov.


gist_choose_bloat-0.5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-12-13 Thread Alexander Korotkov
Hi!

On Sun, Nov 4, 2012 at 11:41 PM, Jeff Davis pg...@j-davis.com wrote:

 On Fri, 2012-11-02 at 12:47 +0400, Alexander Korotkov wrote:

  Right version of patch is attached.
 
 * In bounds_adjacent, there's no reason to flip the labels back.


Fixed.


 * Comment should indicate more explicitly that bounds_adjacent is
 sensitive to the argument order.


Fixed.


 * In bounds_adjacent, it appears that bound1 corresponds to B in the
 comment above, and bound2 corresponds to A in the comment above. I
 would have guessed from reading the comment that bound1 corresponded to
 A. We should just use consistent names between the comment and the code
 (e.g. boundA and boundB).


Fixed.


 * I could be getting confused, but I think that line 645 of
 rangetypes_spgist.c should be inverted (!bounds_adjacent(...)).


Good catch! Actually, the code was in direct contradiction with the comment. Fixed.
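
For reference, the adjacency being tested here is the SQL-level -|-
operator: two ranges are adjacent when they touch without overlapping.

test=# SELECT int4range(1,5) -|- int4range(5,8);
 ?column?
----------
 t
(1 row)

test=# SELECT int4range(1,5) -|- int4range(6,8);
 ?column?
----------
 f
(1 row)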


 * I think needPrevious should have an explanatory comment somewhere. It
 looks like you are using it to store some state as you descend the tree,
 but it doesn't look like it's used to reconstruct the value (because we
 already have the value anyway). Since it's being used for a purpose
 other than what's intended, that should be explained.


Yes, it's something of a hack now. A comment has been added.

--
With best regards,
Alexander Korotkov.


range_spgist_adjacent-0.4.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-12-13 Thread Alexander Korotkov
On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 Actually, I generally dislike the path matrix for the same reasons. But:
 1) Output graphs could contain trigrams which are completely useless for
 search. For example, for the regex /(abcdefgh)*ijk/ we need only the ijk
 trigram, while the graph would contain many more. The path matrix is a
 method to get rid of all of them.
 2) If we use color trigrams, then we need some criteria for deciding which
 color trigrams to expand into simple trigrams. Simultaneously, we shouldn't
 allow a path from the initial state to the final one through unexpanded
 trigrams. It seems much harder to do with a graph than with a matrix.


Now I have an idea for a simplification of the graph that is not
comprehensive, but simple and fast. I'm experimenting with it now. If it
succeeds, we could get rid of the path matrix.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] gistchoose vs. bloat

2012-12-14 Thread Alexander Korotkov
On Fri, Dec 14, 2012 at 12:46 PM, Jeff Davis pg...@j-davis.com wrote:

  Thanks for notice! I've added small description to docs in the
  attached patch.

 Here is an edited version of the documentation note. Please review to
 see if you like my version.


The edited version looks good to me.


 Also, I fixed a compiler warning.


Thanks!

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-12-16 Thread Alexander Korotkov
On Fri, Dec 14, 2012 at 1:34 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov 
 aekorot...@gmail.com wrote:

 Actually, I generally dislike the path matrix for the same reasons. But:
 1) Output graphs could contain trigrams which are completely useless for
 search. For example, for the regex /(abcdefgh)*ijk/ we need only the ijk
 trigram, while the graph would contain many more. The path matrix is a
 method to get rid of all of them.
 2) If we use color trigrams, then we need some criteria for deciding which
 color trigrams to expand into simple trigrams. Simultaneously, we shouldn't
 allow a path from the initial state to the final one through unexpanded
 trigrams. It seems much harder to do with a graph than with a matrix.


 Now I have an idea for a simplification of the graph that is not
 comprehensive, but simple and fast. I'm experimenting with it now. If it
 succeeds, we could get rid of the path matrix.


The attached patch has the following changes:
1) Postpone the expansion of colors. The graph is built on color trigrams.
2) Selectively expand color trigrams into simple trigrams. All non-expanded
color trigrams are removed. Such removal leads to a union of each pair of
states connected by the corresponding arcs. Naturally, this must not lead
to a union of the initial and final states: that would render all the
previous work senseless.
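
For reference, the intended end result, as a sketch assuming the patch is
applied and a words(word text) table exists (both names are illustrative):

CREATE EXTENSION pg_trgm;
CREATE INDEX words_trgm_idx ON words USING gin (word gin_trgm_ops);
-- with the patch, the planner can rewrite the regex into trigram
-- conditions and answer it through the GIN index:
EXPLAIN SELECT * FROM words WHERE word ~ 'x[aeiou]{0,2}q';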

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.8.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-12-17 Thread Alexander Korotkov
Hi!

On Mon, Dec 17, 2012 at 12:54 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:

  trgm-regexp-0.8.patch.gz   22 k

 Hi Alexander,

 I gave this a quick try; the patch works when compiled for DEBUG, but
 crashes as a
 'speed'-compiled binary:

 Compile for speed:

 $ pg_config --configure
 '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8'
 '--with-pgport=6556'
 '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

 $ psql
 psql
 (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
 Type help for help.

 testdb=# explain analyze select txt from azjunk5 where txt ~
 'x[aeiouy]{2,5}q';
 The connection to the server was lost. Attempting reset: Failed.
 ! \q


I didn't reproduce it yet. Can you retry it with this line uncommented:
#define TRGM_REGEXP_DEBUG
Then we can see at which stage it fails.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-12-17 Thread Alexander Korotkov
On Mon, Dec 17, 2012 at 1:16 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 I didn't reproduce it yet. Can you retry it with this line uncommented:
 #define TRGM_REGEXP_DEBUG
 Then we can see at which stage it fails.


The bug was found and is fixed in the attached patch.

--
With best regards,
Alexander Korotkov.


trgm-regexp-0.9.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: index support for regexp search

2012-12-18 Thread Alexander Korotkov
On Tue, Dec 18, 2012 at 11:45 AM, Erik Rijkers e...@xs4all.nl wrote:

 On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:
 I ran the same test again: HEAD versus trgm_regex v6, 7 and 9.  In v9
 there is some gain but also
 some regression.

 It remains a difficult problem...

 If I get some time in the holidays I'll try to diversify the test program;
 it is now too simple.


Note that regexes which contain {,n} are likely not what you expect.

test=# select 'xq' ~ 'x[aeiou]{,2}q';
 ?column?
----------
 f
(1 row)

test=# select 'xa{,2}q' ~ 'x[aeiou]{,2}q';
 ?column?
----------
 t
(1 row)

You should use {0,n} to express from 0 to n occurrences.
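
With an explicit lower bound the quantifier behaves as intended:

test=# select 'xq' ~ 'x[aeiou]{0,2}q';
 ?column?
----------
 t
(1 row)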

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: index support for regexp search

2012-12-18 Thread Alexander Korotkov
On Tue, Dec 18, 2012 at 12:51 PM, Erik Rijkers e...@xs4all.nl wrote:

 On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:
 
  You should use {0,n} to express from 0 to n occurrences.
 


 Thanks, but I know that of course.  It's a testing program; and in the end
 robustness with
 unexpected or even wrong input is as important as performance.  (to put it
 bluntly, I am also
 trying to get your patch to fall over ;-))


I found most of the regressions in the 0.9 version to be in the {,n} cases.
The new version of the patch uses more trigrams than the previous versions.
For example, for the regex 'x[aeiou]{,2}q':
In the 0.7 version we use the trigrams '__2', '_2_' and '__q'.
In the 0.9 version we use the trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_',
'__2', '_2_' and '__q'.

But actually, the trigram '__2' or '_2_' never occurs. It is enough to
require one of them; all the others just cause a slowdown. At the same
time, we can't reasonably decide which trigrams to use without knowing
their frequencies. For example, if the trigrams 'xa_', 'xe_', 'xi_', 'xo_',
'xu_' taken together were rarer than '__2', the newer version of the patch
would be faster.
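
For reference, the trigrams pg_trgm extracts from a plain string (including
the blank padding at word boundaries) can be inspected with show_trgm:

test=# SELECT show_trgm('xaq');
         show_trgm
---------------------------
 {"  x"," xa","aq ","xaq"}
(1 row)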


--
With best regards,
Alexander Korotkov.


Re: [HACKERS] WIP: store additional info in GIN index

2012-12-22 Thread Alexander Korotkov
Hi!

On Thu, Dec 6, 2012 at 5:44 AM, Tomas Vondra t...@fuzzy.cz wrote:

 Then I've run a simple benchmarking script, and the results are not as
 good as I expected; actually, I'm getting much worse performance than
 with the original GIN index.

 The following table contains the time of loading the data (not a big
 difference), and the number of queries per minute for various numbers of
 words in the query.

 The queries look like this:

 SELECT id FROM messages
  WHERE body_tsvector @@ plainto_tsquery('english', 'word1 word2 ...')

 so it's really the simplest form of FTS query possible.

            without patch | with patch
 -------------------------+-----------
 loading          750 sec |    770 sec
 1 word              1500 |       1100
 2 words            23000 |       9800
 3 words            24000 |       9700
 4 words            16000 |       7200

 I'm not saying this is a perfect benchmark, but the differences (of
 querying) are pretty huge. Not sure where this difference comes from,
 but it seems to be quite consistent (I usually get +-10% results, which
 is negligible considering the huge difference).

 Is this an expected behaviour that will be fixed by another patch?


Other patches which significantly accelerate index search will be provided.
This patch changes only the GIN posting list/tree storage. However, this
patch wasn't expected to change index scan speed significantly in either
direction.

The database contains ~680k messages from the mailing list archives,
 i.e. about 900 MB of data (in the table), and the GIN index on tsvector
 is about 900MB too. So the whole dataset nicely fits into memory (8GB
 RAM), and it seems to be completely CPU bound (no I/O activity at all).

 The configuration was exactly the same in both cases

 shared buffers = 1GB
 work mem = 64 MB
 maintenance work mem = 256 MB

 I can either upload the database somewhere, or provide the benchmarking
 script if needed.


Unfortunately, I can't reproduce such a huge slowdown on my test cases.
Could you share both the database and the benchmarking script?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Statistics and selectivity estimation for ranges

2013-01-04 Thread Alexander Korotkov
On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis pg...@j-davis.com wrote:

 And I have a few other questions/comments:

 * Why is "summ" spelled with two "m"s? Is it short for "summation"? If
 so, it might be good to use "summation of" instead of "integrate" in the
 comment.


Fixed.


 * Why does get_length_hist_frac return 0.0 when i is the last value? Is
 that a mistake?


The comment was wrong. Actually, it returns the fraction of ranges whose
length is *greater*.


 * I am still confused by the distinction between rbound_bsearch and
 rbound_bsearch_bin. What is the intuitive purpose of each?


I've added corresponding comments. rbound_bsearch is for scalar operators
and for finding the bin corresponding to an upper bound. rbound_bsearch_bin
is now rbound_bsearch_bin_lower; it is for finding the bin corresponding to
a lower bound.

* You use "constant value" in the comments in several places. Would
 "query value" or "search key" be better?


Yes. Fixed.

I also renamed get_length_hist_frac to get_length_hist_summ and rewrote the
comments about it. I hope it has become more understandable.

--
With best regards,
Alexander Korotkov.


range_stat-0.10.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

