[HACKERS] multibyte charater set in levenshtein function
Hackers,

The current version of the levenshtein function in the fuzzystrmatch contrib module doesn't work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           2
(1 row)

My patch makes this function work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           1
(1 row)

It also avoids a text_to_cstring call.

Regards,
Alexander Korotkov.

Attachment: fuzzystrmatch.diff.gz

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
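To see why the unpatched function reports 2 here: in UTF-8, 'ф' and 'а' are two bytes each, so a byte-oriented DP counts the single substituted character as two edits. A standalone sketch of the difference (unit costs; my own helper names, not the fuzzystrmatch code; the decoder assumes well-formed UTF-8 and inputs under 256 units):

```c
#include <assert.h>
#include <string.h>

/* Classic two-row Levenshtein with unit costs over arrays of "characters". */
static int
lev_units(const unsigned *s, int m, const unsigned *t, int n)
{
    int prev[256], curr[256], i, j;

    for (i = 0; i <= m; i++)
        prev[i] = i;
    for (j = 1; j <= n; j++)
    {
        curr[0] = j;
        for (i = 1; i <= m; i++)
        {
            int ins = prev[i] + 1;
            int del = curr[i - 1] + 1;
            int sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            int d = ins < del ? ins : del;

            curr[i] = d < sub ? d : sub;
        }
        memcpy(prev, curr, (m + 1) * sizeof(int));
    }
    return prev[m];
}

/* Byte-oriented view: what the unpatched function effectively computes. */
static int
lev_bytes(const char *s, const char *t)
{
    unsigned a[256], b[256];
    int m = 0, n = 0;

    while (*s)
        a[m++] = (unsigned char) *s++;
    while (*t)
        b[n++] = (unsigned char) *t++;
    return lev_units(a, m, b, n);
}

/* Decode well-formed UTF-8 into one unit per character (a stand-in for
 * walking the string with pg_mblen), then run the same DP. */
static int
utf8_units(const char *s, unsigned *out)
{
    int n = 0;

    while (*s)
    {
        unsigned char c = (unsigned char) *s;
        int len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        unsigned cp = 0;
        int k;

        for (k = 0; k < len; k++)
            cp = (cp << 8) | (unsigned char) s[k];
        out[n++] = cp;
        s += len;
    }
    return n;
}

/* Character-oriented view: one DP cell per character, not per byte. */
static int
lev_utf8(const char *s, const char *t)
{
    unsigned a[256], b[256];
    int m = utf8_units(s, a);
    int n = utf8_units(t, b);

    return lev_units(a, m, b, n);
}
```

lev_bytes('фыва', 'аыва') yields 2 while lev_utf8 yields 1, matching the before/after outputs above.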
Re: [HACKERS] multibyte charater set in levenshtein function
On Thu, May 13, 2010 at 6:03 AM, Alvaro Herrera alvhe...@commandprompt.com wrote:
> Well, since it's only used in one place, why are you defining a macro at all?

In order to structure the code better. My question was about something else: is memcmp a good choice for comparing very short sequences of bytes (from 1 to 4 bytes)?
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote:
> On a quick look, I didn't like the way you separated the
> pg_database_encoding_max_length() > 1 cases. There seems to be too much
> common code. Can that be refactored a bit better?

I did a little refactoring in order to avoid some similar code. I'm not quite sure about my CHAR_CMP macro. Is it a good idea?

Attachment: fuzzystrmatch-0.2.diff.gz
Re: [HACKERS] multibyte charater set in levenshtein function
Hello Hackers!

I have extended my patch by introducing a levenshtein_less_equal function. This function has an additional argument max_d and stops calculating once the distance exceeds max_d. With low values of max_d the function works much faster than the original one.

An example of original levenshtein function usage:

test=# select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=203.652..203.658 rows=5 loops=1)
   Sort Key: (levenshtein(word, 'consistent'::text))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=19.019..203.601 rows=5 loops=1)
         Filter: (levenshtein(word, 'consistent'::text) <= 2)
 Total runtime: 203.723 ms
(6 rows)

An example of levenshtein_less_equal usage in this case:

test=# select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=42.198..42.203 rows=5 loops=1)
   Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=5.391..42.143 rows=5 loops=1)
         Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
 Total runtime: 42.292 ms
(6 rows)

In the example above levenshtein_less_equal works about 5 times faster.

With best regards,
Alexander Korotkov.

Attachment: fuzzystrmatch-0.3.diff.gz
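The speedup comes from being able to abandon the computation early: the minimum of each DP row never decreases, so once a whole row exceeds max_d the final distance must too. A simplified sketch of that cutoff (unit costs, single-byte strings; the name lev_less_equal is mine, and the real patch tracks a [prev_left, prev_right] band rather than scanning full rows):

```c
#include <assert.h>
#include <string.h>

/*
 * Levenshtein distance with an upper bound: returns the exact distance if
 * it is <= max_d, and max_d + 1 otherwise.  Row minima in the DP are
 * non-decreasing, so once a whole row exceeds max_d we can stop early.
 */
static int
lev_less_equal(const char *s, const char *t, int max_d)
{
    int m = (int) strlen(s);
    int n = (int) strlen(t);
    int prev[256], curr[256], i, j;

    for (i = 0; i <= m; i++)
        prev[i] = i;
    for (j = 1; j <= n; j++)
    {
        int row_min;

        curr[0] = j;
        row_min = curr[0];
        for (i = 1; i <= m; i++)
        {
            int ins = prev[i] + 1;
            int del = curr[i - 1] + 1;
            int sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            int d = ins < del ? ins : del;

            curr[i] = d < sub ? d : sub;
            if (curr[i] < row_min)
                row_min = curr[i];
        }
        if (row_min > max_d)    /* cutoff: distance must exceed max_d */
            return max_d + 1;
        memcpy(prev, curr, (m + 1) * sizeof(int));
    }
    return prev[m] <= max_d ? prev[m] : max_d + 1;
}
```

With max_d = 2 the 'insistent' case above still returns the exact distance 2, while hopeless candidates bail out after a few rows instead of filling the whole matrix.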
[HACKERS] Parameters of GiST indexes
Hi hackers,

I found that some parameters of the GiST implementation are built into the code. For example, the following can be found in backend/utils/adt/tsgistidx.c:

#define SIGLENINT  31			/* >121 => key will toast, so it will not work
								 * !!! */
#define SIGLEN	( sizeof(int4) * SIGLENINT )
#define SIGLENBIT (SIGLEN * BITS_PER_BYTE)

I think that such parameters don't have an optimal value for all cases, and it would be a great option to let the user define such parameters for a particular index. For example, the following syntax could be used:

CREATE INDEX name ON table USING gist(column) WITH (SIGLENINT = 63);

With best regards,
Korotkov Alexander.
[HACKERS] Using multidimensional indexes in ordinal queries
[v1,v2,v3]) <@ cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float]) order by cube(array[v1,v2,v3]) <*> 4 limit 10;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..39.10 rows=10 width=28) (actual time=5.186..20.047 rows=10 loops=1)
   ->  Index Scan using test_cube_idx on test  (cost=0.00..39100.88 rows=1 width=28) (actual time=5.177..20.012 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500, 500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) <*> 4)
 Total runtime: 20.145 ms
(5 rows)

So, the result is the same, but the query runs about 150 times faster. I think that usage of a multidimensional index on ordinal data types can produce the following benefits:
1) It will be possible to create one multidimensional index instead of many unidimensional ones. (Of course, it will be slower on simple queries.)
2) On some complex queries (like the one noted above) it can produce a great performance benefit.

And now there is my question. How big a change in the planner is needed to make it choose a multidimensional index freely, without rewriting the query in a special manner? In my example I replaced "v1 between 480 and 500 and v2 between 480 and 500" with "cube(array[v1,v2,v3]) <@ cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float])", and "order by v3" with "order by cube(array[v1,v2,v3]) <*> 4", but it would be a great option if the planner could consider a plan with the GiST cube index for the original query.

With best regards,
Alexander Korotkov.

Attachment: cube.gz
Re: [HACKERS] Using multidimensional indexes in ordinal queries
On Mon, Jun 21, 2010 at 5:42 PM, Robert Haas robertmh...@gmail.com wrote:
> It seems like you can get more or less the same benefit from a multicolumn
> btree index. On my system, with the individual btree indices, the query
> ran in 7625 ms; with an additional index on (v1, v2, v3), it ran in 94 ms.
> I didn't get the same plans as Alexander did, though, so it may not really
> be apples to apples. See attached session trace.

The benefit of the multicolumn btree index was more or less the same as the cube benefit because of very bad picksplit behavior in this case. I attached a patch which significantly improves cube index search performance:

test=# explain (analyze, buffers) select * from test where cube(ARRAY[v1,v2,v3]) <@ cube(ARRAY[480,480,'-inf'::float8], ARRAY[500,500,'+inf'::float8]) order by cube(ARRAY[v1,v2,v3]) <*> 4 LIMIT 10;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..38.07 rows=10 width=28) (actual time=0.495..0.570 rows=10 loops=1)
   Buffers: shared hit=21
   ->  Index Scan using test_cube_idx on test  (cost=0.00..38064.52 rows=1 width=28) (actual time=0.489..0.537 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500, 500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) <*> 4)
         Buffers: shared hit=21
 Total runtime: 0.659 ms
(7 rows)

Now this patch greatly increases tree construction time, but I believe that a picksplit implementation that is good enough for both tree search and tree construction can be found.

> The trouble is that it's hard to think of a way of teaching the planner
> about these cases without hard-coding lots and lots of special-case
> kludges into the planner. Still, if someone has a clever idea...

I think that two things can be done to improve the situation:
1) Make knngist deal with negative values. I think this will make it easier to use knngist just for sorting, not only for k-neighbor searching.
2) Let GiST interface methods take care of multicolumn indexes. I think that if the cube index from the example above were constructed on separate columns v1, v2, v3, it would be easier for the planner to use the cube index for queries with filters on these columns. I don't know exactly how to do that.

Attachment: cube.gz
Re: [HACKERS] Using multidimensional indexes in ordinal queries
On Tue, Jun 22, 2010 at 1:58 AM, Robert Haas robertmh...@gmail.com wrote:
> It doesn't? I didn't think it was making any assumptions about the
> ordering data type beyond the fact that it had a default btree opclass.

Actually, the return type of the consistent method was replaced by float8. Negative values are used for the inconsistent state; non-negative values are used for consistency and ordering.
Re: [HACKERS] multibyte charater set in levenshtein function
Hi!

> * levenshtein_internal() and levenshtein_less_equal_internal() are very
> similar. Can you merge the code? We can always use less_equal_internal()
> if the overhead is ignorable. Did you compare them?

With a big value of max_d the overhead is significant. Here is an example on the american-english dictionary from OpenOffice:

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum
---------
 1386456
(1 row)

Time: 195,083 ms

test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from words;
   sum
---------
 1386456
(1 row)

Time: 317,821 ms

> * There are many if (!multibyte_encoding) in levenshtein_internal().
> How about splitting the function into two funcs for single-byte chars and
> multi-byte chars? (ex. levenshtein_internal_mb()) Or, we can always use
> the multi-byte version if the overhead is small.

The multi-byte version was about 4 times slower. But I have rewritten my CHAR_CMP macro as an inline function, and now it's only about 1.5 times slower. In a database with a multi-byte encoding:

test=# select * from words where levenshtein('qweqweqwe',word)=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with a single-byte encoding:

test2=# select * from words where levenshtein('qweqweqwe',word)=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway, I think the overhead is not ignorable. That's why I have split levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, and levenshtein_less_equal_internal into levenshtein_less_equal_internal and levenshtein_less_equal_internal_mb.

> * I prefer a struct rather than an array. 4 * m and 3 * m look like magic
> numbers for me. Could you name the entries with definition of a struct?

/*
 * For multibyte encoding we'll also store an array of lengths of
 * characters and an array with character offsets in the first string
 * in order to avoid a great number of pg_mblen calls.
 */
prev = (int *) palloc(4 * m * sizeof(int));

In this line of code the memory is allocated for 4 arrays: prev, curr, offsets, char_lens. So I have joined offsets and char_lens into a struct. But I can't join prev and curr because of this trick:

temp = curr;
curr = prev;
prev = temp;

> * There are some compiler warnings. Avoid them if possible.
> fuzzystrmatch.c: In function 'levenshtein_less_equal_internal':
> fuzzystrmatch.c:428: warning: 'char_lens' may be used uninitialized in this function
> fuzzystrmatch.c:428: warning: 'offsets' may be used uninitialized in this function
> fuzzystrmatch.c:430: warning: 'curr_right' may be used uninitialized in this function
> fuzzystrmatch.c: In function 'levenshtein_internal':
> fuzzystrmatch.c:222: warning: 'char_lens' may be used uninitialized in this function

Fixed.

> * Coding style: Use if (m == 0) instead of if (!m) when the type of 'm'
> is an integer.

Fixed.

> * Need to fix the caution in docs.
> http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
> | Caution: At present, fuzzystrmatch does not work well with
> | multi-byte encodings (such as UTF-8).
> but now levenshtein supports multi-byte encoding! We should mention which
> functions support mbchars so as not to confuse users.

I've updated this notification. Also I've added documentation for the levenshtein_less_equal function.

> * (Not an issue for the patch, but...) Could you rewrite PG_GETARG_TEXT_P,
> VARDATA, and VARSIZE to PG_GETARG_TEXT_PP, VARDATA_ANY, and
> VARSIZE_ANY_EXHDR? Unpacked versions make the core a bit faster.

Fixed.

With best regards,
Alexander Korotkov.

Attachment: fuzzystrmatch-0.4.diff.gz
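The joined allocation being discussed can be pictured with named fields instead of magic multiples of m. This is a standalone sketch with hypothetical names, not the patch code; mb_len stands in for pg_mblen and assumes well-formed UTF-8:

```c
#include <assert.h>

/* One entry per character of the first string: named fields instead of
 * "offsets live at prev + 2 * m" style magic offsets. */
typedef struct
{
    int offset;     /* byte offset of the character in the string */
    int length;     /* byte length of the character */
} CharLenOffset;

/* Crude stand-in for pg_mblen(): assumes well-formed UTF-8. */
static int
mb_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    return c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
}

/* Fill one struct per character so the DP loop never calls mb_len twice
 * for the same character. */
static void
fill_lengths_and_offsets(const char *s, CharLenOffset *lens_offs, int m)
{
    int i, pos = 0;

    for (i = 0; i < m; i++)
    {
        lens_offs[i].offset = pos;
        lens_offs[i].length = mb_len(s + pos);
        pos += lens_offs[i].length;
    }
}

/* Small accessors to keep the demo checkable. */
static int
char_offset(const char *s, int i)
{
    CharLenOffset tmp[64];

    fill_lengths_and_offsets(s, tmp, i + 1);
    return tmp[i].offset;
}

static int
char_length(const char *s, int i)
{
    CharLenOffset tmp[64];

    fill_lengths_and_offsets(s, tmp, i + 1);
    return tmp[i].length;
}
```

prev and curr, by contrast, must stay two plain interchangeable int arrays, precisely so the end-of-row pointer swap (temp = curr; curr = prev; prev = temp;) remains a constant-time operation.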
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote:
> This patch still needs some work. It includes a bunch of stylistic changes
> that aren't relevant to the purpose of the patch. There's no reason that I
> can see to change the existing levenshtein_internal function to take text
> arguments instead of char *, or to change !m to m == 0 in existing code,
> or to change the whitespace in the comments of that function. All of those
> need to be reverted before we can consider committing this.

I changed the arguments of the function from char * to text * in order to avoid a text_to_cstring call. The same benefit can be achieved by replacing char * with char * and length. I changed !m to m == 0 because Itagaki asked me to make it conform to the coding style. Do you think there is no reason to fix the coding style in existing code?

> There is a huge amount of duplicated code here. I think it makes sense to
> have the multibyte version of the function be separate, but perhaps we can
> merge the less-than-or-equal-to bits into the main code, so that we only
> have two copies instead of four. Perhaps we can just add a max_d argument
> to levenshtein_internal; if this value is >= 0 then it represents the max
> allowable distance, but if it is < 0 then there is no limit. Sure, that
> might slow down the existing code a bit, but it might not be significant.
> I'd at least like to see some numbers showing that it is significant
> before we go to this much trouble.

In that case we would need to add many checks of max_d to the levenshtein_internal function, which makes the code more complex. Actually, we can merge all four functions into one function. But such a function would have many checks for multibyte encoding and max_d. So, I see four cases here:
1) one function with checks for multibyte encoding and max_d
2) two functions with checks for multibyte encoding
3) two functions with checks for max_d
4) four separate functions
If you prefer case number 3, you should argue your position a little more.

> The code doesn't follow the project coding style. Braces must be
> uncuddled. Comment paragraphs will be reflowed unless they begin and end
> with --. Function definitions should have the type declaration on one
> line and the function name at the start of the next. Freeing memory with
> pfree is likely a waste of time; is there any reason not to just rely on
> the memory context reset, as the original coding did?

OK, I'll fix these things.

> I think we might need to remove the acknowledgments section from this
> code. If everyone who touches this code adds their name, we're quickly
> going to have a mess. If we're not going to remove the acknowledgments
> section, then please add my name, too, because I've already patched this
> code once...

In that case I think we can leave the original acknowledgments section.

With best regards,
Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
should be
 * increased by ins_c + del_c. In the case of movement backwards this
 * diagonal cell value should not be increased.
 * 2) The range of indexes of the previous row, where the values which are
 * not greater than max_d are stored, is [prev_left, prev_right]. So, only
 * the cells connected with this range should be calculated.
 * For multibyte encoding we should increase the x and y pointers by the
 * results of the pg_mblen function. Also we should use the CHAR_CMP macro
 * for character comparison.
 */
for (y = t, j = 1; j < n; y += y_char_len, j++)
{
	int		   *temp;

	y_char_len = pg_mblen(y);
	if (max_d >= 0)
	{
		prev_left = curr_left;
		prev_right = curr_right;
		curr_left = -1;
	}
	else
	{
		prev_left = 1;
	}
	if (delta >= 0)
		curr[0] = min_d + j * (ins_c + del_c);
	else
	{
		if (j <= -delta)
			curr[0] = min_d;
		else
			curr[0] = min_d + (j + delta) * (ins_c + del_c);
	}
	for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset; i < m;
		 x += lengths_and_offsets[i - 1].length, i++)
	{
		if (max_d >= 0)
		{
			d = max_d + 1;
			if (i <= prev_right)
			{
				d = Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0), d);
			}
			if (i == 1 || i > prev_left)
			{
				d = Min(curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0), d);
				d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
												y, y_char_len) ? 0 : sub_c), d);
			}
			curr[i] = d;
			if (curr_left == -1)
				curr_left = i;
			curr_right = i;
			if (i > prev_right)
				break;
		}
		else
		{
			d = Min(Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0),
						curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0)),
					prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
											y, y_char_len) ? 0 : sub_c));
			curr[i] = d;
		}
	}
	if (curr_left == -1)
		break;
	temp = curr;
	curr = prev;
	prev = temp;
}

/*
 * Because the final value was swapped from the previous row to the
 * current row, that's where we'll find it.
 */
d = prev[m - 1];

/*
 * If the last cell wasn't filled then return max_d + 1, otherwise
 * return the final value.
 */
if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
	return max_d + 1;
else
	return d;
}

I tested it with the american-english dictionary with 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum
---------
 1074376
(1 row)

Time: 131,435 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from words;
   sum
---------
 1074376
(1 row)

Time: 221,078 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from words;
   sum
---------
 1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't become faster than with just a big value of max_d.

With best regards,
Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
Such a version with macros and includes can look like this:

#ifdef MULTIBYTE
#define NEXT_X	(x += char_lens[i - 1])
#define NEXT_Y	(y += y_char_len)
#define CMP		(char_cmp(x, char_lens[i - 1], y, y_char_len))
#else
#define NEXT_X	(x++)
#define NEXT_Y	(y++)
#define CMP		(*x == *y)
#endif

static int
levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)
{
	int			m,
				n,
				d;
	int		   *prev;
	int		   *curr;
	int			i,
				j;
	const char *x;
	const char *y;
#ifdef MULTIBYTE
	int		   *char_lens;
	int			y_char_len;
#endif

	/*
	 * We should calculate the number of characters for multibyte encodings.
	 */
	m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));
	n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));

	/*
	 * We can transform an empty s into t with n insertions, or a non-empty t
	 * into an empty s with m deletions.
	 */
	if (m == 0)
		return n * ins_c;
	if (n == 0)
		return m * del_c;

	/*
	 * For security concerns, restrict excessive CPU+RAM usage. (This
	 * implementation uses O(m) memory and has O(mn) complexity.)
	 */
	if (m > MAX_LEVENSHTEIN_STRLEN ||
		n > MAX_LEVENSHTEIN_STRLEN)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("argument exceeds the maximum length of %d bytes",
						MAX_LEVENSHTEIN_STRLEN)));

	/* One more cell for initialization column and row. */
	++m;
	++n;

	/*
	 * Instead of building an (m+1)x(n+1) array, we'll use two different
	 * arrays of size m+1 for storing accumulated values. At each step one
	 * represents the previous row and one is the current row of the
	 * notional large array.
	 * For multibyte encoding we'll also store an array of lengths of the
	 * characters of the first string in order to avoid a great number of
	 * pg_mblen calls.
	 */
#ifdef MULTIBYTE
	prev = (int *) palloc(3 * m * sizeof(int));
	curr = prev + m;
	char_lens = prev + 2 * m;
	for (i = 0, x = VARDATA_ANY(s); i < m - 1; i++)
	{
		char_lens[i] = pg_mblen(x);
		x += char_lens[i];
	}
	char_lens[i] = 0;
#else
	prev = (int *) palloc(2 * m * sizeof(int));
	curr = prev + m;
#endif

	/* Initialize the "previous" row to 0..cols */
	for (i = 0; i < m; i++)
		prev[i] = i * del_c;

	/*
	 * For multibyte encoding we should increase the x and y pointers by the
	 * results of the pg_mblen function. Also we should use the CMP macro
	 * for character comparison.
	 */
	for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)
	{
		int		   *temp;

#ifdef MULTIBYTE
		y_char_len = pg_mblen(y);
#endif
		curr[0] = j * ins_c;

		for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)
		{
			int			ins;
			int			del;
			int			sub;

			ins = prev[i] + ins_c;
			del = curr[i - 1] + del_c;
			sub = prev[i - 1] + (CMP ? 0 : sub_c);
			curr[i] = Min(ins, del);
			curr[i] = Min(curr[i], sub);
		}

		temp = curr;
		curr = prev;
		prev = temp;
	}

	d = prev[m - 1];

	/*
	 * Because the final value was swapped from the previous row to the
	 * current row, that's where we'll find it.
	 */
	return d;
}

What do you think about it?

With best regards,
Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote:
> Ah, I see. That's pretty compelling, I guess. Although it still seems
> like a lot of code...

I think there is a way to merge the single-byte and multi-byte versions of the functions without loss in performance, using macros and includes (like in 'backend/utils/adt/like.c'). Do you think that is acceptable in this case?

With best regards,
Alexander Korotkov.
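The like.c approach compiles one shared body twice, with the per-character step and comparison swapped via macros. Below is a self-contained approximation of the same idea: it generates both variants from a single macro body rather than a second #include'd file, uses unit costs, and all names (levenshtein_sb, levenshtein_mb, mblen, sblen) are mine, not the patch's:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for pg_mblen(); assumes well-formed UTF-8. */
static int
mblen(const char *s)
{
    unsigned char c = (unsigned char) *s;

    return c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
}

/* Single-byte variant: every character is exactly one byte. */
static int
sblen(const char *s)
{
    (void) s;
    return 1;
}

/*
 * Shared DP body: only CHARLEN differs between the two generated
 * functions.  Character equality is "same length and same bytes".
 * Fixed 256-entry buffers keep the demo simple.
 */
#define DEFINE_LEVENSHTEIN(NAME, CHARLEN)                               \
static int                                                              \
NAME(const char *s, const char *t)                                      \
{                                                                       \
    int prev[256], curr[256], slen[256];                                \
    int m = 0, n = 0, i, j;                                             \
    const char *p, *y;                                                  \
                                                                        \
    for (p = s; *p; p += slen[m - 1])                                   \
    {                                                                   \
        slen[m] = CHARLEN(p);                                           \
        m++;                                                            \
    }                                                                   \
    for (p = t; *p; p += CHARLEN(p))                                    \
        n++;                                                            \
    for (i = 0; i <= m; i++)                                            \
        prev[i] = i;                                                    \
    for (y = t, j = 1; j <= n; y += CHARLEN(y), j++)                    \
    {                                                                   \
        const char *x = s;                                              \
        int ylen = CHARLEN(y);                                          \
                                                                        \
        curr[0] = j;                                                    \
        for (i = 1; i <= m; x += slen[i - 1], i++)                      \
        {                                                               \
            int match = (slen[i - 1] == ylen &&                         \
                         memcmp(x, y, ylen) == 0);                      \
            int ins = prev[i] + 1;                                      \
            int del = curr[i - 1] + 1;                                  \
            int sub = prev[i - 1] + (match ? 0 : 1);                    \
            int d = ins < del ? ins : del;                              \
                                                                        \
            curr[i] = d < sub ? d : sub;                                \
        }                                                               \
        memcpy(prev, curr, (m + 1) * sizeof(int));                      \
    }                                                                   \
    return prev[m];                                                     \
}

DEFINE_LEVENSHTEIN(levenshtein_sb, sblen)
DEFINE_LEVENSHTEIN(levenshtein_mb, mblen)
```

With the like.c-style layout the two instantiations would live in a shared .c fragment included twice under different macro definitions, but the generated code is the same: two tight loops with no per-iteration encoding checks.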
Re: [HACKERS] multibyte charater set in levenshtein function
Here is a new version of my patch. There are the following changes:

1) I've merged the single-byte and multibyte versions of levenshtein_internal and levenshtein_less_equal_internal using macros and includes.

2) I found that levenshtein takes reasonable time even for long strings. Here is an example with strings which contain random combinations of words:

test=# select count(*) from words;
 count
-------
 98569
(1 row)

test=# select * from words limit 10;
  id   |                                        word                                        | next_id
-------+------------------------------------------------------------------------------------+---------
   200 | Agnew diodes drilled composts Wassermann nonprofit adjusting Chautauqua           |   17370
  4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither                    |   70983
  5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal punctuality             |   96685
  6103 | G prompt boron overthrew cricking begonia snuffers blazed                          |   34518
  4707 | Djibouti Mari pencilling approves pointing's pashas magpie rams                    |   71200
 10125 | Lyle Edgardo livers gruelled capable cancels gnawing's inhospitable                |   28018
  9615 | Leos deputy evener yogis assault palatals Octobers surceased                       |   21878
 15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier Algeria              |   19316
 15776 | Sucre battle's misapprehension's predicating nutmegged electrification bowl planks |   65739
 16878 | Updike Forster ragout keenly exhalation grayed switch guava's                      |   43013
(10 rows)

Time: 26,125 ms
Time: 137,061 ms

test=# select avg(length(word)) from words;
         avg
---------------------
 74.6052207083362923
(1 row)

Time: 96,415 ms

test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher') <= 10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2957,426 ms

test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',10) <= 10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 84,755 ms

It takes O(max(n, m) * max_d / min(ins_c, del_c)) in the worst case. That's why I changed the limitation of this function to max_d <= 255 * min(ins_c, del_c).

3) I found that there is no need to store character offsets in levenshtein_less_equal_internal_mb. Instead, the first position in the string where the value of the distance wasn't greater than max_d can be stored.

4) I found that there is a theoretical maximum distance based on the string lengths. It represents the case when no characters match. If the theoretical maximum distance is less than the given maximum distance, we can use the theoretical maximum distance. It improves the behavior of levenshtein_less_equal with high max_d, but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',1000) <= 10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 4102,731 ms

test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher') <= 10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2983,244 ms

With best regards,
Alexander Korotkov.

Attachment: fuzzystrmatch-0.5.tar.gz
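The bound in point 4 follows from the worst case in which nothing matches: one can always substitute min(m, n) characters and insert or delete the remaining |m - n|. A sketch of that clamp under unit-cost assumptions, with hypothetical names (the patch's cost-weighted formula may differ):

```c
#include <assert.h>

/*
 * With unit costs, transforming any m-char string into any n-char string
 * never needs more than min(m, n) substitutions plus |m - n|
 * insertions/deletions, i.e. max(m, n) edits in total.
 */
static int
theoretical_max_distance(int m, int n)
{
    return m > n ? m : n;
}

/* Clamp the caller-supplied bound before running the bounded search, so a
 * huge max_d degrades gracefully toward the plain computation. */
static int
effective_max_d(int m, int n, int max_d)
{
    int t = theoretical_max_distance(m, n);

    return max_d < t ? max_d : t;
}
```

For the 8-word sample strings above (dozens of characters each), a caller passing max_d = 1000 would effectively search with the much smaller length-based bound.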
Re: [HACKERS] multibyte charater set in levenshtein function
I forgot the attribution in the levenshtein.c file.

With best regards,
Alexander Korotkov.

Attachment: fuzzystrmatch-0.5.1.tar.gz
Re: [HACKERS] knngist - 0.8
I think that queries like this:

select * from test where val <-> 500 < 1 order by val <-> 500;

can also be optimized using knngist. In the case of btree_gist this query can be easily rewritten:

select * from test where val > 499 and val < 501 order by val <-> 500;

But in pg_trgm it makes it possible to combine different similarity levels in one query. For example:

select * from test_trgm order by t <-> 'asdf' < 0.5 or t <-> 'qwer' < 0.4;

Is there any chance to handle this syntax also?

With best regards,
Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote:
> I reviewed this code in a fair amount of detail today and ended up
> rewriting it. In general terms, it's best to avoid changing things that
> are not relevant to the central purpose of the patch. This patch randomly
> adds a whole bunch of whitespace and removes a whole bunch of comments
> which I find useful and do not want to see removed. It took me about an
> hour to reverse out all of those changes, which then let me get down to
> the business of actually analyzing the code.

I'm sorry. This is my first patch, I will be more accurate next time. But I think there is no unified opinion about some of the changes, like the replacement of !m with m == 0.

I think this line is not correct:

if (m != s_bytes && n != t_bytes)

The correct negation of the assumption that both strings are single-byte is the assumption that at least one string is not single-byte. With this patch the levenshtein function can calculate the distance incorrectly:

test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)

This line should be rewritten as:

if (m != s_bytes || n != t_bytes)

> The attached patch takes the approach of using a fast-path for just the
> innermost loop when neither string contains any multi-byte characters
> (regardless of whether or not the encoding allows multi-byte characters).
> It's still slower than CVS HEAD, but for strings containing no multi-byte
> characters it's only marginally, if at all, slower than previous
> releases, because 9.0 doesn't contain your fix to avoid the extra string
> copy; and it's faster than your implementation even when multi-byte
> characters ARE present. It is slightly slower on single-byte encoding
> databases (I tested SQL_ASCII), but still faster than previous releases.
> It's also quite a bit less code duplication.

Can I look at the test which shows that your implementation is faster than mine when multi-byte characters are present? My tests show the reverse. For example, with the Russian dictionary from OpenOffice:

With my version of the patch:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 277,190 ms

With your version of the patch:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This function is very fast for long regions of memory, but for a few bytes an inline function is faster, because the compiler has more freedom for optimization. After replacing memcmp with an inline function like the following in your implementation:

static inline bool
char_cmp(const char *s, const char *d, int len)
{
	while (len--)
	{
		if (*s++ != *d++)
			return false;
	}
	return true;
}

the code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 241,272 ms

With best regards,
Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
Now I think the patch is as good as it can be. :) I'm going to prepare the less-or-equal function in the same manner as this patch.

With best regards,
Alexander Korotkov.
Re: [HACKERS] knngist - 0.8
In GiST the consistent method supports only filtering strategies. For such strategies the consistent method returns true if the subtree can contain a matching node and false otherwise. Knngist also introduces order-by strategies. For filtering strategies the knngist consistent method returns 0 if the subtree can contain a matching node and -1 otherwise. For order-by strategies the knngist consistent method returns the minimal possible distance in the subtree.

I think we can use the consistent method with order-by strategies not only for ordering but also for filtering. If the query contains an assertion that the distance is less than some value, then we can call the consistent method with the order-by strategy and compare the result with the query value in order to determine whether to scan the subtree. Such an approach can give a benefit when we need to filter by similarity. For example, in pg_trgm % is used for similarity filtering, but the similarity threshold is global for the session. That's why we can't create complex queries which contain similarity filtering with different thresholds.

With best regards,
Alexander Korotkov.

On Mon, Aug 2, 2010 at 8:14 PM, Robert Haas robertmh...@gmail.com wrote:
> 2010/7/29 Alexander Korotkov aekorot...@gmail.com:
>> But, in pg_trgm it makes it possible to combine different similarity
>> levels in one query. For example:
>> select * from test_trgm order by t <-> 'asdf' < 0.5 or t <-> 'qwer' < 0.4;
>> Is there any chance to handle this syntax also?
>
> Maybe I'm missing something, but I don't think that ORDER BY clause makes
> much sense. OR is going to produce a true or false value - and it's
> usually not that interesting to order by a column that can only take one
> of two values. Am I confused?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
Re: [HACKERS] knngist - 0.8
On Tue, Aug 10, 2010 at 1:35 AM, Euler Taveira de Oliveira eu...@timbira.com wrote:
> What do you mean by complex queries? You can always use the SET command.
> Sadly it doesn't work when you have different thresholds within distinct
> subqueries. (In pg_similarity I use this approach to set the function's
> thresholds).

I mean exactly different thresholds in distinct subqueries.

> What I am investigating is a way to build an index with some user-defined
> parameters. (We already have some infra-structure in reloptions for that
> but it needs some work to support my idea). I have some half-baked patch
> that I'm planning to submit to some of the CFs. Unfortunately, I don't
> have time for it ATM.

User-defined parameters for GiST would be a great feature. I'm performing some experiments with GiST and I'm really feeling the need for it.

With best regards,
Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
Here is the patch which adds levenshtein_less_equal function. I'm going to add it to current commitfest. With best regards, Alexander Korotkov. On Tue, Aug 3, 2010 at 3:23 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov aekorot...@gmail.com wrote: Now I think patch is as good as can be. :) OK, committed. I'm going to prepare less-or-equal function in same manner as this patch. Sounds good. Since we're now more than half-way through this CommitFest and this patch has undergone quite a bit of change from what we started out with, I'd like to ask that you submit the new patch for the next CommitFest, to begin September 15th. https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company *** a/contrib/fuzzystrmatch/fuzzystrmatch.c --- b/contrib/fuzzystrmatch/fuzzystrmatch.c *** *** 61,66 PG_MODULE_MAGIC; --- 61,68 */ extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS); extern Datum levenshtein(PG_FUNCTION_ARGS); + extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS); + extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS); extern Datum metaphone(PG_FUNCTION_ARGS); extern Datum soundex(PG_FUNCTION_ARGS); extern Datum difference(PG_FUNCTION_ARGS); *** *** 92,98 soundex_code(char letter) #define MAX_LEVENSHTEIN_STRLEN 255 static int levenshtein_internal(text *s, text *t, ! int ins_c, int del_c, int sub_c); /* --- 94,100 #define MAX_LEVENSHTEIN_STRLEN 255 static int levenshtein_internal(text *s, text *t, ! int ins_c, int del_c, int sub_c, int max_d); /* *** *** 202,211 rest_of_char_same(const char *s1, const char *s2, int len) * between supplied strings. Generally * (1, 1, 1) penalty costs suffices common * cases, but your mileage may vary. */ static int levenshtein_internal(text *s, text *t, ! int ins_c, int del_c, int sub_c) { int m, n, --- 204,217 * between supplied strings. 
Generally * (1, 1, 1) penalty costs suffices common * cases, but your mileage may vary. + * Returns accurate value if max_d < 0 or + * actual distance is less than or equal to + * max_d, otherwise returns value greater + * than max_d. */ static int levenshtein_internal(text *s, text *t, ! int ins_c, int del_c, int sub_c, int max_d) { int m, n, *** *** 219,224 levenshtein_internal(text *s, text *t, --- 225,233 const char *s_data; const char *t_data; const char *y; + const char *prev_x = NULL; + int curr_left = 0, curr_right = 0, prev_left, prev_right, d; + int delta = 0, min_d = 0, theor_max_d; /* Extract a pointer to the actual character data. */ s_data = VARDATA_ANY(s); *** *** 238,243 levenshtein_internal(text *s, text *t, --- 247,266 return n * ins_c; if (!n) return m * del_c; + + /* + * There is a theoretical maximum distance based on string lengths. It + * represents the case when no characters are matching. If max_d + * reaches this value then we can use accurate calculation of distance + * which is faster in this case. + */ + if (max_d >= 0) + { + theor_max_d = Min(m*del_c + n*ins_c, (m > n) ? + (n * sub_c + (m - n) * del_c):(m * sub_c + (n - m) * ins_c)) - 1; + if (max_d >= theor_max_d) + max_d = -1; + } /* * For security concerns, restrict excessive CPU+RAM usage. (This *** *** 251,256 levenshtein_internal(text *s, text *t, --- 274,296 MAX_LEVENSHTEIN_STRLEN))); /* + * We can find the minimal distance by the difference of string lengths.
+ */ + if (max_d >= 0) + { + delta = m - n; + if (delta > 0) + min_d = delta * del_c; + else if (delta < 0) + min_d = - delta * ins_c; + else + min_d = 0; + + if (min_d > max_d) + return max_d + 1; + } + + /* * In order to avoid calling pg_mblen() repeatedly on each character in s, * we cache all the lengths before starting the main loop -- but if all the * characters in both strings are single byte, then we skip this and use *** *** 286,369 levenshtein_internal(text *s, text *t, curr = prev + m; /* Initialize the previous row to 0..cols */ ! for (i = 0; i < m; i++) ! prev[i] = i * del_c; /* Loop through rows of the notional array */ for (y = t_data, j = 1; j < n; j++) { int *temp; - const char *x = s_data; int y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1; /* ! * First cell must increment sequentially, as we're on the j'th row of ! * the (m+1)x(n+1) array. ! */ ! curr[0] = j * ins_c; ! ! /* ! * This inner loop is critical to performance, so we include a * fast-path to handle
Re: [HACKERS] multibyte charater set in levenshtein function
протоколируются сострадательный отрясший вывалить браманизм наполниться агафоновна испольный подтаскивающий запруживавшийся трофимович перетаскиваемый обручавший процентщица передов вихрастый отволочённый дискотека пришей распасовывавший полиция возделавший трехглавый битва загазованный обовьетесь перехитривший инулин стелить недоброжелательность изотрёте пятиалтынный жабовидный щелочно дефицитность сиротиночка хлорбензол вгрызаться прокрашивать никарагуанец валентный понесённый перестегивание воспретительный переименовываемый таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий слепнул зонт уластить удобрение тормозивший ушибся ошибся мкс сейсмологический лесомелиорация рабовладельцем неудачный самовоспламенение карбидный круглопильный кубинцем подлакированный наблюдение исцарапавшийся издёргавший (10 rows) test=# select * from phrases2 where levenshtein('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный добряющий', a) = 10; a -- таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий (1 row) Time: 1291,318 ms test=# select * from phrases2 where levenshtein_less_equal('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный добряющий', a, 10) = 10; a -- таявший раскупорившийся передислоцируется юлианович праздничный лачужка присыхать отбеливший фехтовальный удобряющий (1 row) Time: 55,405 ms test=# select sum(levenshtein_less_equal('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный добряющий', a, 200)) from phrases; sum - 1091878 (1 row) Time: 674,734 ms test=# select sum(levenshtein('таяй раскупорившийся передислоцируется юлианович праздничный лачужка присыхать опппливший ффехтовальный добряющий', a)) from phrases; sum - 1091878 (1 row) Time: 673,515 ms With best regards, Alexander 
Korotkov.
Re: [HACKERS] Real-life range datasets
Hello, On Thu, Dec 22, 2011 at 12:51 PM, Benedikt Grundmann bgrundm...@janestreet.com wrote: I should be able to give you a table with the same characteristics as the instruments table but bogus data by replacing all entries in the table with random strings of the same length or something like that. I can probably take a little bit of time during this or the next week to generate such fake real world data ;-) Is there an ftp site to upload the gzipped pg_dump file to? Thank you very much for your response! I'm going to send you access details for the upload soon. - With best regards, Alexander Korotkov.
Re: [HACKERS] Collect frequency statistics for arrays
Hi! Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately it doesn't appear on my system. Can you check if this bug remains in the attached version of the patch? If so, please provide me with information about the system you're running (processor, OS etc.). -- With best regards, Alexander Korotkov. arrayanalyze-0.8.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Jan 4, 2012 at 12:33 AM, Noah Misch n...@leadboat.com wrote: On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote: Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately it doesn't appear on my system. Can you check if this bug remains in the attached version of the patch? If so, please provide me with information about the system you're running (processor, OS etc.). I get the same diagnostic from this version. Opteron processor, operating system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right? Oh, actually no. Thanks for the pointer. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collect frequency statistics for arrays
+ * distribution is negligible there. In the column <@ const case the summary + * frequency of elements is high (otherwise we have selectivity close to 0). What does the term summary frequency denote? I meant the sum of frequencies of the const array elements. + /* + * Rest is an average length of elements which aren't present in mcelem. + */ + rest = avg_length; You define rest here as an array length ... + + default_freq = Min(DEFAULT_CONT_SEL, minfreq / 2); + + mcelem_index = 0; + + /* + * mult is the multiplier that presents estimate of probability that each + * mcelem which is not present in constant doesn't occur. + */ + mult = 1.0f; + + for (i = 0; i < nitems; i++) + { + bool found = false; + + /* Comparison with previous value in order to guarantee uniqueness */ + if (i > 0) + { + if (!element_compare(array_data[i - 1], array_data[i])) + continue; + } + + /* + * Iterate over mcelem until we find an mcelem that is greater than or + * equal to the element of the constant. Simultaneously take care of + * rest and mult. If that mcelem is found then fill the corresponding + * elem_selec. + */ + while (mcelem_index < nmcelem) + { + int cmp = element_compare(mcelem[mcelem_index], array_data[i]); + + if (cmp < 0) + { + mult *= (1.0f - numbers[mcelem_index]); + rest -= numbers[mcelem_index]; ... But here, you're subtracting a frequency from an array length? Yes, because the average distinct element count is the sum of the frequencies of the elements. Subtracting the mcelem frequencies from avg_length we get the sum of frequencies of the non-mcelem elements. -- With best regards, Alexander Korotkov. arrayanalyze-0.9.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Collect frequency statistics for arrays
-eq_opr) + return (Selectivity) 0.5; Punting on other operators this way creates a plan quality regression for operations like const = ANY (column). Please do it some way that falls back on the somewhat-better existing scalararraysel() treatment for this. I've made calc_scalararraysel return -1 in this case or in the case of no comparison function. scalararraysel continues its work when calc_scalararraysel returns a negative value. + /* + * Calculate first n distinct element counts probabilities by histogram. We + * assume that any interval between a and b histogram values gives + * 1 / ((b - a + 1) * (nhist - 1)) probability to values between a and b and + * half of that to a and b. Returns total probability that distinct element + * count is less than or equal to n. + */ + static float + calc_hist(Datum *hist, int nhist, float *hist_part, int n) To test this function, I ran the following test case: set default_statistics_target = 4; create table t3 as select array(select * from generate_series(1, v)) as arr from (values (2),(2),(2),(3),(5),(5),(5)) v(v), generate_series(1,100); analyze t3; -- length_histogram_bounds = {2,2,5,5} select * from t3 where arr <@ array[6,7,8,9,10,11]; Using gdb to observe calc_hist()'s result during the last command: (gdb) p calc_hist(hist, nhist, hist_part, unique_nitems) $23 = 0.66687 (gdb) x/6f hist_part 0xcd4bc8: 0 0 0.33343 0 0xcd4bd8: 0 0.33343 I expected an equal, nonzero probability in hist_part[3] and hist_part[4] and a total probability of 1.0. + { + int k, + i = 0, + prev_interval = 0, + next_interval = 0; + float frac, + total = 0.0f; + + /* + * frac is a probability contribution by each interval between histogram + * values. We have nhist - 1 intervals. Contribution of one will be 1 / + * (nhist - 1). + */ + frac = 1.0f / ((float) (nhist - 1)); + for (k = 0; k <= n; k++) + { + int count = 0; + + /* Count occurrences of k distinct element counts in histogram.
*/ + while (i < nhist && DatumGetInt32(hist[i]) <= k) + { + if (DatumGetInt32(hist[i]) == k) + count++; + i++; + } + + if (count > 0) + { + float val; + + /* Find length between current histogram value and the next one */ + if (i < nhist) + next_interval = DatumGetInt32(hist[i + 1]) - Doesn't this read past the array end when i == nhist - 1? It was a bug. It also causes the wrong histogram calculation you noted above. Fixed. + stats->extra_data = extra_data->std_extra_data; + old_context = CurrentMemoryContext; + extra_data->std_compute_stats(stats, fetchfunc, samplerows, totalrows); + MemoryContextSwitchTo(old_context); Is the callee known to change CurrentMemoryContext and not restore it? Offhand, I'm not seeing how it could do so. Right. Saving of the memory context is not needed. Removed. *** a/src/include/catalog/pg_type.h --- b/src/include/catalog/pg_type.h This now updates all array types except record[]. I don't know offhand how to even make a non-empty value of type record[], let alone get it into a context where ANALYZE would see it. However, is there a particular reason to make that one different? Oh, I didn't update all array types in 2 tries :) Fixed. -- With best regards, Alexander Korotkov. arrayanalyze-0.11.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: The code badly needs comments. There is no explanation of how the trigram extraction code in trgm_regexp.c works. Sure. I hoped to find a time for comments before commitfest starts. Unfortunately I didn't, sorry. Guessing from the variable names, it seems to be some sort of a coloring algorithm that works on a graph, but that all needs to be explained. Can this algorithm be found somewhere in literature, perhaps? A link to a paper would be nice. I hope it's truly novel. At least application to regular expressions. I'm going to write a paper about it. Apart from that, the multibyte issue seems like the big one. Any way around that? Conversion of pg_wchar to multibyte character is the only way I found to avoid serious hacking of existing regexp code. Do you think additional function in pg_wchar_tbl which converts pg_wchar back to multibyte character is possible solution? -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
I also have a question about pg_wchar. /* *--- * encoding info table * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h) *--- */ pg_wchar_tbl pg_wchar_table[] = { {pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1}, /* PG_SQL_ASCII */ {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JP */ {pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2}, /* PG_EUC_CN */ {pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3}, /* PG_EUC_KR */ {pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4}, /* PG_EUC_TW */ {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */ {pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /* PG_UTF8 */ {pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4}, /* PG_MULE_INTERNAL */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN1 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN2 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN3 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN4 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN5 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN6 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN7 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN8 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN9 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN10 */ 
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1256 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1258 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN866 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN874 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8R */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1251 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1252 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-5 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-6 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-7 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-8 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1250 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1253 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1254 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1255 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1257 */ {pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8U */ {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */ {0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */ {0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */ {0, pg_uhc_mblen, pg_uhc_dsplen, 
pg_uhc_verifier, 2}, /* PG_UHC */ {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /* PG_GB18030 */ {0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */ {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /* PG_SHIFT_JIS_2004 */ }; What does last 7 zeros in the first column means? No conversion to pg_wchar is possible from these encodings? -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 1:07 AM, Alexander Korotkov aekorot...@gmail.com wrote: What do the last 7 zeros in the first column mean? No conversion to pg_wchar is possible from these encodings? Uh, I see. These encodings are not supported as server encodings. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
Hi! Thank you for your feedback! On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers e...@xs4all.nl wrote: The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a good enough job when guessing where to use the index and where to fall back to Seq Scan. This can lead to (also spectacular) slow-downs, compared to Seq Scan. Could you give some examples of regexes where the index scan becomes slower than a seq scan? I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that true? I can understand you want that value to be low to limit the above risk, but now it reduces the usability of the feature a bit: one has to split up larger char-classes into several smaller ones to make a statement use the index: i.e.: Yes, MAX_COLOR_CHARS is the maximum number of characters in an automaton color for which that color is still expanded into separate characters. And it's likely there could be a better solution than just having this hard limit. Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is normally necessary for performance reasons, but it would be useful to be able to compile a test version that allows it. I don't know how hard that would be. It seems that Ctrl-C was impossible because the trigram extraction procedure can take so long while it is not interruptible. It's not difficult to make this procedure interruptible, but actually it just shouldn't take so long. There is also a minor bug, I think, when running with 'set enable_seqscan=off' in combination with a too-large regex: Thanks for pointing that out. It will be fixed.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 8:45 PM, Marti Raudsepp ma...@juffo.org wrote: On Fri, Jan 20, 2012 at 01:33, Erik Rijkers e...@xs4all.nl wrote: Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is normally necessary for performance reasons, but it would be useful to be able to compile a test version that allows it. I believe being interruptible is a requirement for the patch to be accepted. CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops. Sure. It's easy to fix. But it seems that in this case the GIN extract_query method becomes slow (the index scan itself is interruptible). So, it just shouldn't run so long. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Jan 20, 2012 at 12:54 AM, Alexander Korotkov aekorot...@gmail.comwrote: On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Apart from that, the multibyte issue seems like the big one. Any way around that? Conversion of pg_wchar to multibyte character is the only way I found to avoid serious hacking of existing regexp code. Do you think additional function in pg_wchar_tbl which converts pg_wchar back to multibyte character is possible solution? Do you have any notes on it? I could make the patch which adds such function into core. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collect frequency statistics for arrays
Hi! An updated patch is attached. I've updated the comment of mcelem_array_contained_selec with a more detailed description of the probability distribution assumption. Also, I found that the rest behaviour is better described by a Poisson distribution; relevant changes were made. On Tue, Jan 17, 2012 at 2:33 PM, Noah Misch n...@leadboat.com wrote: By summary frequency of elements, do you mean literally P_0 + P_1 ... + P_N? If so, I can follow the above argument for column && const and column <@ const, but not for column @> const. For column @> const, selectivity cannot exceed the smallest frequency among const elements. A number of high-frequency elements will drive up the sum of the frequencies without changing the true selectivity much at all. Referring to the summary frequency is not really correct. It would be more correct to refer to the number of elements in const. When there are many elements in const, column @> const selectivity tends to be close to 0 and column <@ const tends to be close to 1. Surely, that's true when the elements have middling frequencies (not very close to 0 and not very close to 1). I've replaced the summary frequency of elements by the number of elements. -- With best regards, Alexander Korotkov. arrayanalyze-0.12.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Collect frequency statistics for arrays
On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote: + /* Take care about events with low probabilities. */ + if (rest > DEFAULT_CONTAIN_SEL) + { Why the change from rest > 0 to this in the latest version? Earlier, the addition of the rest distribution required O(m) time. Now there is a more accurate and proved estimate, but it takes O(m^2) time. It doesn't make the overall asymptotic time worse, but it is significant. That's why I decided to skip it for low values of rest which don't change the distribution significantly. + /* emit some statistics for debug purposes */ + elog(DEBUG3, "array: target # mces = %d, bucket width = %d, " + "# elements = %llu, hashtable size = %d, usable entries = %d", + num_mcelem, bucket_width, element_no, i, track_len); That should be UINT64_FMT. (I introduced that error in v0.10.) I've attached a new version that includes the UINT64_FMT fix, some edits of your newest comments, and a rerun of pgindent on the new files. I see no other issues precluding commit, so I am marking the patch Ready for Committer. Great! If I made any of the comments worse, please post another update. Changes look reasonable to me. Thanks! -- With best regards, Alexander Korotkov.
Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)
Hi! New version of patch is attached. On Thu, Dec 22, 2011 at 11:46 AM, Jeff Davis pg...@j-davis.com wrote: A few comments: * In range_gist_picksplit, it would be nice to have a little bit more intuitive description of what's going on with the nonEmptyCount and nonInfCount numbers. For instance, it appears to depend on the fact that a range must either be in nonEmptyCount or in nonInfCount. Also, can you describe the reason you're multiplying by two and taking the absolute value? It seems to work, but I am missing the intuition behind those operations. total_count - 2*nonEmptyCount = (total_count - nonEmptyCount) - nonEmptyCount = emptyCount - nonEmptyCount So, it's really not evident. I've simplified it. * The penalty function is fairly hard to read still. At a high level, I think we're trying to accomplish a few things (in order from most to least important): - Keep normal ranges separate. - Avoid broadening the class of the original predicate (e.g. turning single-infinite into double-infinite). - Avoid broadening (as determined by subtype_diff) the original predicate. - Favor adding ranges to narrower original predicates. Do you agree? If so, perhaps we can attack those more directly and it might be a little more readable. Additionally, the arbitrary numbers might become a problem. Can we choose better constants there? They would still be arbitrary when compared with real numbers derived from subtype_diff, but maybe we can still do better than what's there. I've changes some comments and add constants for penalty values. * Regarding the leftover common entries that can go to either side: what is the delta measure trying to accomplish? When I work through some examples, it seems to favor putting larger common ranges on the left (small delta) and smaller common ranges on the right (smaller delta). Why is that good? Or did I misread the code? 
Intuitively, I would think that we'd want to push the ranges with lower upper bounds to the left and higher lower bounds to the right -- in other words, recurse. Obviously, we'd need to make sure it terminated at some point, but splitting the common entries does seem like a smaller version of the original problem. Thoughts? That was a bug. Actually, no abs is needed. Indeed it doesn't affect result significantly. - With best regards, Alexander Korotkov. rangetypegist-0.6.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)
On Mon, Jan 30, 2012 at 1:39 AM, Jeff Davis pg...@j-davis.com wrote: Thank you for the updates. I have a small patch attached. The only code change I made was very minor: I changed the constants used in the penalty function because your version used INFINITE_BOUND_PENALTY when adding an empty range, and that didn't quite make sense to me. If I'm mistaken you can leave it as-is. I also attached range-gist-test.sql, which I used for a performance test. I mix various types of ranges together in a larger table of 1.1M tuples. And then I create a smaller table that only contains normal ranges and empty ranges. There are two tests: 1. Create an index on the big table 2. Do a range join (using overlaps rather than equals) where the smaller table is on the outer side of a nested loop join and an index scan over the larger table on the inner. The index creation time reduces by a small amount with the patch, from around 16s without the patch to around 13s with the patch. The query time, however, dropped from around 26s to around 14s! Almost 2x speedup with the patch! Moreover, looking at the loop timing in the explain analyze output, it goes from about 7..24 ms per loop down to about 1.5..13 ms per loop. That seems to indicate that the index distribution is better, with more queries returning quickly. So, great work Alexander! Very convincing results. Great! Thank you for reviewing this patch! Marking ready for committer, but please apply my comment fixes at your discretion. Patch with your comment fixes is attached. - With best regards, Alexander Korotkov. rangetypegist-0.7.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] spgist text_ops and LIKE
On Thu, Feb 2, 2012 at 10:21 PM, Robert Haas robertmh...@gmail.com wrote: I'm having trouble figuring out under what set of circumstances spgist is expected to be the best available alternative. It only supports a small subset of the data types that GiST does, so I suppose the point is that it should be faster for the cases that it does handle. And, considering that this is all brand-new code, the fact that it's almost keeping pace with btree on both pattern-matching and equality comparisons is certainly respectable -- but I so far haven't found any cases where it's a clear win. There's limited glory in being the almost-fastest way of indexing for a certain class of queries. I think that the current implementation of the suffix tree in spgist (in particular the implementation of the consistent methods) is far from revealing the full potential of suffix trees. I see at least two additional search types which can be efficiently evaluated using a suffix tree: 1) Search by prefix regular expression. For example, search for /^a(bc|d)e[fgh]/ in a single index scan. 2) Search by Levenshtein distance, i.e. SELECT * FROM tbl WHERE levenshtein(col, 'some constant') < k; using the index (surely the actual syntax for such an index scan would be another one). These are only the first two additional applications that come to my mind; there could be others. Unfortunately, I don't have enough time for it now. But I'd like to put my hands on this in the future. Admittedly, I haven't tried the point-in-box stuff yet. In short, I believe spgist for geometrical data types is more CPU-effective and less IO-effective than GiST. GiST has to call the consistent method for all index tuples of the page it uses in a scan, while spgist scans smaller nodes. Sure, comprehensive testing is required before drawing a final conclusion about that. - With best regards, Alexander Korotkov.
Re: [HACKERS] Fast GiST index build - further improvements
On Thu, Sep 8, 2011 at 8:35 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Now that the main GiST index build patch has been committed, there's a few further improvements that could make it much faster still: Better management of the buffer pages on disk. At the moment, the temporary file is used as a heap of pages belonging to all the buffers in random order. I think we could speed up the algorithm considerably by reading/writing the buffer pages sequentially. For example, when an internal page is split, and all the tuples in its buffer are relocated, that would be a great chance to write the new pages of the new buffers in sequential order, instead of writing them back to the pages freed up by the original buffer, which can be spread all around the temp file. I wonder if we could use a separate file for each buffer? Or at least, a separate file for all buffers that are larger than, say 100 MB in size. Better management of in-memory buffer pages. When we start emptying a buffer, we currently flush all the buffer pages in memory to the temporary file, to make room for new buffer pages. But that's a waste of time, if some of the pages we had in memory belonged to the buffer we're about to empty next, or that we empty tuples to. Also, if while emptying a buffer, all the tuples go to just one or two lower level buffers, it would be beneficial to keep more than one page in-memory for those buffers. Buffering leaf pages. This I posted on a separate thread already: http://archives.postgresql.**org/message-id/4E5350DB.** 3060...@enterprisedb.comhttp://archives.postgresql.org/message-id/4e5350db.3060...@enterprisedb.com Also, at the moment there is one issue with the algorithm that we have glossed over this far: For each buffer, we keep some information in memory, in the hash table, and in the auxiliary lists. That means that the amount of memory needed for the build scales with the size of the index. 
If you're dealing with very large indexes, hopefully you also have a lot of RAM in your system, so I don't think this is a problem in practice. Still, it would be nice to do something about that. A straightforward idea would be to swap some of the information to disk. Another idea, simpler to implement, would be to completely destroy a buffer, freeing all the memory it uses, when it becomes completely empty. Then, if you're about to run out of memory (as defined by maintenance_work_mem), you can empty some low-level buffers to disk to free some up. Unfortunately, I didn't have enough time to implement any of this before the 9.2 release; work on my PhD thesis and web projects takes too much time. But I think there is one thing we should fix before the 9.2 release: we assume that a GiST index build has at least effective_cache_size/4 of cache. This assumption could easily be false on highly concurrent systems. I don't see a way to make a convincing estimate here, but we could document this behaviour, so users could tune effective_cache_size for GiST index builds under high concurrency. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Tue, Feb 7, 2012 at 11:26 PM, Jay Levitt jay.lev...@gmail.com wrote: [*] psql:slowcube.sql:20: ERROR: node buffer of page being split (121550) does not exist This looks like a bug in the buffering GiST index build that I implemented during my GSoC 2011 project. It looks especially strange with the following setting: effective_cache_size = 3734MB, because the buffering GiST index build just shouldn't turn on in this case, when the index fits in cache. I'm going to take a detailed look at this. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
ISTM I found the problem. This piece of code is triggering the error. It assumes each page has a corresponding initialized buffer. That should be true, because we're inserting index tuples from the top down while splits propagate from the bottom up.

if (!found)
{
	/*
	 * Node buffer should exist at this point. If it didn't exist before,
	 * the insertion that caused the page to split should've created it.
	 */
	elog(ERROR, "node buffer of page being split (%u) does not exist",
		 blocknum);
}

But this assumption becomes false when we turn buffering off on the root page. So, the root page can produce pages without initialized buffers when it splits.

/*
 * Does specified level have buffers? (Beware of multiple evaluation of
 * arguments.)
 */
#define LEVEL_HAS_BUFFERS(nlevel, gfbb) \
	((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && \
	 (nlevel) != (gfbb)->rootitem->level)

So, I think we should just return silently from the function instead of triggering an error. Patch is attached. -- With best regards, Alexander Korotkov.

*** a/src/backend/access/gist/gistbuildbuffers.c
--- b/src/backend/access/gist/gistbuildbuffers.c
***************
*** 607,617 **** gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
  	if (!found)
  	{
  		/*
! 		 * Node buffer should exist at this point. If it didn't exist before,
! 		 * the insertion that caused the page to split should've created it.
  		 */
! 		elog(ERROR, "node buffer of page being split (%u) does not exist",
! 			 blocknum);
  	}

  	/*
--- 607,616 ----
  	if (!found)
  	{
  		/*
! 		 * Page without buffer could be produced by split of root page. So
! 		 * we've just nothing to do here when there is no buffer.
  		 */
! 		return;
  	}

  	/*

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Wed, Feb 15, 2012 at 2:54 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: ISTM I found the problem. This piece of code is triggering the error. It assumes each page has a corresponding initialized buffer. That should be true because we're inserting index tuples from the top down while splits propagate from the bottom up. But this assumption becomes false when we turn buffering off on the root page. So, the root page can produce pages without initialized buffers when it splits. Hmm ... can we tighten the error check rather than just remove it? It feels less than safe to assume that a hash-entry-not-found condition *must* reflect a corner-case situation like that. At the very least I'd like to see it verify that we'd turned off buffering before deciding this is OK. Better, would it be practical to make dummy entries in the hash table even after turning buffers off, so that the logic here becomes if (!found) error; else if (entry is dummy) return without doing anything; else proceed? regards, tom lane

Ok, here is another patch that fixes this problem. Instead of removing the error, it adds empty buffers on root page split when needed. -- With best regards, Alexander Korotkov.

*** a/src/backend/access/gist/gistbuild.c
--- b/src/backend/access/gist/gistbuild.c
***************
*** 668,677 **** gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
  	if (is_split && BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO)
  	{
  		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
! 		Page		page = BufferGetPage(buffer);
! 		ItemId		iid;
! 		IndexTuple	idxtuple;
! 		BlockNumber	leftmostchild;

  		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
  			gfbb->context, sizeof(GISTBufferingInsertStack));
--- 668,678 ----
  	if (is_split && BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO)
  	{
  		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
! 		Page		page = BufferGetPage(buffer);
! 		ItemId		iid;
! 		IndexTuple	idxtuple;
! 		BlockNumber	leftmostchild;
! 		OffsetNumber maxoff, i;

  		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
  			gfbb->context, sizeof(GISTBufferingInsertStack));
***************
*** 694,699 **** gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
--- 695,719 ----
  		oldroot->parent = gfbb->rootitem;
  		oldroot->blkno = leftmostchild;
  		oldroot->downlinkoffnum = InvalidOffsetNumber;
+ 
+ 		/*
+ 		 * If root page split produced new pages on a level which has
+ 		 * buffers, then initialize empty buffers there.
+ 		 */
+ 		if (LEVEL_HAS_BUFFERS(oldroot->level, gfbb))
+ 		{
+ 			maxoff = PageGetMaxOffsetNumber(page);
+ 			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+ 			{
+ 				iid = PageGetItemId(page, i);
+ 				idxtuple = (IndexTuple) PageGetItem(page, iid);
+ 				gistGetNodeBuffer(gfbb,
+ 								  buildstate->giststate,
+ 								  ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+ 								  i,
+ 								  gfbb->rootitem);
+ 			}
+ 		}
  	}

  	if (splitinfo)
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Wed, Feb 15, 2012 at 4:26 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Actually, I think it made sense to simply do nothing if the buffer doesn't exist. The algorithm doesn't require that all the buffers must exist at all times. It is quite accidental that whenever we call gistRelocateBuildBuffersOnSplit(), the page must already have its buffer created (and as we found out, the assumption doesn't hold after a root split, anyway). Also, we talked earlier that it would be good to destroy buffers that become completely empty, to save memory. If we do that, we'd have to remove that check anyway. So, I think we should go with your original fix and simply do nothing in gistRelocateBuildBuffersOnSplit() if the page doesn't have a buffer. Moreover, if the page has a buffer but it's empty, gistRelocateBuildBuffersOnSplit() doesn't need to create buffers for the new sibling pages. In the final emptying phase, that's a waste of time, the buffers we create will never be used, and even before that I think it's better to create the buffers lazily. I agree. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Designing an extension for feature-space similarity search
On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt jay.lev...@gmail.com wrote: - But a dimension might be in any domain, not just floats - The distance along each dimension is a domain-specific function What exact domains do you expect? Some domains could appear to be quite hard for index-based similarity search using GiST (for example, sets, strings etc.). -- With best regards, Alexander Korotkov.
Re: [HACKERS] Designing an extension for feature-space similarity search
On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt jay.lev...@gmail.com wrote: Tom Lane wrote: Jay Levitt jay.lev...@gmail.com writes: - Does KNN-GiST run into problems when <-> returns values that don't make sense in the physical world? If the indexed entities are records, it would be entirely your own business how you handled individual fields being NULL. This turns out to be a bit challenging. Let's say I'm building a nullable_point type that allows the Y axis to be NULL (or any sentinel value for missing data), where the semantics are NULL is infinitely far from the query. I'll need my GiST functions to return useful results with NULL - not just correct results, but results that help partition the tree nicely. At first I thought this posed a challenge for union; if I have these points: (1,2) (2,1) (1,NULL) what's the union? I think the answer is to treat NULL box coordinates like LL = -infinity, UR = infinity, or (equivalently, I think) to store a saw_nulls bit in addition to LL and UR. The real challenge is probably in picksplit and penalty - where in the tree should I stick (1,NULL)? - at which point you say Yes, algorithms for efficient indexes are hard work and computer-science-y and point me at surrogate splitters. Just thinking out loud, I guess; if other GiST types have addressed this problem, I'd love to hear about it. A similar problem appears in GiST indexing of ranges, because a range can be empty. There, an additional contain empty flag was introduced. This flag indicates that an underlying value can be empty; it is set on union with an empty range, or with another range that has this flag set. You likely need a similar flag for each dimension. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Designing an extension for feature-space similarity search
On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt jay.lev...@gmail.com wrote: Alexander Korotkov wrote: On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt jay.lev...@gmail.com wrote: At first I thought this posed a challenge for union; if I have these points: (1,2) (2,1) (1,NULL) what's the union? I think the answer is to treat NULL box coordinates like LL = -infinity, UR = infinity, or (equivalently, I think) to store a saw_nulls bit in addition to LL and UR. A similar problem appears in GiST indexing of ranges, because a range can be empty. There, an additional contain empty flag was introduced. This flag indicates that an underlying value can be empty; it is set on union with an empty range, or with another range that has this flag set. You likely need a similar flag for each dimension. Ah, yes, exactly the same problem. So what led you to add a flag instead of using the range NULL..NULL? I'm on the fence about choosing. First, range bounds can't be NULL :) Second, if we have a range (a;b) + contain empty in an internal page, both facts are useful for search: 1) all normal underlying ranges are contained in (a;b); 2) there can be empty underlying ranges. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Incorrect behaviour when using a GiST index on points
On Thu, Feb 16, 2012 at 11:43 AM, Alexander Korotkov aekorot...@gmail.com wrote: Described differences lead to incorrect behaviour of the GiST index. The question is: what is the correct way to fix it? Should on_pb also use the FP* macros, or should the consistent method behave like on_pb? Any comments on this? The current behaviour definitely indicates a bug, and I'm ready to fix it. The only question: is this bug in on_pb or in gist? -- With best regards, Alexander Korotkov.
Re: [HACKERS] Google Summer of Code? Call for mentors.
On Thu, Feb 16, 2012 at 12:56 PM, Dave Page dp...@pgadmin.org wrote: On Wed, Feb 15, 2012 at 10:18 PM, Josh Berkus j...@agliodbs.com wrote: Hackers, The call is now open for Google Summer of Code. If you are interested in being a GSoC mentor this summer, please reply to this email. I want to gauge whether or not we should participate this summer. I'll play, but I think I'm going to have to be very sure about any project before I agree to mentor it this year (not that there was anything wrong with my student last year - quite the opposite in fact - I just need to be sure I'm spending my time on the right things). FYI, I found that I'm eligible this year. So, if PostgreSQL participates this year, I'll make some proposals on indexing. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Wed, Feb 15, 2012 at 7:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: On Wed, Feb 15, 2012 at 4:26 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: So, I think we should go with your original fix and simply do nothing in gistRelocateBuildBuffersOnSplit() if the page doesn't have a buffer. I agree. OK, I won't object. So, I think we can just commit the first version of the fix now. -- With best regards, Alexander Korotkov.

*** a/src/backend/access/gist/gistbuildbuffers.c
--- b/src/backend/access/gist/gistbuildbuffers.c
***************
*** 607,617 **** gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
  	if (!found)
  	{
  		/*
! 		 * Node buffer should exist at this point. If it didn't exist before,
! 		 * the insertion that caused the page to split should've created it.
  		 */
! 		elog(ERROR, "node buffer of page being split (%u) does not exist",
! 			 blocknum);
  	}

  	/*
--- 607,616 ----
  	if (!found)
  	{
  		/*
! 		 * Page without buffer could be produced by split of root page. So
! 		 * we've just nothing to do here when there is no buffer.
  		 */
! 		return;
  	}

  	/*
Re: [HACKERS] Incorrect behaviour when using a GiST index on points
On Mon, Feb 20, 2012 at 7:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: On Thu, Feb 16, 2012 at 11:43 AM, Alexander Korotkov aekorot...@gmail.com wrote: Described differences lead to incorrect behaviour of the GiST index. The question is: what is the correct way to fix it? Should on_pb also use the FP* macros, or should the consistent method behave like on_pb? Any comments on this? The current behaviour definitely indicates a bug, and I'm ready to fix it. The only question: is this bug in on_pb or in gist? I'm inclined to think the right answer is to make on_pb use the FP* macros, for consistency with other geometric operators. But it's worth asking whether that will actually fix the problem. I've thought for some time that we'd eventually find cases where geo_ops' use of fuzzy comparisons breaks index behavior entirely, because it destroys natural assumptions like the transitive law. So that could eventually lead us to rip out the FP* macros everywhere. In any case, this doesn't seem like something we could back-patch; it'd be a behavioral change in HEAD only. Analyzing the GiST interface methods in more detail, I found one other place where fuzzy comparison is used: the gist_box_same function. And it's a really scary thing. It means that an entry of an internal page is not extended when an inserted index tuple extends it by less than EPSILON. Correspondingly, current GiST search behaviour is neither exact search nor fuzzy search with the given EPSILON; it is something in between. See the following example for proof.
test=# create table test(a box);
CREATE TABLE
test=# insert into test (select (case when i % 2 = 0 then box(point(1,1),point(1,1)) else box(point(2,2),point(2,2)) end) from generate_series(1,300) i);
INSERT 0 300
test=# create index test_idx on test using gist(a);
CREATE INDEX
test=# insert into test values (box(point(1.009,1.009), point(1.009,1.009)));
INSERT 0 1
test=# select * from test where a && box(point(1.018,1.018), point(1.018,1.018));
              a
-----------------------------
 (1.009,1.009),(1.009,1.009)
(1 row)
test=# set enable_seqscan = off;
SET
test=# select * from test where a && box(point(1.018,1.018), point(1.018,1.018));
 a
---
(0 rows)

But I believe we still can back-patch it in one of the following ways: 1) Fix the consistent and same functions. Add a notice to the release notes that users should rebuild their indexes if they want correct behaviour. 2) Leave the same function as is. Add kluges to the consistent functions which provide correct search on the current, not truly correct, trees. But anyway, +1 for ripping out the FP* macros everywhere in HEAD, because I really dislike the way the FP* macros are currently implemented. Why is EPSILON 1.0E-06? We don't know anything about the nature of the data that users store in geometrical datatypes. The smallest positive double precision value is about 5E-324. For some datasets EPSILON can easily cover the whole range. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Incorrect behaviour when using a GiST index on points
Attached patch fixes the GiST behaviour without altering the operators' behaviour. -- With best regards, Alexander Korotkov.

*** a/src/backend/access/gist/gistproc.c
--- b/src/backend/access/gist/gistproc.c
***************
*** 836,842 **** gist_box_picksplit(PG_FUNCTION_ARGS)
  }

  /*
!  * Equality method
   *
   * This is used for both boxes and points.
   */
--- 836,843 ----
  }

  /*
!  * Equality method. Returns true only when boxes are exactly the same. We
!  * can't ignore small extents because of index consistency.
   *
   * This is used for both boxes and points.
   */
***************
*** 848,856 **** gist_box_same(PG_FUNCTION_ARGS)
  	bool	   *result = (bool *) PG_GETARG_POINTER(2);

  	if (b1 && b2)
! 		*result = DatumGetBool(DirectFunctionCall2(box_same,
! 												   PointerGetDatum(b1),
! 												   PointerGetDatum(b2)));
  	else
  		*result = (b1 == NULL && b2 == NULL) ? TRUE : FALSE;
  	PG_RETURN_POINTER(result);
--- 849,857 ----
  	bool	   *result = (bool *) PG_GETARG_POINTER(2);

  	if (b1 && b2)
! 		*result = (b1->low.x == b2->low.x && b1->low.y == b2->low.y &&
! 				   b1->high.x == b2->high.x && b1->high.y == b2->high.y)
! 			? TRUE : FALSE;
  	else
  		*result = (b1 == NULL && b2 == NULL) ? TRUE : FALSE;
  	PG_RETURN_POINTER(result);
***************
*** 1326,1331 **** gist_point_consistent(PG_FUNCTION_ARGS)
--- 1327,1333 ----
  	bool	   *recheck = (bool *) PG_GETARG_POINTER(4);
  	bool		result;
  	StrategyNumber strategyGroup = strategy / GeoStrategyNumberOffset;
+ 	BOX		   *query, *key;

  	switch (strategyGroup)
  	{
***************
*** 1337,1348 **** gist_point_consistent(PG_FUNCTION_ARGS)
  			*recheck = false;
  			break;
  		case BoxStrategyNumberGroup:
! 			result = DatumGetBool(DirectFunctionCall5(
! 													  gist_box_consistent,
! 													  PointerGetDatum(entry),
! 													  PG_GETARG_DATUM(1),
! 													  Int16GetDatum(RTOverlapStrategyNumber),
! 													  0, PointerGetDatum(recheck)));
  			break;
  		case PolygonStrategyNumberGroup:
  		{
--- 1339,1356 ----
  			*recheck = false;
  			break;
  		case BoxStrategyNumberGroup:
! 			/*
! 			 * This code repeats the logic of on_pb, which uses simple
! 			 * comparison rather than FP* macros.
! 			 */
! 			query = PG_GETARG_BOX_P(1);
! 			key = DatumGetBoxP(entry->key);
! 
! 			*recheck = false;
! 			result = key->high.x >= query->low.x &&
! 					 key->low.x <= query->high.x &&
! 					 key->high.y >= query->low.y &&
! 					 key->low.y <= query->high.y;
  			break;
  		case PolygonStrategyNumberGroup:
  		{
Re: [HACKERS] Collect frequency statistics for arrays
On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote: I am starting to look at this patch now. I'm wondering exactly why the decision was made to continue storing btree-style statistics for arrays, in addition to the new stuff. The pg_statistic rows for array columns tend to be unreasonably wide already, and as-is this patch will make them quite a lot wider. I think it requires more than a little bit of evidence to continue storing stats that seem to have only small probability of usefulness. In particular, if we didn't store that stuff, we'd not need to widen the number of columns in pg_statistic, which would noticeably reduce both the footprint of the patch and the probability of breaking external code. Initially, I used existing slots for the new statistics, but I changed this after the first review: http://archives.postgresql.org/pgsql-hackers/2011-07/msg00780.php Probably btree statistics really do matter for some sorts of arrays? For example, arrays representing paths in a tree. We could request a subtree with a range query on such arrays. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collect frequency statistics for arrays
On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: On Thu, Mar 1, 2012 at 12:39 AM, Tom Lane t...@sss.pgh.pa.us wrote: I am starting to look at this patch now. I'm wondering exactly why the decision was made to continue storing btree-style statistics for arrays, Probably btree statistics really do matter for some sorts of arrays? For example, arrays representing paths in a tree. We could request a subtree with a range query on such arrays. That seems like a pretty narrow, uncommon use-case. Also, to get accurate stats for such queries that way, you'd need really enormous histograms. I doubt that the existing parameters for histogram size will permit meaningful estimation of more than the first array entry (since we don't make the histogram any larger than we do for a scalar column). The real point here is that the fact that we're storing btree-style stats for arrays is an accident, backed into by having added btree comparators for arrays plus analyze.c's habit of applying default scalar-oriented analysis functions to any type without an explicit typanalyze entry. I don't recall that we ever thought hard about it or showed that those stats were worth anything. OK, I don't object to removing btree stats from arrays. What do you think about the pg_stats view in this case? Should it combine the values histogram and the array length histogram in a single column, like is done for MCV and MCELEM? -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collect frequency statistics for arrays
On Thu, Mar 1, 2012 at 1:19 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Thu, Mar 1, 2012 at 1:09 AM, Tom Lane t...@sss.pgh.pa.us wrote: That seems like a pretty narrow, uncommon use-case. Also, to get accurate stats for such queries that way, you'd need really enormous histograms. I doubt that the existing parameters for histogram size will permit meaningful estimation of more than the first array entry (since we don't make the histogram any larger than we do for a scalar column). The real point here is that the fact that we're storing btree-style stats for arrays is an accident, backed into by having added btree comparators for arrays plus analyze.c's habit of applying default scalar-oriented analysis functions to any type without an explicit typanalyze entry. I don't recall that we ever thought hard about it or showed that those stats were worth anything. OK, I don't object to removing btree stats from arrays. What do you think about the pg_stats view in this case? Should it combine the values histogram and the array length histogram in a single column, like is done for MCV and MCELEM? Btree statistics for arrays and the additional statistics slot are removed in the attached version of the patch. The pg_stats view is untouched for now. -- With best regards, Alexander Korotkov. arrayanalyze-0.13.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Collect frequency statistics for arrays
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote: 1. I'm still unhappy about the loop that fills the count histogram, as I noted earlier today. It at least needs a decent comment and some overflow protection, and I'm not entirely convinced that it doesn't have more bugs than the overflow issue. Attached patch is focused on fixing this. The frac variable overflow is avoided by making it int64. I hope the comments clarify things. In general, this loop copies the behaviour of the histogram-constructing loop in the compute_scalar_stats function, but instead of an array of values we have an array of unique DECs (distinct element counts) with their frequencies. -- With best regards, Alexander Korotkov.

*** a/src/backend/utils/adt/array_typanalyze.c
--- b/src/backend/utils/adt/array_typanalyze.c
***************
*** 581,587 **** compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  	DECountItem **sorted_count_items;
  	int			count_item_index;
  	int			delta;
! 	int			frac;
  	float4	   *hist;

  	/* num_hist must be at least 2 for the loop below to work */
--- 581,587 ----
  	DECountItem **sorted_count_items;
  	int			count_item_index;
  	int			delta;
! 	int64		frac;
  	float4	   *hist;

  	/* num_hist must be at least 2 for the loop below to work */
***************
*** 612,633 **** compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
  	hist[num_hist] = (double) element_no / (double) nonnull_cnt;

  	/*
! 	 * Construct the histogram.
! 	 *
! 	 * XXX this needs work: frac could overflow, and it's not clear
! 	 * how or why the code works. Even if it does work, it needs
! 	 * documented.
  	 */
  	delta = analyzed_rows - 1;
  	count_item_index = 0;
! 	frac = sorted_count_items[0]->frequency * (num_hist - 1);
  	for (i = 0; i < num_hist; i++)
  	{
  		while (frac <= 0)
  		{
  			count_item_index++;
  			Assert(count_item_index < count_items_count);
! 			frac += sorted_count_items[count_item_index]->frequency * (num_hist - 1);
  		}
  		hist[i] = sorted_count_items[count_item_index]->count;
  		frac -= delta;
--- 612,642 ----
  	hist[num_hist] = (double) element_no / (double) nonnull_cnt;

  	/*
! 	 * Construct the histogram of DECs. The object of this loop is to copy
! 	 * the max and min DECs and evenly-spaced DECs in between ("space" here
! 	 * is the number of arrays corresponding to a DEC). If we imagine an
! 	 * ordered array of DECs where each input array has a corresponding DEC
! 	 * item, the i'th value of the histogram will be
! 	 * DECs[i * (analyzed_rows - 1) / (num_hist - 1)]. But instead of such
! 	 * an array we have sorted_count_items, which holds unique DEC values
! 	 * with their frequencies. The frac variable can be thought of as
! 	 * (index in DECs of the next sorted_count_items element - index in
! 	 * DECs of the last histogram value) * (num_hist - 1); a non-positive
! 	 * fraction tells us to advance over sorted_count_items.
! 	 */
  	delta = analyzed_rows - 1;
  	count_item_index = 0;
! 	frac = (int64) sorted_count_items[0]->frequency *
! 		(int64) (num_hist - 1);
  	for (i = 0; i < num_hist; i++)
  	{
  		while (frac <= 0)
  		{
  			count_item_index++;
  			Assert(count_item_index < count_items_count);
! 			frac += (int64) sorted_count_items[count_item_index]->frequency *
! 				(int64) (num_hist - 1);
  		}
  		hist[i] = sorted_count_items[count_item_index]->count;
  		frac -= delta;
Re: [HACKERS] Collect frequency statistics for arrays
On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane t...@sss.pgh.pa.us wrote: 2. The tests in the above-mentioned message show that in most cases where mcelem_array_contained_selec falls through to the rough estimate, the resulting rowcount estimate is just 1, ie we are coming out with very small selectivities. Although that path will now only be taken when there are no stats, it seems like we'd be better off to return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there must be something wrong with the rough estimate logic. Could you recheck that? I think the problem with the rough estimate is that the assumption of independent occurrences of items is very unsuitable, even for a rough estimate. The following example shows that the rough estimate really works in the case of independent occurrences of items. Generate a test table where item occurrences are really independent. test=# create table test as select ('{'||(select string_agg(s,',') from (select case when (t*0 + random()) < 0.1 then i::text else null end from generate_series(1,100) i) as x(s))||'}')::int[] AS val from generate_series(1,10000) t; SELECT 10000 test=# analyze test; ANALYZE Do some test. test=# explain analyze select * from test where val <@ array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60]; QUERY PLAN: Seq Scan on test (cost=0.00..239.00 rows=151 width=61) (actual time=0.325..32.556 rows=163 loops=1) Filter: (val <@ '{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[]) Rows Removed by Filter: 9837 Total runtime: 32.806 ms (4 rows) Delete DECHIST statistics. test=# update pg_statistic set stakind1 = 0, staop1 = 0, stanumbers1 = null, stavalues1 = null where starelid = (select oid from pg_class where relname = 'test') and stakind1 = 5; UPDATE 0 test=# update pg_statistic set stakind2 = 0, staop2 = 0, stanumbers2 = null, stavalues2 = null where starelid = (select oid from pg_class where relname = 'test') and stakind2 = 5; UPDATE 0 test=# update pg_statistic set stakind3 = 0, staop3 = 0, stanumbers3 = null, stavalues3 = null where starelid = (select oid from pg_class where relname = 'test') and stakind3 = 5; UPDATE 0 test=# update pg_statistic set stakind4 = 0, staop4 = 0, stanumbers4 = null, stavalues4 = null where starelid = (select oid from pg_class where relname = 'test') and stakind4 = 5; UPDATE 1 test=# update pg_statistic set stakind5 = 0, staop5 = 0, stanumbers5 = null, stavalues5 = null where starelid = (select oid from pg_class where relname = 'test') and stakind5 = 5; UPDATE 0 Do another test. test=# explain analyze select * from test where val <@ array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60]; QUERY PLAN: Seq Scan on test (cost=0.00..239.00 rows=148 width=61) (actual time=0.332..32.952 rows=163 loops=1) Filter: (val <@ '{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60}'::integer[]) Rows Removed by Filter: 9837 Total runtime: 33.225 ms (4 rows) In this particular case the rough estimate is quite accurate. But in most cases it behaves really badly. That is why I started to invent calc_distr etc. So, I think returning DEFAULT_CONTAIN_SEL is OK unless we have better ideas. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collect frequency statistics for arrays
On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, one other thing about the count histogram: seems like we are frequently generating uselessly large ones. For instance, do ANALYZE in the regression database and then run select tablename,attname,elem_count_histogram from pg_stats where elem_count_histogram is not null; You get lots of entries that look like this: pg_proc | proallargtypes | {1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,6,6,2.80556} pg_proc | proargmodes| {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.6} pg_proc | proargnames| {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,8,8,8,14,14,15,16,3.8806} pg_proc | proconfig | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} pg_class| reloptions | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} which seems to me to be a rather useless expenditure of space. Couldn't we reduce the histogram size when there aren't many different counts? It seems fairly obvious to me that we could bound the histogram size with (max count - min count + 1), but maybe something even tighter would work; or maybe I'm missing something and this would sacrifice accuracy. True. 
If (max count - min count + 1) is small, enumerating the frequencies is both a more compact and a more precise representation. At the same time, if (max count - min count + 1) is large, we can run out of statistics_target with such a representation. We could use the same representation of the count distribution as for scalar column values, i.e. MCV and HISTOGRAM, but that would require an additional statkind and statistics slot. Perhaps you have better ideas? -- With best regards, Alexander Korotkov.
Re: [HACKERS] Collect frequency statistics for arrays
On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: True. If (max count - min count + 1) is small, enumerating of frequencies is both more compact and more precise representation. Simultaneously, if (max count - min count + 1) is large, we can run out of statistics_target with such representation. We can use same representation of count distribution as for scalar column value: MCV and HISTOGRAM, but it would require additional statkind and statistics slot. Probably, you've better ideas? I wasn't thinking of introducing two different representations, but just trimming the histogram length when it's larger than necessary. On reflection my idea above is wrong; for example assume that we have a column with 900 arrays of length 1 and 100 arrays of length 2. Going by what I said, we'd reduce the histogram to {1,2}, which might accurately capture the set of lengths present but fails to show that 1 is much more common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) would capture the situation perfectly in one-tenth the space that the current logic does. More generally, by limiting the histogram to statistics_target entries, we are already accepting errors of up to 1/(2*statistics_target) in the accuracy of the bin-boundary representation. What the above example shows is that sometimes we could meet the same accuracy requirement with fewer entries. I'm not sure how this could be mechanized but it seems worth thinking about. I can propose the following representation of the histogram. If (max_count - min_count + 1) <= statistics_target then: 1) store max_count and min_count in stavalues; 2) store the frequencies from min_count to max_count in stanumbers. If (max_count - min_count + 1) > statistics_target then store the histogram in the current manner; in that case it's unlikely that there will be many repeating values. I can propose a patch which changes the histogram representation to this. Comments?
-- With best regards, Alexander Korotkov.
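[Editorial note] The representation proposed in the message above can be sketched as follows. This is illustrative Python only, not code from the posted patch; the function name `compress_count_histogram` and the tuple layout are hypothetical stand-ins for the pg_statistic slot contents.

```python
from collections import Counter

def compress_count_histogram(counts, statistics_target):
    """Choose between an explicit frequency table and a quantile
    histogram for the distribution of array lengths ("counts")."""
    lo, hi = min(counts), max(counts)
    if hi - lo + 1 <= statistics_target:
        # Few distinct lengths: store min/max plus one frequency per
        # length -- compact and exact.
        freq = Counter(counts)
        total = len(counts)
        return ('frequencies', lo, hi,
                [freq.get(c, 0) / total for c in range(lo, hi + 1)])
    # Many distinct lengths: fall back to the usual equi-height
    # histogram of statistics_target bin boundaries.
    s = sorted(counts)
    return ('histogram',
            [s[round(i * (len(s) - 1) / (statistics_target - 1))]
             for i in range(statistics_target)])

# Tom's example: 900 arrays of length 1, 100 arrays of length 2.
rep = compress_count_histogram([1] * 900 + [2] * 100, 10)
# Two frequencies (0.9 and 0.1) replace a ten-entry histogram with
# no loss of accuracy.
```

Under this scheme the pg_proc-style examples earlier in the thread, whose counts span only a handful of distinct values, would collapse to a few frequencies instead of a statistics_target-sized histogram.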
Re: [HACKERS] Incorrect behaviour when using a GiST index on points
I believe that the attached version of the patch can be backpatched. It fixes this problem without altering the index building procedure. It just makes the checks in internal pages loose enough to compensate for the effect of the gist_box_same implementation. -- With best regards, Alexander Korotkov. *** a/src/backend/access/gist/gistproc.c --- b/src/backend/access/gist/gistproc.c *** *** 70,76 gist_box_consistent(PG_FUNCTION_ARGS) if (DatumGetBoxP(entry->key) == NULL || query == NULL) PG_RETURN_BOOL(FALSE); ! /* * if entry is not leaf, use rtree_internal_consistent, else use * gist_box_leaf_consistent --- 70,76 if (DatumGetBoxP(entry->key) == NULL || query == NULL) PG_RETURN_BOOL(FALSE); ! /* * if entry is not leaf, use rtree_internal_consistent, else use * gist_box_leaf_consistent *** *** 80,88 gist_box_consistent(PG_FUNCTION_ARGS) query, strategy)); else ! PG_RETURN_BOOL(rtree_internal_consistent(DatumGetBoxP(entry->key), query, strategy)); } static void --- 80,102 query, strategy)); else ! { ! /* ! * Box in internal page can be narrower than box in leaf page not ! * more than EPSILON in each boundary. Do corresponding correction. ! */ ! BOX key, *entrykey; ! ! entrykey = DatumGetBoxP(entry->key); ! key.low.x = entrykey->low.x - EPSILON; ! key.low.y = entrykey->low.y - EPSILON; ! key.high.x = entrykey->high.x + EPSILON; ! key.high.y = entrykey->high.y + EPSILON; ! ! PG_RETURN_BOOL(rtree_internal_consistent(&key, query, strategy)); + } } static void *** *** 847,852 gist_box_same(PG_FUNCTION_ARGS) --- 861,871 BOX *b2 = PG_GETARG_BOX_P(1); bool *result = (bool *) PG_GETARG_POINTER(2); + /* + * box_same function allow difference between boxes limited by EPSILON. + * Thus box in internal page can be narrower than box in leaf page not + * more than EPSILON in each boundary.
+ */ if (b1 b2) *result = DatumGetBool(DirectFunctionCall2(box_same, PointerGetDatum(b1), *** *** 1072,1077 gist_poly_consistent(PG_FUNCTION_ARGS) --- 1091,1097 GISTENTRY *entry = (GISTENTRY *) PG_GETARG_POINTER(0); POLYGON*query = PG_GETARG_POLYGON_P(1); StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2); + BOX key, *entrykey; /* Oid subtype = PG_GETARG_OID(3); */ bool *recheck = (bool *) PG_GETARG_POINTER(4); *** *** 1083,1094 gist_poly_consistent(PG_FUNCTION_ARGS) if (DatumGetBoxP(entry-key) == NULL || query == NULL) PG_RETURN_BOOL(FALSE); /* * Since the operators require recheck anyway, we can just use * rtree_internal_consistent even at leaf nodes. (This works in part * because the index entries are bounding boxes not polygons.) */ ! result = rtree_internal_consistent(DatumGetBoxP(entry-key), (query-boundbox), strategy); /* Avoid memory leak if supplied poly is toasted */ --- 1103,1128 if (DatumGetBoxP(entry-key) == NULL || query == NULL) PG_RETURN_BOOL(FALSE); + entrykey = DatumGetBoxP(entry-key); + if (!GIST_LEAF(entry)) + { + /* + * Box in internal page can be narrower than box in leaf page not + * more than EPSILON in each boundary. Do corresponding correction. + */ + key.low.x = entrykey-low.x - EPSILON; + key.low.y = entrykey-low.y - EPSILON; + key.high.x = entrykey-high.x + EPSILON; + key.high.y = entrykey-high.y + EPSILON; + entrykey = key; + } + /* * Since the operators require recheck anyway, we can just use * rtree_internal_consistent even at leaf nodes. (This works in part * because the index entries are bounding boxes not polygons.) */ ! result = rtree_internal_consistent(entrykey, (query-boundbox), strategy); /* Avoid memory leak if supplied poly is toasted */ *** *** 1152,1158 gist_circle_consistent(PG_FUNCTION_ARGS) /* Oid subtype = PG_GETARG_OID(3); */ bool *recheck = (bool *) PG_GETARG_POINTER(4); ! 
BOX bbox; bool result; /* All cases served by this function are inexact */ --- 1186,1192 /* Oid subtype = PG_GETARG_OID(3); */ bool *recheck = (bool *) PG_GETARG_POINTER(4); ! BOX bbox, *entrykey, key; bool result; /* All cases served by this function are inexact */ *** *** 1170,1177 gist_circle_consistent(PG_FUNCTION_ARGS) bbox.low.x = query-center.x - query-radius; bbox.high.y = query-center.y + query-radius; bbox.low.y = query-center.y - query-radius; ! result = rtree_internal_consistent(DatumGetBoxP(entry-key), bbox, strategy); PG_RETURN_BOOL(result); --- 1204,1225 bbox.low.x = query-center.x - query-radius; bbox.high.y = query-center.y + query-radius; bbox.low.y
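[Editorial note] The correction applied in the patch above can be sketched outside of C. This is illustrative Python; `expand_key` and the box representation are hypothetical, and PostgreSQL's actual EPSILON handling lives in gistproc.c and the geometric operators.

```python
EPSILON = 1.0e-06  # geometric fuzz, as used by PostgreSQL's box comparisons

def expand_key(box, eps=EPSILON):
    """Widen an internal-page bounding box by eps on every side.

    Because gist_box_same() treats boxes whose boundaries differ by
    up to eps as "the same", the stored union box on an internal page
    may be up to eps narrower than the true union of the leaf entries
    below it. Expanding the key before the containment test
    compensates for that, so no matching leaf can be missed.
    """
    (lo_x, lo_y), (hi_x, hi_y) = box
    return ((lo_x - eps, lo_y - eps), (hi_x + eps, hi_y + eps))

def box_contains_point(box, pt):
    (lo_x, lo_y), (hi_x, hi_y) = box
    return lo_x <= pt[0] <= hi_x and lo_y <= pt[1] <= hi_y

# A leaf point lying EPSILON/2 outside the (slightly too narrow)
# internal key would be wrongly pruned without the correction:
key = ((0.0, 0.0), (1.0, 1.0))
pt = (1.0 + EPSILON / 2, 0.5)
assert not box_contains_point(key, pt)          # the false negative
assert box_contains_point(expand_key(key), pt)  # the fix
```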
[HACKERS] Wildcard search support for pg_trgm
Hackers, Here is the first version of a patch which enables index support for wildcard search in the pg_trgm contrib module. The idea of the patch is to extract from the wildcard the trigrams which must occur in any string matching that wildcard. For example, for the '%sector%' wildcard such trigrams would be: 'sec', 'ect', 'tor'. create table words (word text); copy words from '/usr/share/dict/american-english'; test=# explain analyze select * from words where word ilike '%independ%'; QUERY PLAN -- Seq Scan on words (cost=0.00..1703.11 rows=10 width=9) (actual time=18.818..174.146 rows=7 loops=1) Filter: (word ~~* '%independ%'::text) Total runtime: 174.200 ms (3 rows) CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops); test=# explain analyze select * from words where word ilike '%independ%'; QUERY PLAN -- Bitmap Heap Scan on words (cost=4.36..40.11 rows=10 width=9) (actual time=2.445..2.529 rows=7 loops=1) Recheck Cond: (word ~~* '%independ%'::text) -> Bitmap Index Scan on trgm_idx (cost=0.00..4.35 rows=10 width=0) (actual time=2.406..2.406 rows=7 loops=1) Index Cond: (word ~~* '%independ%'::text) Total runtime: 2.612 ms (5 rows) CREATE INDEX trgm_idx ON words USING gin (word gin_trgm_ops); test=# explain analyze select * from words where word ilike '%independ%'; QUERY PLAN --- Bitmap Heap Scan on words (cost=76.08..111.83 rows=10 width=9) (actual time=2.675..2.755 rows=7 loops=1) Recheck Cond: (word ~~* '%independ%'::text) -> Bitmap Index Scan on trgm_idx (cost=0.00..76.07 rows=10 width=0) (actual time=2.642..2.642 rows=7 loops=1) Index Cond: (word ~~* '%independ%'::text) Total runtime: 2.839 ms (5 rows) I've encountered the following problems: 1) Indexing support for ilike is possible only with case-insensitive wildcards, i.e. when the IGNORECASE macro is enabled. But I can't use this macro in pg_trgm.sql.in, where the list of operators is defined.
Probably it is enough to put a comment near IGNORECASE which tells that if one disables this macro, one should also remove the operators from pg_trgm.sql.in? 2) I found the gist index not very useful with the default SIGLENINT = 3. I've changed this value to 15 and found that the gist index performs very well on a dictionary. But on longer strings greater values of SIGLENINT may be required (probably even SIGLENINT > 122 will give a benefit in some cases, in spite of TOAST). With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR 1 % (text, text), OPERATOR 2 <-> (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR 3 ~~ (text, text), + OPERATOR 4 ~~* (text, text), FUNCTION 1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION 2 gtrgm_union (bytea, internal), FUNCTION 3 gtrgm_compress (internal), *** *** 144,149 CREATE OPERATOR CLASS gin_trgm_ops --- 146,153 FOR TYPE text USING gin AS OPERATOR 1 % (text, text), + OPERATOR 3 ~~ (text, text), + OPERATOR 4 ~~* (text, text), FUNCTION 1 btint4cmp (int4, int4), FUNCTION 2 gin_extract_trgm (text, internal), FUNCTION 3 gin_extract_trgm (text, internal, int2, internal, internal), *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 19,24 --- 19,26 /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 => key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 55,61 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 => key will toast, so very slow!!!
*/ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 91,101 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml
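[Editorial note] The trigram-extraction idea described at the top of this message can be sketched as follows. This is illustrative Python only; pg_trgm's real generate_wildcard_trgm() additionally handles padding (LPADDING/RPADDING), IGNORECASE folding, KEEPONLYALNUM filtering, and escape characters.

```python
def wildcard_trigrams(pattern):
    """Extract trigrams that must appear in any string matching a
    LIKE pattern (naive sketch, no padding or case folding)."""
    trigrams = set()
    # Split the pattern into runs of ordinary characters; '%' and '_'
    # break a run, since we cannot know what they stand for.
    for chunk in pattern.replace('_', '%').split('%'):
        for i in range(len(chunk) - 2):
            trigrams.add(chunk[i:i + 3])
    return trigrams

# Every extracted trigram is a hard requirement, so the index scan
# only needs rows whose trigram sets contain all of these:
assert wildcard_trigrams('%independ%') == \
    {'ind', 'nde', 'dep', 'epe', 'pen', 'end'}
# Chunks shorter than three characters contribute nothing, so a
# pattern such as '%aa%' yields no trigrams at all and cannot be
# answered from the trigram index alone:
assert wildcard_trigrams('%aa%') == set()
```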
Re: [HACKERS] Wildcard search support for pg_trgm
I found another problem: GIN index search fails with "GIN indexes do not support whole-index scans" when no trigram can be extracted from the pattern. With best regards, Alexander Korotkov.
Re: [HACKERS] Wildcard search support for pg_trgm
On Mon, Dec 13, 2010 at 12:14 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: How different (and better) is it from wildspeed? The general advantage is the possibility of using wildcard search and trigram similarity search with the same index. I expect the GIN trigram index to be slightly less space-demanding, but slightly slower on search, than wildspeed. Also, I expect the GiST trigram index to be slower on search but faster on updates. However, I haven't checked these assumptions in detail. I lack test datasets for sufficient testing, and I would like to ask the community to help me with that, because testing on dictionaries is good but obviously not enough. With best regards, Alexander Korotkov.
Re: [HACKERS] Wildcard search support for pg_trgm
=0x9a25ee8, altdest=0x9a25ee8, completionTag=0xbfcd9d9c ) at pquery.c:791 #22 0x0825ec78 in exec_simple_query (query_string=value optimized out) at postgres.c:1059 #23 0x0825fe01 in PostgresMain (argc=2, argv=0x99aeb68, username=0x99aead8 smagen) at postgres.c:3939 #24 0x08229250 in BackendRun () at postmaster.c:3577 #25 BackendStartup () at postmaster.c:3262 #26 ServerLoop () at postmaster.c:1448 #27 0x0822be12 in PostmasterMain (argc=3, argv=0x99acc58) at postmaster.c:1109 #28 0x081ce3b7 in main (argc=3, argv=0x99acc58) at main.c:199 With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR1 % (text, text), OPERATOR2 - (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION2 gtrgm_union (bytea, internal), FUNCTION3 gtrgm_compress (internal), *** *** 129,140 RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; --- 131,142 AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! 
CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; *** *** 144,151 CREATE OPERATOR CLASS gin_trgm_ops FOR TYPE text USING gin AS OPERATOR1 % (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal), STORAGE int4; --- 146,155 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal), STORAGE int4; *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 19,24 --- 19,26 /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 55,61 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! 
*/ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 91,101 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml(TRGM *trg1, TRGM *trg2); + booltrgm_contain(TRGM *trg1, TRGM *trg2); + + #define ISESCAPECHAR(x) (*(x) == '\\') + #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') #endif /* __TRGM_H__ */ *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 6,11 --- 6,12 #include trgm.h #include access/gin.h + #include access/skey.h #include access/itup.h #include access/tuptoaster.h #include storage/bufpage.h *** *** 16,21 --- 17,25 PG_FUNCTION_INFO_V1(gin_extract_trgm); Datum gin_extract_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_extract_query_trgm); + Datum gin_extract_query_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_trgm_consistent); Datum gin_trgm_consistent
Re: [HACKERS] Wildcard search support for pg_trgm
Hi, Here is updated version of this patch. With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR1 % (text, text), OPERATOR2 - (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION2 gtrgm_union (bytea, internal), FUNCTION3 gtrgm_compress (internal), *** *** 129,140 RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; --- 131,142 AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; *** *** 144,151 CREATE OPERATOR CLASS gin_trgm_ops FOR TYPE text USING gin AS OPERATOR1 % (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal), STORAGE int4; --- 146,155 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal), ! 
FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal), STORAGE int4; *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 19,24 --- 19,26 /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 55,61 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 91,101 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml(TRGM *trg1, TRGM *trg2); + booltrgm_contain(TRGM *trg1, TRGM *trg2); + + #define ISESCAPECHAR(x) (*(x) == '\\') + #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') #endif /* __TRGM_H__ */ *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 6,11 --- 6,12 #include trgm.h #include access/gin.h + #include access/skey.h #include access/itup.h #include access/tuptoaster.h #include storage/bufpage.h *** *** 16,21 --- 17,25 PG_FUNCTION_INFO_V1(gin_extract_trgm); Datum gin_extract_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_extract_query_trgm); + Datum gin_extract_query_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_trgm_consistent); Datum gin_trgm_consistent(PG_FUNCTION_ARGS); *** *** 58,90 gin_extract_trgm(PG_FUNCTION_ARGS) } Datum gin_trgm_consistent(PG_FUNCTION_ARGS) { bool *check = (bool *) PG_GETARG_POINTER(0); ! /* StrategyNumber strategy = PG_GETARG_UINT16(1); */ /* text*query = PG_GETARG_TEXT_P(2); */ ! int32 nkeys = PG_GETARG_INT32(3); ! 
/* Pointer*extra_data = (Pointer *) PG_GETARG_POINTER(4); */ bool *recheck = (bool *) PG_GETARG_POINTER(5); bool res = FALSE; int32 i
Re: [HACKERS] wildcard search support for pg_trgm
Hi! On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański wulc...@wulczer.org wrote: I see two issues with this patch. First of them is the resulting index size. I created a table with 5 copies of /usr/share/dict/american-english in it and a gin index on it, using gin_trgm_ops. The results were: * relation size: 18MB * index size: 109 MB while without the patch the GIN index was 43 MB. I'm not really sure *why* this happens, as it's not obvious from reading the patch what exactly is this extra data that gets stored in the index, making it more than double its size. Are you sure that you did the comparison correctly? The sequence of index building and data insertion matters. I tried to build a gin index on 5 copies of /usr/share/dict/american-english with the patch and got a 43 MB index. That leads me to the second issue. The pg_trgm code is already woefully uncommented, and after spending quite some time reading it back and forth I have to admit that I don't really understand what the code does in the first place, and so I don't understand what that patch changes. I read all the changes in detail and I couldn't find any obvious mistakes like reading over array boundaries or dereferencing uninitialized pointers, but I can't tell if the patch is correct semantically. All test cases I threw at it work, though. I'll try to write sufficient comments and send a new revision of the patch. This patch changes the names and signatures of some support functions for GIN, and I'm not sure how that affects binary compatibility and pg_upgrade. I tried to create an index with the vanilla source, and then recompile pg_trgm and reindex the table, but it still was not using the index. I think it's because it's missing entries in the catalogs about the index supporting the like strategy. How should this be handled? This patch doesn't alter the structure of the index; it only adds strategies for index scans. In order to update such an index one should recreate the operator class (which will require dropping the index).
This can be done by running uninstall_pg_trgm.sql followed by pg_trgm.sql. After that a new index can be created, and it will support the like strategy. Although there is actually no need to recreate the index itself, I don't see an easier way to do this. With best regards, Alexander Korotkov.
Re: [HACKERS] wildcard search support for pg_trgm
Hello! A new version of the patch is attached. Some comments were added in this version; likely these comments need significant correction because of my English. Some notes about the GIN interface functions: the extract-query and extract-value functions were separated, because with wildcard search these functions no longer do the same thing. New arguments were added to the SQL declarations of the GIN interface functions in order to make them conform to the new GIN interface. See the docs of the development version: http://developer.postgresql.org/pgdocs/postgres/gin-extensibility.html. With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR 1 % (text, text), OPERATOR 2 <-> (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR 3 ~~ (text, text), + OPERATOR 4 ~~* (text, text), FUNCTION 1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION 2 gtrgm_union (bytea, internal), FUNCTION 3 gtrgm_compress (internal), *** *** 129,140 RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; --- 131,142 AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; !
CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; *** *** 144,151 CREATE OPERATOR CLASS gin_trgm_ops FOR TYPE text USING gin AS OPERATOR1 % (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal), STORAGE int4; --- 146,155 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal), STORAGE int4; *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 13,24 --- 13,32 #define LPADDING 2 #define RPADDING 1 #define KEEPONLYALNUM + /* + * IGNORECASE macro means that trigrams is case-insensetive. If this macro is + * disabled, then ~~* operator should be excluded from operator class, because + * we can't handle case-insensetive wildcard search with case-sensetive + * trigrams. + */ #define IGNORECASE #define DIVUNION /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 61,67 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! 
*/ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 97,107 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml(TRGM *trg1, TRGM *trg2); + booltrgm_contain(TRGM *trg1, TRGM *trg2); + + #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */ + #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') /* Wildcard meta-character */ #endif /* __TRGM_H__ */ *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 6,11 --- 6,12 #include trgm.h
Re: [HACKERS] wildcard search support for pg_trgm
Hi! On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański wulc...@wulczer.org wrote: I saw that the code tries to handle ILIKE searches, but apparently it's failing somewhere. It was just a typo. Corrected version attached. With best regards, Alexander Korotkov. *** a/contrib/pg_trgm/pg_trgm.sql.in --- b/contrib/pg_trgm/pg_trgm.sql.in *** *** 113,118 FOR TYPE text USING gist --- 113,120 AS OPERATOR1 % (text, text), OPERATOR2 - (text, text) FOR ORDER BY pg_catalog.float_ops, + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 gtrgm_consistent (internal, text, int, oid, internal), FUNCTION2 gtrgm_union (bytea, internal), FUNCTION3 gtrgm_compress (internal), *** *** 129,140 RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_trgm(text, internal, int2, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; --- 131,142 AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; ! CREATE OR REPLACE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; *** *** 144,151 CREATE OPERATOR CLASS gin_trgm_ops FOR TYPE text USING gin AS OPERATOR1 % (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_trgm (text, internal, int2, internal, internal), ! 
FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal), STORAGE int4; --- 146,155 FOR TYPE text USING gin AS OPERATOR1 % (text, text), + OPERATOR3 ~~ (text, text), + OPERATOR4 ~~* (text, text), FUNCTION1 btint4cmp (int4, int4), FUNCTION2 gin_extract_trgm (text, internal), ! FUNCTION3 gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal), ! FUNCTION4 gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal), STORAGE int4; *** a/contrib/pg_trgm/trgm.h --- b/contrib/pg_trgm/trgm.h *** *** 13,24 --- 13,32 #define LPADDING 2 #define RPADDING 1 #define KEEPONLYALNUM + /* + * IGNORECASE macro means that trigrams is case-insensetive. If this macro is + * disabled, then ~~* operator should be excluded from operator class, because + * we can't handle case-insensetive wildcard search with case-sensetive + * trigrams. + */ #define IGNORECASE #define DIVUNION /* operator strategy numbers */ #define SimilarityStrategyNumber 1 #define DistanceStrategyNumber 2 + #define LikeStrategyNumber 3 + #define ILikeStrategyNumber 4 typedef char trgm[3]; *** *** 53,59 typedef struct /* gist */ #define BITBYTE 8 ! #define SIGLENINT 3 /* 122 = key will toast, so very slow!!! */ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ --- 61,67 /* gist */ #define BITBYTE 8 ! #define SIGLENINT 15 /* 122 = key will toast, so very slow!!! 
*/ #define SIGLEN ( sizeof(int)*SIGLENINT ) #define SIGLENBIT (SIGLEN*BITBYTE - 1) /* see makesign */ *** *** 89,94 typedef char *BITVECP; --- 97,107 extern float4 trgm_limit; TRGM *generate_trgm(char *str, int slen); + TRGM *generate_wildcard_trgm(char *str, int slen); float4 cnt_sml(TRGM *trg1, TRGM *trg2); + booltrgm_contain(TRGM *trg1, TRGM *trg2); + + #define ISESCAPECHAR(x) (*(x) == '\\') /* Wildcard escape character */ + #define ISWILDCARDCHAR(x) (*(x) == '_' || *(x) == '%') /* Wildcard meta-character */ #endif /* __TRGM_H__ */ *** a/contrib/pg_trgm/trgm_gin.c --- b/contrib/pg_trgm/trgm_gin.c *** *** 6,11 --- 6,12 #include trgm.h #include access/gin.h + #include access/skey.h #include access/itup.h #include access/tuptoaster.h #include storage/bufpage.h *** *** 16,21 --- 17,25 PG_FUNCTION_INFO_V1(gin_extract_trgm); Datum gin_extract_trgm(PG_FUNCTION_ARGS); + PG_FUNCTION_INFO_V1(gin_extract_query_trgm); + Datum
Re: [HACKERS] wildcard search support for pg_trgm
On Tue, Feb 1, 2011 at 5:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: AFAICT that would break on-disk compatibility of pg_trgm GIST indexes. I don't believe we have adequate evidence to justify doing that, and in any case it ought to be a separate patch rather than buried inside a mostly unrelated feature patch. OK. Actually, I don't think just increasing SIGLENINT is a solution; I believe we need to have it as an index parameter. I'll try to provide more tests in order to motivate this. With best regards, Alexander Korotkov.
[HACKERS] WIP: store additional info in GIN index
Hackers, The attached patch enables GIN to store additional information with item pointers in posting lists and trees. Such additional information could be positions of words, positions of trigrams, lengths of arrays and so on. This is the first and largest patch of a series of GIN improvements which were presented at PGConf.EU http://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf The patch modifies the GIN interface as follows: 1) Two arguments are added to extractValue: Datum **addInfo, bool **addInfoIsNull 2) Two arguments are added to consistent: Datum addInfo[], bool addInfoIsNull[] 3) A new method, config, is introduced which returns the datatype oid of the additional information (by analogy with the SP-GiST config method). The patch completely changes the storage in posting lists and in leaf pages of posting trees. It uses varbyte encoding for BlockNumber and OffsetNumber; BlockNumbers are stored incrementally within a page. Additionally, one bit of the OffsetNumber is reserved for the additional-information NULL flag. To be able to find a position in a leaf data page quickly, the patch introduces a small index at the end of the page. -- With best regards, Alexander Korotkov. ginaddinfo.1.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
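[Editorial note] The varbyte/delta encoding described in the message above can be sketched as follows. This is illustrative Python, not the patch's C code; function names are hypothetical.

```python
def varbyte_encode(values):
    """Encode non-negative integers (e.g. block-number deltas) using
    7 payload bits per byte plus a continuation bit."""
    out = bytearray()
    for v in values:
        while True:
            byte = v & 0x7F
            v >>= 7
            if v:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def encode_posting_list(block_numbers):
    """Store item pointers incrementally: deltas between sorted block
    numbers are small, so they compress far better than absolute
    32-bit BlockNumbers."""
    deltas = [block_numbers[0]] + [
        b - a for a, b in zip(block_numbers, block_numbers[1:])]
    return varbyte_encode(deltas)

blocks = [100000, 100002, 100003, 100100]
encoded = encode_posting_list(blocks)
# Four absolute BlockNumbers would need 16 bytes; the delta-encoded
# form needs 3 bytes for the first value plus 1 byte per small delta.
assert len(encoded) == 6
```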
Re: [HACKERS] pg_trgm partial-match
Hi! On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao masao.fu...@gmail.com wrote:

Note that we cannot do a partial-match if KEEPONLYALNUM is disabled, i.e., if the query key contains multibyte characters. In this case, the byte length of the trigram string might be larger than three, and its CRC is used as a trigram key instead of the trigram string itself. Because of using a CRC, we cannot do a partial-match. Attached patch extends pg_trgm so that it compares a partial-match query key only when KEEPONLYALNUM is enabled.

Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram character is single-byte?

CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

test=# SELECT * FROM test WHERE val LIKE '%aa%';
 val
------
 aa
 aaa
 шaaш
(3 rows)

test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
 val
-----
 aa
 aaa
(2 rows)

I think we can use partial match only for single-byte encodings. Or, at most, in cases when all alphanumeric characters are single-byte (have no idea how to check this). -- With best regards, Alexander Korotkov.
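A small Python sketch of why multibyte text breaks fixed-size trigram keys as discussed above. The padding and the hash stand-in are simplifications of what pg_trgm actually does (the C code uses a CRC):

```python
# In UTF-8 a three-character trigram can be longer than three bytes, so it
# cannot be stored in a fixed 3-byte key; pg_trgm then falls back to hashing
# the trigram (a CRC in the C code; a plain hash here as a stand-in), and a
# hashed key is opaque, so partial matching on it is impossible.

def trigrams(s):
    padded = '  ' + s + ' '              # pg_trgm-style blank padding (simplified)
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def key_for(trigram):
    data = trigram.encode('utf-8')
    if len(data) <= 3:
        return data                      # stored verbatim: prefix match is possible
    return ('H', hash(data) & 0xFFFFFFFF)  # hashed: opaque, no partial match
```

Note that the three-character trigram 'шaa' is four bytes in UTF-8, so it takes the hashed path even though it looks as short as 'aaa'.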
Re: [HACKERS] pg_trgm partial-match
On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov aekorot...@gmail.com wrote: I think we can use partial match only for single-byte encodings. Or, at most, in cases when all alphanumeric characters are single-byte (have no idea how to check this). Actually, I was also fiddling with the idea of partial match on trigrams when I was working on the initial LIKE patch. But I concluded that we would need a separate opclass which always keeps the full trigram in the entry. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
Hi! New version of patch is attached. The changes are the following: 1) Right way to convert from pg_wchar to multibyte. 2) Optimization of producing the CNFA-like graph on trigrams (produce smaller, but equivalent, graphs in less time). 3) Comments and refactoring. -- With best regards, Alexander Korotkov. trgm-regexp-0.2.patch.gz
Re: [HACKERS] WIP: index support for regexp search
Some quick comments. On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

6) It does not compile - I do get a bunch of errors like this

Fixed.

7) Once fixed, it seems to work

CREATE EXTENSION pg_trgm;
CREATE TABLE TEST (val TEXT);
INSERT INTO test SELECT md5(i::text) FROM generate_series(1,100) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';
                               QUERY PLAN
-------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=16.77..385.16 rows=100 width=33)
   Recheck Cond: (val ~ '.*qqq.*'::text)
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..16.75 rows=100 width=0)
         Index Cond: (val ~ '.*qqq.*'::text)
(4 rows)

but I do get a bunch of NOTICE messages with debugging info (no matter if the GIN index is used or not, so it's somewhere in the common regexp code). But I guess that's due to WIP status.

It's due to the TRGM_REGEXP_DEBUG macro. I disabled it by default. But I think the pieces of code hidden by that macro could be useful for debugging even after WIP status. -- With best regards, Alexander Korotkov. trgm-regexp-0.3.patch.gz
Re: [HACKERS] WIP: index support for regexp search
On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra t...@fuzzy.cz wrote:

2) It's common to use upper-case names for macros, but trgm.h defines macro iswordchr - I see it's moved from trgm_op.c but maybe we could make it a bit more correct?
3) I see there are two #ifdef KEEPONLYALNUM blocks right next to each other in trgm.h - maybe it'd be a good idea to join them?
4) The two new method prototypes at the end of trgm.h use different indentation than the rest (spaces only instead of tabs).

These issues are fixed in the attached patch. Additionally, 3 new macros are introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They limit resource usage during regex processing. -- With best regards, Alexander Korotkov. trgm-regexp-0.4.patch.gz
Re: [HACKERS] WIP: index support for regexp search
On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Glad to see this patch hasn't been totally forgotten. Being able to use indexes for regular expressions would be really cool! Back in January, I asked for some high-level description of how the algorithm works (http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com). That's still sorely needed. Googling around, I found the slides for your presentation on this from PGConf.EU - it would be great to have the information from that presentation included in the patch.

New version of patch is attached. The changes are the following:
1) A big comment with a high-level description of what is going on.
2) Regression tests.
3) Documentation update.
4) Some more refactoring.
-- With best regards, Alexander Korotkov. trgm-regexp-0.5.patch.gz
Re: [HACKERS] WIP: index support for regexp search
Hi! On Wed, Nov 21, 2012 at 12:51 AM, Pavel Stehule pavel.steh...@gmail.com wrote: do you plan to support GiST? At first, I would note that the pg_trgm GiST opclass is quite ridiculous for supporting regex search (and, actually, for LIKE/ILIKE search, which is already implemented too), because in the GiST opclass we store the set of trigrams in leaf pages. It was designed for trigram similarity search, where it makes sense because it eliminates recomputation of the trigram set. But for regex or LIKE/ILIKE search this representation is both lossy and bigger than just the original string. Probably we could think about another opclass for GiST focusing on regex and LIKE/ILIKE search? However, anyway I can create an additional patch for the current GiST opclass. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

Great, that top-level comment helped tremendously! I feel enlightened. I fixed some spelling, formatting etc. trivial stuff while reading through the patch, see attached. Below is some feedback on the details:

* I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an error, without propagating it further or rolling back the whole (sub)transaction. It might work in this case, as you're only suppressing errors with the special sqlcode that are used in the same file, but it nevertheless feels naughty. I believe none of the limits that are being checked are strict; it's OK to exceed the limits somewhat, as long as you terminate the processing in a reasonable time, in case of pathological input. I'd suggest putting an explicit check for the limits somewhere, and not rely on ereport(). Something like this, in the code that recurses:

    if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
        hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
    {
        trgmCNFA->overflowed = true;
        return;
    }

And then check for the overflowed flag at the top level.

The PG_TRY/CATCH trick is replaced with some number of if/return. The code becomes a bit more complicated, but your notes do matter.

* This part of the high-level comment was not clear to me:

    States of the graph produced in the first stage are marked with keys. A key
    is a pair of a prefix and a state of the original automaton. The prefix is
    the last characters read. So, knowing the prefix is enough to know what the
    trigram is when we read some new character. However, we may know only a
    single character of the prefix, or none of its characters. Each state of the
    resulting graph has an enter key (the key with which we've entered this
    state) and a set of keys which are reachable without reading any predictable
    trigram. The algorithm for processing each state of the resulting graph is:
    1) Add all keys which are reachable without reading any predictable trigram.
    2) Add outgoing arcs labeled with trigrams.
    Step 2 leads to the creation of new states and recursively processing them.
    So, we use a depth-first algorithm.

I didn't understand that. Can you elaborate? It might help to work through an example, with some ascii art depicting the graph.

This comment is extended with an example.

* It would be nice to add some comments to the TrgmCNFA struct, explaining which fields are valid at which stages. For example, it seems that the 'trgms' array is calculated only after building the CNFA, by the getTrgmVector() function, while arcsCount is updated on the fly, while recursing in the getState() function.

The TrgmCNFA comment is extended with this.

* What is the representation used for the path matrix? Needs a comment.

Comments are added to PackedTrgmPaths and TrgmStatePath.

* What do the getColorInfo() and scanColorMap() functions do? What exactly does a color represent? What's the tradeoff in choosing MAX_COLOR_CHARS?

The getColorInfo() comment now references the ColorInfo struct, which has a comment. The scanColorMap() comment now has a description of the colormap structure. What a color represents is added to the top comment. A comment on the MAX_COLOR_CHARS tradeoff is added to the macro.

-- With best regards, Alexander Korotkov. trgm-regexp-0.6.patch.gz
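The "(prefix, state)" keys described in the comment above can be illustrated with a toy sketch. Names here are invented for the illustration; the real code also handles partially-known and unknown prefixes and the automaton transition itself:

```python
# Toy illustration of the key idea: while walking the automaton we remember
# the last two characters read, so that consuming the next character yields
# exactly one predictable trigram.

def step(key, ch):
    prefix, state = key
    trigram = None
    if len(prefix) == 2:                 # full prefix known -> trigram predictable
        trigram = prefix + ch
    new_prefix = (prefix + ch)[-2:]      # keep only the last two characters
    return (new_prefix, state), trigram  # automaton transition omitted here
```

With a fully known prefix 'ab', reading 'c' emits the trigram 'abc' and the prefix becomes 'bc'; with only one known prefix character, no trigram can be predicted yet.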
Re: [HACKERS] WIP: index support for regexp search
On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

One thing that bothers me with this algorithm is that the overflow mechanism is all-or-nothing. In many cases, even when there is a huge number of states in the diagram, you could still extract at least a few trigrams that must be present in any matching string, with little effort. At least, it seems like that to a human :-). For example, consider this:

explain analyze select count(*) from azjunk4 where txt ~ ('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');

you get a query plan like this (the long regexp string edited out):

 Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual time=131.100..131.101 rows=1 loops=1)
   ->  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1 width=0) (actual time=131.096..131.096 rows=0 loops=1)
         Recheck Cond: (txt ~ ridiculously long regexp)
         Rows Removed by Index Recheck: 1
         ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx  (cost=0.00..228144.01 rows=1 width=0) (actual time=82.914..82.914 rows=1 loops=1)
               Index Cond: (txt ~ ridiculously long regexp)
 Total runtime: 131.230 ms
(7 rows)

That ridiculously long string exceeds the number of states (I think, could be the number of paths or arcs too), and the algorithm gives up, resorting to scanning the whole index, as can be seen from the Rows Removed by Index Recheck line. However, it's easy to see that any matching string must contain *any* of the possible trigrams the algorithm extracts. If it could safely return just a few of them, say aab and abz, and discard the rest, that would already be much better than a full index scan. Would it be safe to simply stop short the depth-first search on overflow, and proceed with the graph that was constructed up to that point?

For depth-first it's not, but your proposal naturally makes sense. I've changed it to breadth-first search. Then it's safe to mark all unprocessed states as final on overflow. It means that we assume every unprocessed branch to immediately finish with matching (this can give us more false positives but no false negatives). For overflow of matrix collection, it's safe to just OR together all the trigrams. New version of patch is attached. -- With best regards, Alexander Korotkov. trgm-regexp-0.7.patch.gz
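The breadth-first construction with safe overflow handling can be sketched as follows. The graph shape and the state budget are invented for the illustration:

```python
# Sketch of BFS graph construction that degrades gracefully on overflow:
# when the state budget is exceeded, every still-unprocessed state is marked
# final, i.e. we pretend any string reaching it matches. That can only add
# false positives, never false negatives.

from collections import deque

def build(start, neighbors, max_states):
    processed, final = set(), set()
    queue = deque([start])
    seen = {start}
    while queue:
        if len(seen) > max_states:       # overflow: stop expanding and
            final.update(queue)          # treat the whole frontier as final
            break
        state = queue.popleft()
        processed.add(state)
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return processed, final
```

With depth-first order the same trick would be unsafe, because a deep branch could be fully explored while an arc out of the initial state had not been added yet.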
Re: [HACKERS] WIP: index support for regexp search
Hi! On Thu, Nov 29, 2012 at 12:58 PM, er e...@xs4all.nl wrote: On Mon, November 26, 2012 20:49, Alexander Korotkov wrote: trgm-regexp-0.6.patch.gz I ran the simple-minded tests against generated data (similar to the ones I did in January 2012). The problems of that older version seem pretty much all removed (although I didn't do much work on it -- just reran these tests). Thanks a lot for testing! Could you repeat this for the 0.7 version of the patch, which has the new overflow handling? -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Nov 30, 2012 at 3:20 PM, Alexander Korotkov aekorot...@gmail.com wrote: For depth-first it's not. Oh, I didn't explain it. In order to stop graph processing we need to be sure that we have put all outgoing arcs from a state, or assume that state to be final. In DFS we can be in the final part of producing the graph but still not have added some arc (with a new trigram) from the initial state directly to the final state. That obviously leads to false negatives. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 30.11.2012 13:20, Alexander Korotkov wrote: On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Would it be safe to simply stop short the depth-first search on overflow, and proceed with the graph that was constructed up to that point?

For depth-first it's not, but your proposal naturally makes sense. I've changed it to breadth-first search. Then it's safe to mark all unprocessed states as final on overflow. It means that we assume every unprocessed branch to immediately finish with matching (this can give us more false positives but no false negatives). For overflow of matrix collection, it's safe to just OR together all the trigrams. New version of patch is attached.

Thanks, sounds good. I've spent quite a long time trying to understand the transformation the getState/addKeys/addArcs functions do to the original CNFA graph. I think that still needs more comments to explain the steps involved in it. One thing that occurs to me is that it might be better to delay expanding colors to characters. You could build and transform the graph, and even create the paths, all while operating on colors. You would end up with lists of color trigrams, consisting of sequences of three colors that must appear in the source string. Only at the last step would you expand the color trigrams into real character trigrams. I think that would save a lot of processing while building the graph, if you have colors that contain many characters. It would allow us to do better than the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current patch, it's a shame that we can't do anything with a fairly simple regexp like ^a[b-g]h$

Nice idea to delay expanding colors to characters! Obviously, we should delay expanding only alphanumeric characters, because non-alphanumeric characters influence the graph structure. Trying to implement... -- With best regards, Alexander Korotkov.
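The delayed color expansion being discussed could look roughly like this. A hedged sketch under invented names and an invented expansion budget: each position of a color trigram is a set of characters, and only at the last step is it expanded into plain character trigrams:

```python
# Sketch: build everything on "color" trigrams (each position is a set of
# characters) and expand them into character trigrams only at the very end.
# Color trigrams whose expansion would be too large are left unexpanded.

from itertools import product

def expand(color_trigram, max_expanded=1000):
    c1, c2, c3 = color_trigram
    if len(c1) * len(c2) * len(c3) > max_expanded:
        return None                      # too many combinations: leave unexpanded
    return {a + b + c
            for a, b, c in product(sorted(c1), sorted(c2), sorted(c3))}
```

For the ^a[b-g]h$ example above, the single color trigram ({'a'}, {'b'..'g'}, {'h'}) expands to just six character trigrams, which is exactly the case a fixed per-color character limit like MAX_COLOR_CHARS = 4 cannot handle.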
Re: [HACKERS] WIP: index support for regexp search
On Sat, Dec 1, 2012 at 3:22 PM, Erik Rijkers e...@xs4all.nl wrote: I've attached a similar test re-run that compares HEAD with patch versions 0.6 and 0.7. Thanks! Did you write scripts for automated testing? It would be nice if you shared them. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 02.12.2012 20:19, Tom Lane wrote: Alexander Korotkov aekorot...@gmail.com writes: Nice idea to delay expanding colors to characters! Obviously, we should delay expanding only alphanumeric characters, because non-alphanumeric characters influence the graph structure. Trying to implement...

Uh, why would that be? Colors are colors. The regexp machinery doesn't care whether they represent alphanumerics or not. (Or to be more precise, if there is a situation where it makes a difference, separate colors will have been created for each set of characters that need to be distinguished.)

The regexp machinery doesn't care, but the trigrams that pg_trgm extracts only contain alphanumeric characters. So if by looking at the CNFA graph produced by the regexp machinery you conclude that any matching strings must contain the three-letter sequences %oo and #oo, you can just club them together into the oo trigram. I think you can run a pre-processing step on the colors, and merge colors that are equivalent as far as trigrams are concerned. For example, if you have a color that contains only the character '%', and another that contains the character '#', you can treat them as the same color. You might then be able to simplify the CNFA. Actually, it would be even better if you could apply the pre-processing to the regexp before the regexp machinery turns it into a CNFA. Not sure how easy it would be to do such pre-processing.

Treating colors as the same should be possible only for colors which have no alphanumeric characters, because colors are non-overlapping. However, this optimization could be significant in some cases.

BTW, why create the path matrix? You could check the check array of trigrams in the consistent function directly against the graph. Consistent should return true iff there is a path through the graph following only arcs that contain trigrams present in the check array. Finding a path through a complex graph could be expensive, O(|E|), but if the path is complex, the path matrix would be large as well, and checking against a large matrix isn't exactly free either. It would allow you to avoid overflows caused by having too many paths through the graph.

Actually, I generally dislike the path matrix for the same reasons. But:
1) Output graphs could contain trigrams which are completely useless for search. For example, for the regex /(abcdefgh)*ijk/ we need only the ijk trigram, while the graph would contain much more. The path matrix is a method to get rid of all of them.
2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow a path from the initial state to the final state through unexpanded trigrams. It seems much harder to do with the graph than with the matrix.
-- With best regards, Alexander Korotkov.
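Heikki's matrix-free alternative, checking the trigram graph directly in the consistent function, can be sketched like so. The graph encoding is invented for the illustration:

```python
# Sketch: consistent returns true iff the final state is reachable from the
# initial state following only arcs whose trigram is present in the check
# array. Graph: {state: [(trigram, next_state), ...]}.

def consistent(graph, init, final, present):
    stack, seen = [init], {init}
    while stack:
        state = stack.pop()
        if state == final:
            return True
        for trigram, nxt in graph.get(state, []):
            if trigram in present and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False
```

This is O(|E|) per consistent call, which is the trade-off discussed above: no path matrix to overflow, but more work at query time.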
Re: [HACKERS] WIP: store additional info in GIN index
Hi! On Sun, Dec 2, 2012 at 5:02 AM, Tomas Vondra t...@fuzzy.cz wrote: I've tried to apply the patch with the current HEAD, but I'm getting segfaults whenever VACUUM runs (either called directly or from autovac workers). The patch applied cleanly against 9b3ac49e and needed a minor fix when applied on HEAD (because of an assert added to ginRedoCreatePTree), but that shouldn't be a problem. Thanks for testing! The patch is rebased against HEAD. The bug you reported is fixed. -- With best regards, Alexander Korotkov. ginaddinfo.2.patch.gz
Re: [HACKERS] Patch for removng unused targets
On Mon, Dec 3, 2012 at 8:31 PM, Tom Lane t...@sss.pgh.pa.us wrote: Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes: Sorry for the delay. I've reviewed the patch. It was applied successfully, and it worked well for tests I did including the example you showed. I think it's worth the work, but I'm not sure you go about it in the right way. (I feel the patch decreases code readability more than it gives an advantage.) One thought here is that I don't particularly like adding a field like resorderbyonly to TargetEntry in the first place. That makes this optimization the business of the parser, which it should not be; and furthermore makes it incumbent on the rewriter, as well as anything else that manipulates parsetrees, to maintain the flag correctly while rearranging queries. It would be better if this were strictly the business of the planner. But having said that, I'm wondering (without having read the patch) why you need anything more than the existing resjunk field.

Actually, I don't know all the cases in which the resjunk flag is set. Is it reliable to decide that a target is used only for ORDER BY if it's resjunk and is neither a system column nor used in grouping? If so, or if there are some other cases which are easy to determine, then I'll remove the resorderbyonly flag. -- With best regards, Alexander Korotkov.
Re: [HACKERS] WIP: store additional info in GIN index
On Tue, Dec 4, 2012 at 9:34 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Nov 18, 2012 at 4:54 PM, Alexander Korotkov aekorot...@gmail.com wrote: Patch completely changes storage in posting lists and leaf pages of posting trees. It uses varbyte encoding for BlockNumber and OffsetNumber. BlockNumbers are stored incrementally in the page. Additionally, one bit of OffsetNumber is reserved for the additional-information NULL flag. To be able to find a position in a leaf data page quickly, the patch introduces a small index at the end of the page. This sounds like it means that this would break pg_upgrade, about which I'm not too keen. Ideally, we'd like to have a situation where new indexes have additional capabilities, but old indexes are still usable for things that they could do before. I am not sure whether that's a realistic goal.

That would mean having two versions of the code that deals with posting trees and lists. It seems unlikely to me that we have the resources to maintain that. -- With best regards, Alexander Korotkov.
Re: [HACKERS] Patch for removng unused targets
On Tue, Dec 4, 2012 at 11:52 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: Is it reliable to decide that a target is used only for ORDER BY if it's resjunk and is neither a system column nor used in grouping? resjunk means that the target is not supposed to be output by the query. Since it's there at all, it's presumably referenced by ORDER BY or GROUP BY or DISTINCT ON, but the meaning of the flag doesn't depend on that. What you would need to do is verify that the target is resjunk and not used in any clause besides ORDER BY. I have not read your patch, but I rather imagine that what you've got now is that the parser checks this and sets the new flag for consumption far downstream. Why not just make the same check in the planner? A more invasive, but possibly cleaner in the long run, approach is to strip all resjunk targets from the query's tlist at the start of planning and only put them back if needed. BTW, when I looked at this a couple years ago, it seemed like the major problem was that the planner assumes that all plans for the query should emit the same tlist, and thus that tlist eval cost isn't a distinguishing factor. Breaking that assumption seemed to require rather significant refactoring. I never found the time to try to actually do it.

Maybe there is some way to not remove items from the tlist, but avoid actually evaluating them? -- With best regards, Alexander Korotkov.
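The check Tom describes, that a target is removable iff it is resjunk and referenced by nothing except ORDER BY, can be modeled with a toy sketch. TargetEntry here is a stand-in with made-up fields, not PostgreSQL's actual node:

```python
# Toy model of stripping ORDER BY-only junk targets from a target list.
# A target survives unless it is resjunk AND its only reference is ORDER BY.

from dataclasses import dataclass, field

@dataclass
class TargetEntry:
    expr: str
    resjunk: bool = False
    referenced_by: set = field(default_factory=set)  # e.g. {'ORDER BY', 'GROUP BY'}

def strip_orderby_only(tlist):
    return [te for te in tlist
            if not (te.resjunk and te.referenced_by <= {'ORDER BY'})]
```

A resjunk target that is also used by GROUP BY is kept, which is the case that makes a bare resjunk test insufficient.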
Re: [HACKERS] WIP: store additional info in GIN index
On Wed, Dec 5, 2012 at 1:56 AM, Tomas Vondra t...@fuzzy.cz wrote: On 4.12.2012 20:12, Alexander Korotkov wrote: Hi! On Sun, Dec 2, 2012 at 5:02 AM, Tomas Vondra t...@fuzzy.cz mailto:t...@fuzzy.cz wrote: I've tried to apply the patch with the current HEAD, but I'm getting segfaults whenever VACUUM runs (either called directly or from autovac workers). The patch applied cleanly against 9b3ac49e and needed a minor fix when applied on HEAD (because of an assert added to ginRedoCreatePTree), but that shouldn't be a problem. Thanks for testing! Patch is rebased with HEAD. The bug you reported was fixed. Applies fine, but I get a segfault in dataPlaceToPage at gindatapage.c. The whole backtrace is here: http://pastebin.com/YEPuWeuV The messages written into PostgreSQL log are quite variable - usually it looks like this: 2012-12-04 22:31:08 CET 31839 LOG: database system was not properly shut down; automatic recovery in progress 2012-12-04 22:31:08 CET 31839 LOG: redo starts at 0/68A76E48 2012-12-04 22:31:08 CET 31839 LOG: unexpected pageaddr 0/1BE64000 in log segment 00010069, offset 15089664 2012-12-04 22:31:08 CET 31839 LOG: redo done at 0/69E63638 but I've seen this message too 2012-12-04 22:20:29 CET 31709 LOG: database system was not properly shut down; automatic recovery in progress 2012-12-04 22:20:29 CET 31709 LOG: redo starts at 0/AEAFAF8 2012-12-04 22:20:29 CET 31709 LOG: record with zero length at 0/C7D5698 2012-12-04 22:20:29 CET 31709 LOG: redo done at 0/C7D55E I wasn't able to prepare a simple testcase to reproduce this, so I've attached two files from my fun project where I noticed it. It's a simple DB + a bit of Python for indexing mbox archives inside Pg. 
- create.sql - a database structure with a bunch of GIN indexes on tsvector columns on the messages table
- load.py - script for parsing mbox archives / loading them into the messages table (warning: it's a bit messy)

Usage:
1) create the DB structure
   $ createdb archives
   $ psql archives < create.sql
2) fetch some archives (I consistently get SIGSEGV after the first three)
   $ wget http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-01.gz
   $ wget http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-02.gz
   $ wget http://archives.postgresql.org/pgsql-hackers/mbox/pgsql-hackers.1997-03.gz
3) gunzip and load them using the python script
   $ gunzip pgsql-hackers.*.gz
   $ ./load.py --db archives pgsql-hackers.*
4) et voila - a SIGSEGV :-(

I suspect this might be related to the fact that the load.py script uses savepoints quite heavily to handle UNIQUE_VIOLATION (duplicate messages).

Thanks for the bug report. It is fixed in the attached patch. -- With best regards, Alexander Korotkov. ginaddinfo.3.patch.gz
Re: [HACKERS] Statistics and selectivity estimation for ranges
Hi, Jeff! Thanks a lot for the review! On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis pg...@j-davis.com wrote: It looks like there are still some problems with this patch.

CREATE TABLE foo(ir int4range);
insert into foo select 'empty' from generate_series(1,1);
insert into foo select int4range(NULL, g, '(]') from generate_series(1,100) g;
insert into foo select int4range(g, NULL, '[)') from generate_series(1,100) g;
insert into foo select int4range(g, ((g*1.01)+10)::int4, '[]') from generate_series(1,100) g;
CREATE TABLE bar(ir) AS select * from foo order by random();
ANALYZE bar;

Now: EXPLAIN ANALYZE SELECT * FROM bar WHERE ir @ int4range(1,2); The estimates are -nan. Similar for many other queries.

Oh, yeah! It appears that infinities require much more cautious handling than I supposed. That should be fixed in the attached version of the patch. However, it requires significant rethinking of the comments. I will update the comments and address your questions in a couple of days. Could you recheck whether the attached patch really fixes the problem you reported? -- With best regards, Alexander Korotkov. range_stat-0.9.patch.gz
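The failure mode Jeff found is easy to reproduce outside the patch: naive selectivity arithmetic on ranges with infinite bounds yields NaN from inf/inf (or inf - inf), so every bound computation has to special-case infinities. A minimal sketch under that assumption, not the patch's actual estimator:

```python
# Illustration of the -nan estimates: a fraction-of-overlap computation that
# guards against NaN produced by arithmetic on infinite range bounds, falling
# back to a default guess (the 0.5 here is an arbitrary placeholder).

import math

def overlap_fraction(lo1, hi1, lo2, hi2):
    overlap = min(hi1, hi2) - max(lo1, lo2)
    length = hi1 - lo1
    if overlap <= 0:
        return 0.0                       # ranges don't intersect
    frac = overlap / length if length > 0 else 1.0
    if math.isnan(frac):                 # inf/inf from infinite bounds
        return 0.5                       # arbitrary fallback guess
    return min(frac, 1.0)
```

Without the isnan guard, a query range with an infinite bound against an unbounded column range propagates NaN straight into the selectivity estimate, which is what showed up as "-nan" in EXPLAIN.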
Re: [HACKERS] gistchoose vs. bloat
Hi! On Sat, Dec 8, 2012 at 7:05 PM, Andres Freund and...@2ndquadrant.com wrote: I notice there's no documentation about the new reloption at all? Thanks for the notice! I've added a small description to the docs in the attached patch. -- With best regards, Alexander Korotkov. gist_choose_bloat-0.5.patch
Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree
Hi! On Sun, Nov 4, 2012 at 11:41 PM, Jeff Davis pg...@j-davis.com wrote: On Fri, 2012-11-02 at 12:47 +0400, Alexander Korotkov wrote: Right version of patch is attached.

* In bounds_adjacent, there's no reason to flip the labels back.
Fixed.
* Comment should indicate more explicitly that bounds_adjacent is sensitive to the argument order.
Fixed.
* In bounds_adjacent, it appears that bound1 corresponds to B in the comment above, and bound2 corresponds to A in the comment above. I would have guessed from reading the comment that bound1 corresponded to A. We should just use consistent names between the comment and the code (e.g. boundA and boundB).
Fixed.
* I could be getting confused, but I think that line 645 of rangetypes_spgist.c should be inverted (!bounds_adjacent(...)).
Good catch! Actually, the code was in direct contradiction with the comment. Fixed.
* I think needPrevious should have an explanatory comment somewhere. It looks like you are using it to store some state as you descend the tree, but it doesn't look like it's used to reconstruct the value (because we already have the value anyway). Since it's being used for a purpose other than what's intended, that should be explained.
Yes, it's some kind of a hack now. A comment is added.
-- With best regards, Alexander Korotkov. range_spgist_adjacent-0.4.patch.gz
Re: [HACKERS] WIP: index support for regexp search
On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov aekorot...@gmail.com wrote:
> Actually, I generally dislike path matrix for same reasons. But:
> 1) Output graphs could contain trigrams which are completely useless for
> search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram
> while graph would contain much more. Path matrix is a method to get rid of
> all of them.
> 2) If we use color trigrams then we need some criteria for which color
> trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
> from initial state to the final by unexpanded trigrams. It seems much
> harder to do with graph than with matrix.

Now, I have an idea about doing a not comprehensive, but simple and fast simplification of the graph. I'm doing experiments now. In case of success we could get rid of the path matrix.

--
With best regards,
Alexander Korotkov.
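The /(abcdefgh)*ijk/ point can be checked mechanically: the only trigram guaranteed to appear in every matching string is 'ijk'. A small Python illustration (pg_trgm's blank padding and lowercasing are omitted for brevity):

```python
import re

# Every string matching /(abcdefgh)*ijk/ ends in "ijk", so "ijk" is the
# only trigram common to all matches; the trigrams of the starred group
# may be absent entirely (e.g. in the match "ijk" itself).
pattern = re.compile(r'(abcdefgh)*ijk')
matches = ['ijk', 'abcdefghijk', 'abcdefghabcdefghijk']
assert all(pattern.fullmatch(m) for m in matches)

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

common = set.intersection(*(trigrams(m) for m in matches))
print(common)  # only 'ijk' survives the intersection
```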
Re: [HACKERS] gistchoose vs. bloat
On Fri, Dec 14, 2012 at 12:46 PM, Jeff Davis pg...@j-davis.com wrote:
> > Thanks for notice! I've added small description to docs in the attached
> > patch.
>
> Here is an edited version of the documentation note. Please review to see
> if you like my version.

The edited version looks good to me.

> Also, I fixed a compiler warning.

Thanks!

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Fri, Dec 14, 2012 at 1:34 AM, Alexander Korotkov aekorot...@gmail.com wrote:
> On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov aekorot...@gmail.com wrote:
> > Actually, I generally dislike path matrix for same reasons. But:
> > 1) Output graphs could contain trigrams which are completely useless for
> > search. For example, for regex /(abcdefgh)*ijk/ we need only ijk trigram
> > while graph would contain much more. Path matrix is a method to get rid
> > of all of them.
> > 2) If we use color trigrams then we need some criteria for which color
> > trigrams to expand into trigrams. Simultaneously, we shouldn't allow path
> > from initial state to the final by unexpanded trigrams. It seems much
> > harder to do with graph than with matrix.
>
> Now, I have an idea about doing some not comprehensive but simple and fast
> simplification of graph. I'm doing experiments now. In case of success we
> could get rid of path matrix.

The attached patch has the following changes:
1) Postpone expansion of colors. The graph is built on color trigrams.
2) Selective expansion of color trigrams into simple trigrams. All non-expanded color trigrams are removed. Such removal leads to a union of all state pairs connected by the corresponding arcs. Of course, this must not lead to a union of the initial and final states: that would render all the previous work pointless.

--
With best regards,
Alexander Korotkov.

trgm-regexp-0.8.patch.gz
Description: GNU Zip compressed data
Re: [HACKERS] WIP: index support for regexp search
Hi!

On Mon, Dec 17, 2012 at 12:54 PM, Erik Rijkers e...@xs4all.nl wrote:
> On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:
> > trgm-regexp-0.8.patch.gz 22 k
>
> Hi Alexander,
>
> I gave this a quick try; the patch works when compiled for DEBUG, but
> crashes as a 'speed'-compiled binary:
>
> Compile for speed:
> $ pg_config --configure
> '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8'
> '--with-pgport=6556' '--enable-depend' '--with-openssl' '--with-perl'
> '--with-libxml'
>
> $ psql
> psql (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
> Type "help" for help.
>
> testdb=# explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
> The connection to the server was lost. Attempting reset: Failed.
> ! \q

I haven't reproduced it yet. Can you retry it with this line uncommented:

#define TRGM_REGEXP_DEBUG

Then we can see at which stage it fails.

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] WIP: index support for regexp search
On Mon, Dec 17, 2012 at 1:16 PM, Alexander Korotkov aekorot...@gmail.com wrote:
> I haven't reproduced it yet. Can you retry it with this line uncommented:
>
> #define TRGM_REGEXP_DEBUG
>
> Then we can see at which stage it fails.

The bug has been found and fixed in the attached patch.

--
With best regards,
Alexander Korotkov.

trgm-regexp-0.9.patch.gz
Description: GNU Zip compressed data
Re: [HACKERS] WIP: index support for regexp search
On Tue, Dec 18, 2012 at 11:45 AM, Erik Rijkers e...@xs4all.nl wrote:
> On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:
> I ran the same test again: HEAD versus trgm_regex v6, 7 and 9. In v9 there
> is some gain but also some regression. It remains a difficult problem...
> If I get some time in the holidays I'll try to diversify the test program;
> it is now too simple.

Note that regexes which contain {,n} likely don't do what you expect.

test=# select 'xq' ~ 'x[aeiou]{,2}q';
 ?column?
----------
 f
(1 row)

test=# select 'xa{,2}q' ~ 'x[aeiou]{,2}q';
 ?column?
----------
 t
(1 row)

You should use {0,n} to express from 0 to n occurrences.

--
With best regards,
Alexander Korotkov.
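The quirk is that PostgreSQL's ARE flavor treats a bound with an omitted lower limit as literal text, while some other engines (Python's re, for one) read {,n} as {0,n}. This sketch mimics the PostgreSQL results shown above from Python; pg_like_match is a hypothetical helper of mine, not a real API:

```python
import re

def pg_like_match(regex, s):
    # PostgreSQL AREs treat "{,n}" (lower bound omitted) as literal
    # text, while Python's re reads it as "{0,n}".  Escape such bounds
    # before compiling to mimic the PostgreSQL behaviour.
    literalized = re.sub(r'\{,(\d+)\}', lambda m: re.escape(m.group(0)), regex)
    return re.fullmatch(literalized, s) is not None

print(pg_like_match('x[aeiou]{,2}q', 'xq'))       # False, like the "f" above
print(pg_like_match('x[aeiou]{,2}q', 'xa{,2}q'))  # True, like the "t" above
print(pg_like_match('x[aeiou]{0,2}q', 'xq'))      # True: {0,n} quantifies
```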
Re: [HACKERS] WIP: index support for regexp search
On Tue, Dec 18, 2012 at 12:51 PM, Erik Rijkers e...@xs4all.nl wrote:
> On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:
> > You should use {0,n} to express from 0 to n occurences.
>
> Thanks, but I know that of course. It's a testing program; and in the end
> robustness with unexpected or even wrong input is as important as
> performance. (to put it bluntly, I am also trying to get your patch to
> fall over ;-))

I found most of the regressions in the 0.9 version to be in the {,n} cases. The new version of the patch uses more trigrams than previous versions. For example, for the regex 'x[aeiou]{,2}q', the 0.7 version uses the trigrams '__2', '_2_' and '__q', while the 0.9 version uses the trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_', '__2', '_2_' and '__q'. But actually, the trigram '__2' or '_2_' never occurs; it's enough to have one of them, and all the others just cause a slowdown. Simultaneously, we can't reasonably decide which trigrams to use without knowing their frequencies. For example, if the trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_' were altogether rarer than '__2', the newer version of the patch would be faster.

--
With best regards,
Alexander Korotkov.
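The frequency argument can be made concrete with a toy cost model. All numbers below are invented for illustration, and the cost proxy (summed trigram frequency) is mine, not the patch's; with different frequencies the other set would win, which is exactly the dilemma described above:

```python
# Toy illustration of why trigram choice needs frequency statistics.
# A cheap proxy for the cost of a trigram set is the summed frequency
# of its trigrams (more frequent trigrams -> more index entries to
# scan).  All frequencies here are invented.
freq = {'xa_': 0.004, 'xe_': 0.004, 'xi_': 0.004, 'xo_': 0.004,
        'xu_': 0.004, '__2': 0.0001}

expanded = ['xa_', 'xe_', 'xi_', 'xo_', 'xu_']  # 0.9-style expansion
literal = ['__2']                               # 0.7-style single trigram

cost = lambda trigrams: sum(freq[t] for t in trigrams)
better = min((expanded, literal), key=cost)
print(better)  # under these invented frequencies the single trigram wins
```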
Re: [HACKERS] WIP: store additional info in GIN index
Hi!

On Thu, Dec 6, 2012 at 5:44 AM, Tomas Vondra t...@fuzzy.cz wrote:
> Then I've run a simple benchmarking script, and the results are not as
> good as I expected, actually I'm getting much worse performance than with
> the original GIN index.
>
> The following table contains the time of loading the data (not a big
> difference), and the number of queries per minute for various numbers of
> words in the query. The queries look like this
>
>   SELECT id FROM messages
>    WHERE body_tsvector @@ plainto_tsquery('english', 'word1 word2 ...')
>
> so it's really the simplest form of FTS query possible.
>
>             without patch | with patch
>   loading      750 sec    |  770 sec
>   1 word      1500        | 1100
>   2 words    23000        | 9800
>   3 words    24000        | 9700
>   4 words    16000        | 7200
>
> I'm not saying this is a perfect benchmark, but the differences (of
> querying) are pretty huge. Not sure where this difference comes from, but
> it seems to be quite consistent (I usually get +-10% results, which is
> negligible considering the huge difference).
>
> Is this an expected behaviour that will be fixed by another patch?

Other patches which significantly accelerate index search will be provided. This patch changes only the GIN posting list/tree storage. However, it wasn't expected that this patch would significantly change index scan speed in either direction.

> The database contains ~680k messages from the mailing list archives, i.e.
> about 900 MB of data (in the table), and the GIN index on the tsvector is
> about 900 MB too. So the whole dataset nicely fits into memory (8GB RAM),
> and it seems to be completely CPU bound (no I/O activity at all).
>
> The configuration was exactly the same in both cases:
>
>   shared buffers = 1GB
>   work mem = 64 MB
>   maintenance work mem = 256 MB
>
> I can either upload the database somewhere, or provide the benchmarking
> script if needed.

Unfortunately, I can't reproduce such a huge slowdown on my test cases. Could you share both the database and the benchmarking script?

--
With best regards,
Alexander Korotkov.
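For reference, the reported figures amount to roughly a 1.4x to 2.5x drop in queries per minute, which a quick computation from the table above makes explicit:

```python
# Queries per minute from Tomas's table: (without patch, with patch).
results = {
    '1 word':  (1500, 1100),
    '2 words': (23000, 9800),
    '3 words': (24000, 9700),
    '4 words': (16000, 7200),
}
for label, (without, with_patch) in results.items():
    print(f'{label}: {without / with_patch:.1f}x slower with the patch')
```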
Re: [HACKERS] Statistics and selectivity estimation for ranges
On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis pg...@j-davis.com wrote:
> And I have a few other questions/comments:
>
> * Why is summ spelled with two ms? Is it short for summation? If so,
> might be good to use "summation of" instead of "integrate" in the comment.

Fixed.

> * Why does get_length_hist_frac return 0.0 when i is the last value? Is
> that a mistake?

The comment was wrong. Actually, it returns the fraction of ranges whose length is *greater*.

> * I am still confused by the distinction between rbound_bsearch and
> rbound_bsearch_bin. What is the intuitive purpose of each?

I've added corresponding comments. rbound_bsearch is for scalar operators and for the bin corresponding to the upper bound. rbound_bsearch_bin is now rbound_bsearch_bin_lower; it is for the bin corresponding to the lower bound.

> * You use "constant value" in the comments in several places. Would
> "query value" or "search key" be better?

Yes. Fixed.

I also renamed get_length_hist_frac to get_length_hist_summ and rewrote the comments about it. Hopefully it is more understandable now.

--
With best regards,
Alexander Korotkov.

range_stat-0.10.patch.gz
Description: GNU Zip compressed data
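The "fraction of ranges whose length is greater" idea can be sketched independently of the patch. This is my own illustrative Python, not the patch's get_length_hist_summ; it assumes equi-depth histogram bounds and linear interpolation within a bin, which is the usual histogram-selectivity approach:

```python
import bisect

def frac_length_greater(bounds, length):
    # bounds: sorted equi-depth histogram bounds over range lengths,
    # so each of the len(bounds) - 1 bins holds an equal share of rows.
    if length >= bounds[-1]:
        return 0.0
    if length <= bounds[0]:
        return 1.0
    i = bisect.bisect_right(bounds, length) - 1
    # Fraction of bin [bounds[i], bounds[i+1]) lying above `length`,
    # assuming lengths are uniformly distributed within the bin.
    within = (bounds[i + 1] - length) / (bounds[i + 1] - bounds[i])
    bins_above = len(bounds) - 2 - i
    return (bins_above + within) / (len(bounds) - 1)

print(frac_length_greater([0, 10, 20, 40, 80], 30))  # one full bin plus half a bin
```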