Re: [HACKERS] multibyte charater set in levenshtein function
2010/8/28 Alexander Korotkov aekorot...@gmail.com: Now test for levenshtein_less_equal performance. Nice results. I'll try to find time to look at this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multibyte charater set in levenshtein function
On Aug 28, 2010, at 8:34 AM, Alexander Korotkov aekorot...@gmail.com wrote: Here is the patch which adds levenshtein_less_equal function. I'm going to add it to current commitfest. Cool. Please submit some performance results comparing levenshtein in HEAD vs. levenshtein with this patch vs. levenshtein_less_equal. Perhaps the test cases we used previously would be a good place to start. ...Robert
Re: [HACKERS] multibyte charater set in levenshtein function
Here is the patch which adds the levenshtein_less_equal function. I'm going to add it to the current commitfest. With best regards, Alexander Korotkov.

On Tue, Aug 3, 2010 at 3:23 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov aekorot...@gmail.com wrote: Now I think patch is as good as can be. :) OK, committed. I'm going to prepare less-or-equal function in same manner as this patch. Sounds good. Since we're now more than half-way through this CommitFest and this patch has undergone quite a bit of change from what we started out with, I'd like to ask that you submit the new patch for the next CommitFest, to begin September 15th. https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company

*** a/contrib/fuzzystrmatch/fuzzystrmatch.c
--- b/contrib/fuzzystrmatch/fuzzystrmatch.c
*************** PG_MODULE_MAGIC;
*** 61,66 ****
--- 61,68 ----
   */
  extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
  extern Datum levenshtein(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
  extern Datum metaphone(PG_FUNCTION_ARGS);
  extern Datum soundex(PG_FUNCTION_ARGS);
  extern Datum difference(PG_FUNCTION_ARGS);
*************** soundex_code(char letter)
*** 92,98 ****
  #define MAX_LEVENSHTEIN_STRLEN 255

  static int	levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c);

  /*
--- 94,100 ----
  #define MAX_LEVENSHTEIN_STRLEN 255

  static int	levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c, int max_d);

  /*
*************** rest_of_char_same(const char *s1, const char *s2, int len)
*** 202,211 ****
   * between supplied strings. Generally
   * (1, 1, 1) penalty costs suffices common
   * cases, but your mileage may vary.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c)
  {
  	int			m,
  				n,
--- 204,217 ----
   * between supplied strings. Generally
   * (1, 1, 1) penalty costs suffices common
   * cases, but your mileage may vary.
+  * Returns accurate value if max_d < 0 or
+  * actual distance is less or equal than
+  * max_d, otherwise returns value greater
+  * than max_d.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c, int max_d)
  {
  	int			m,
  				n,
*************** levenshtein_internal(text *s, text *t,
*** 219,224 ****
--- 225,233 ----
  	const char *s_data;
  	const char *t_data;
  	const char *y;
+ 	const char *prev_x = NULL;
+ 	int			curr_left = 0, curr_right = 0, prev_left, prev_right, d;
+ 	int			delta = 0, min_d = 0, theor_max_d;

  	/* Extract a pointer to the actual character data. */
  	s_data = VARDATA_ANY(s);
*************** levenshtein_internal(text *s, text *t,
*** 238,243 ****
--- 247,266 ----
  		return n * ins_c;
  	if (!n)
  		return m * del_c;
+
+ 	/*
+ 	 * There is a theoretical maximum distance based on string lengths. It
+ 	 * represents the case when no characters are matching. If max_d
+ 	 * reaches this value then we can use the accurate calculation of
+ 	 * distance which is faster in this case.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		theor_max_d = Min(m * del_c + n * ins_c, (m > n) ?
+ 			(n * sub_c + (m - n) * del_c) : (m * sub_c + (n - m) * ins_c)) - 1;
+ 		if (max_d >= theor_max_d)
+ 			max_d = -1;
+ 	}

  	/*
  	 * For security concerns, restrict excessive CPU+RAM usage. (This
*************** levenshtein_internal(text *s, text *t,
*** 251,256 ****
--- 274,296 ----
  					MAX_LEVENSHTEIN_STRLEN)));

  	/*
+ 	 * We can find the minimal distance from the difference of string lengths.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		delta = m - n;
+ 		if (delta > 0)
+ 			min_d = delta * del_c;
+ 		else if (delta < 0)
+ 			min_d = -delta * ins_c;
+ 		else
+ 			min_d = 0;
+
+ 		if (min_d > max_d)
+ 			return max_d + 1;
+ 	}
+
+ 	/*
  	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
  	 * we cache all the lengths before starting the main loop -- but if all the
  	 * characters in both strings are single byte, then we skip this and use
*************** levenshtein_internal(text *s, text *t,
*** 286,369 ****
  	curr = prev + m;

  	/* Initialize the previous row to 0..cols */
! 	for (i = 0; i < m; i++)
! 		prev[i] = i * del_c;

  	/* Loop through rows of the notional array */
  	for (y = t_data, j = 1; j < n; j++)
  	{
  		int		   *temp;
- 		const char *x = s_data;
  		int			y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;

  		/*
! 		 * First cell must increment sequentially, as we're on the j'th row of
! 		 * the (m+1)x(n+1) array.
! 		 */
! 		curr[0] = j * ins_c;
!
! 		/*
! 		 * This inner loop is critical to performance, so we include a
  		 * fast-path to handle the
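The two bounds introduced by the hunks above can be sketched outside PostgreSQL. This is a hypothetical standalone version, not the patch itself; the names theor_max_d and min_d mirror the patch, and m, n are character counts with per-operation costs ins_c, del_c, sub_c.

```c
#include <assert.h>

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Upper bound on the distance: either delete all of s and insert all of
 * t, or substitute the whole shorter string and insert/delete the length
 * difference. As in the patch, 1 is subtracted so that a cutoff at or
 * above this value can simply be disabled (it could never fire). */
static int theor_max_d(int m, int n, int ins_c, int del_c, int sub_c)
{
    return min_int(m * del_c + n * ins_c,
                   (m > n) ? (n * sub_c + (m - n) * del_c)
                           : (m * sub_c + (n - m) * ins_c)) - 1;
}

/* Lower bound implied by the difference of string lengths alone. */
static int min_d(int m, int n, int ins_c, int del_c)
{
    int delta = m - n;

    if (delta > 0)
        return delta * del_c;   /* s is longer: at least delta deletions */
    if (delta < 0)
        return -delta * ins_c;  /* t is longer: at least -delta insertions */
    return 0;
}
```

If min_d already exceeds max_d, the whole dynamic program can be skipped, which is exactly what the early `return max_d + 1` in the patch does.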
Re: [HACKERS] multibyte charater set in levenshtein function
SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;
SELECT SUM(levenshtein(a, 'foo')) from words2;
SELECT SUM(levenshtein(a, 'дом')) FROM words2;
SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;

Before this patch the results were: 67 98 63 102 108 177
And after the patch: 65 100 65 104 111 182

It is a bit slower, but I think it is the price for having less code duplication. Now test for levenshtein_less_equal performance.

test=# SELECT * FROM words WHERE levenshtein(a, 'extensize') <= 2;
     a
-----------
 expensive
 extension
 extensive
(3 rows)
Time: 98,083 ms

test=# SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2) <= 2;
     a
-----------
 expensive
 extension
 extensive
(3 rows)
Time: 48,069 ms

test=# SELECT * FROM words2 WHERE levenshtein(a, 'клубничный') <= 2;
     a
------------
 кузничный
 кубичный
 клубочный
 клубничный
(4 rows)
Time: 197,499 ms

test=# SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3) <= 2;
     a
------------
 кузничный
 кубичный
 клубочный
 клубничный
(4 rows)
Time: 100,073 ms

It gives some speedup for searches on words with small distances.

test=# SELECT sum(levenshtein(a, 'extensize')) from words;
  sum
--------
 835545
(1 row)
Time: 96,253 ms

test=# SELECT sum(levenshtein_less_equal(a, 'extensize', 20)) from words;
  sum
--------
 835545
(1 row)
Time: 98,438 ms

With big distances it gives similar performance because in this case a similar branch of the code is used. In order to test it with longer words I created a table with random combinations of words.
test=# create table phrases (a text primary key);
test=# insert into phrases select string_agg(y.a, ' ') from (select x.a, row_number() over () from (select a from words order by random()) x) y group by (y.row_number / 10);
test=# select * from phrases limit 10;
a
--
hosiery's spermatozoa Haleakala disco greenness beehive paranoids atrophy unfair Weinberg's all profanity's individualized bellowed perishables Romanov's communicant's landlady's pauperized Ito seasoned jawbreakers you'll hurling's showcasing anecdota cauliflower intellectualism facilitate infantryman's busheled designing lucid nutritionists Aesculapius's rifle clefting exult Milosevic foresight lurking capitulation's pantsuits traumatizes moonlighting lancer's Longfellow rising unclasps permutes centralest cavalryman Dwight granddaughter knight tallyho's sober graduate locket's knucklehead courtliest sapphires baize coniferous emolument antarctic Laocoon's deadens unseemly Tartars utopians Pekingese's Cartwright badge's Dollie's spryness Ajax's Amber's bookkeeper spouting concordant Indus carbonation cinnamon's stockbrokers Evita's Canaries Waldorf's rampage Amparo exertions Kuznetsk's divots humongous wolverine chugged laurels Goliaths hazelnut
(10 rows)

test=# select * from phrases where levenshtein('nucklehead courtliest sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a) <= 10;
 a
--
 knucklehead courtliest sapphires baize coniferous emolument antarctic Laocoon's deadens unseemly
(1 row)
Time: 581,850 ms

test=# select * from phrases where levenshtein_less_equal('nucklehead courtliest sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a, 10) <= 10;
 a
--
 knucklehead courtliest sapphires baize coniferous emolument antarctic Laocoon's deadens unseemly
(1 row)
Time: 22,876 ms

It gives a great speedup for searches with small distances on relatively long strings.
test=# select sum(levenshtein('nucklehead courtliest sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a)) from phrases;
  sum
--------
 794601
(1 row)
Time: 571,138 ms

test=# select sum(levenshtein_less_equal('nucklehead courtliest sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a, 100)) from phrases;
  sum
--------
 794601
(1 row)
Time: 562,895 ms

Similar results appear for multi-byte strings.

test=# create table phrases2 (a text primary key);
test=# insert into phrases2 select string_agg(y.a, ' ') from (select x.a, row_number() over () from (select a from words2 order by random()) x) y group by (y.row_number / 10);
INSERT 0 14584
test=# select * from phrases2 limit 10;
a
---
борис спиннинг растопочный цифра выдохнуть иридий гнёздышко омоновский базовый пожжёте закосить насыщающий паратый продрых обеспложивание милливатт бамбуковый отпекающий книжница приложиться разный устремляющийся отучающийся абрикосовка
Re: [HACKERS] multibyte charater set in levenshtein function
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote: I reviewed this code in a fair amount of detail today and ended up rewriting it. In general terms, it's best to avoid changing things that are not relevant to the central purpose of the patch. This patch randomly adds a whole bunch of whitespace and removes a whole bunch of comments which I find useful and do not want to see removed. It took me about an hour to reverse out all of those changes, which then let me get down to the business of actually analyzing the code. I'm sorry. This is my first patch, I will be more accurate next time. But, I think there is no unified opinion about some changes, like the replacement of !m with m == 0. I think this line is not correct:

if (m != s_bytes && n != t_bytes)

The correct negation of the assumption that both strings are single-byte is the assumption that at least one string is not single-byte. In this patch the levenshtein function can calculate the distance incorrectly:

test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)

This line should be rewritten as:

if (m != s_bytes || n != t_bytes)

The attached patch takes the approach of using a fast-path for just the innermost loop when neither string contains any multi-byte characters (regardless of whether or not the encoding allows multi-byte characters). It's still slower than CVS HEAD, but for strings containing no multi-byte characters it's only marginally, if at all, slower than previous releases, because 9.0 doesn't contain your fix to avoid the extra string copy; and it's faster than your implementation even when multi-byte characters ARE present. It is slightly slower on single-byte encoding databases (I tested SQL_ASCII), but still faster than previous releases. It's also quite a bit less code duplication. Can I look at the test which shows that your implementation is faster than mine when multi-byte characters are present? My tests show the reverse.
For example, with the Russian dictionary of OpenOffice:

With my version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)
Time: 277,190 ms

With your version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)
Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This function is very fast for long spans of memory, but for a few bytes an inline function is faster, because the compiler has more freedom for optimization. After replacing memcmp with an inline function like the following in your implementation:

static inline bool
char_cmp(const char *s, const char *d, int len)
{
	while (len--)
	{
		if (*s++ != *d++)
			return false;
	}
	return true;
}

the code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)
Time: 241,272 ms

With best regards, Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
Now I think patch is as good as can be. :) I'm going to prepare less-or-equal function in same manner as this patch. With best regards, Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
2010/8/2 Alexander Korotkov aekorot...@gmail.com: On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote: I reviewed this code in a fair amount of detail today and ended up rewriting it. In general terms, it's best to avoid changing things that are not relevant to the central purpose of the patch. This patch randomly adds a whole bunch of whitespace and removes a whole bunch of comments which I find useful and do not want to see removed. It took me about an hour to reverse out all of those changes, which then let me get down to the business of actually analyzing the code. I'm sorry. This is my first patch, I will be more accurate next time. But, I think there is no unified opinion about some changes, like the replacement of !m with m == 0. Sure; if that were the only such change, I wouldn't have mentioned it. I think this line is not correct: if (m != s_bytes && n != t_bytes) Good catch, fixed in the attached version. The attached patch takes the approach of using a fast-path for just the innermost loop when neither string contains any multi-byte characters (regardless of whether or not the encoding allows multi-byte characters). It's still slower than CVS HEAD, but for strings containing no multi-byte characters it's only marginally, if at all, slower than previous releases, because 9.0 doesn't contain your fix to avoid the extra string copy; and it's faster than your implementation even when multi-byte characters ARE present. It is slightly slower on single-byte encoding databases (I tested SQL_ASCII), but still faster than previous releases. It's also quite a bit less code duplication. Can I look at the test which shows that your implementation is faster than mine when multi-byte characters are present? My tests show the reverse. For example, with the Russian dictionary of OpenOffice: Hmm, with the correction you mention above, I'm getting about the same results with multi-byte characters (but faster with only single-byte characters).
I did this: CREATE TABLE words (a text not null primary key); COPY words from '/usr/share/dict/words'; SELECT SUM(levenshtein(a, 'foo')) from words; SELECT SUM(levenshtein(a, 'Urbański')) FROM words; SELECT SUM(levenshtein(a, 'ańs')) FROM words; With the updated patch attached below, these ran about 102 ms, 222 ms, 129 ms. With your patch, I get about 134 ms, 222 ms, 130 ms. Perhaps this is because I only have multi-byte characters in one of the inputs, not both. I don't have a dictionary handy with multi-byte words in it. Can you send me a file? I think that the cause of the slowdown is the memcmp function. This function is very fast for long spans of memory, but for a few bytes an inline function is faster, because the compiler has more freedom for optimization. After replacing memcmp with an inline function like the following in your implementation: Yeah, that does seem to be slightly better. I've done a version of this in the attached patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company mb_levenshtein_rmh-v2.patch Description: Binary data
Re: [HACKERS] multibyte charater set in levenshtein function
2010/8/2 Alexander Korotkov aekorot...@gmail.com: The dump of the table with the Russian dictionary is in the attachment. I use the following tests:

SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;
SELECT SUM(levenshtein(a, 'foo')) from words2;
SELECT SUM(levenshtein(a, 'дом')) FROM words2;
SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;

With your last version of the patch the results are: 63ms 94ms 61ms 100ms 121ms 217ms. But I found another optimization. With this condition:

if (x[x_char_len - 1] == y[y_char_len - 1] && x_char_len == y_char_len &&
	(x_char_len == 1 || char_same(x, y, x_char_len - 1)))

the results become: 58ms 96ms 63ms 102ms 107ms 171ms. On single-byte characters the results almost didn't change, but they get better with multi-byte characters. Generally, this improvement is because the first bytes of different characters are frequently the same (for example, almost all Cyrillic characters start with the same byte, and I think that there is a similar situation in other alphabets), but a match of the last bytes is just a coincidence. Hey, that's really cool. It helps a lot here, too:

My previous patch, with your 5 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same 5 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last character first, same 5 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company mb_levenshtein_rmh-v3.patch Description: Binary data
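The reordered comparison discussed above can be isolated into a small self-contained sketch. This is an illustrative stand-in, not the patch code: mb_char_eq is a hypothetical name, and char_same mirrors the helper the condition calls. It checks the final byte first because, as noted, distinct characters of one alphabet often share their leading bytes in UTF-8 (most Cyrillic characters begin with 0xD0 or 0xD1), so the last byte discriminates earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Compare a prefix of two multibyte characters byte by byte. */
static bool char_same(const char *s, const char *t, int len)
{
    while (len--)
        if (*s++ != *t++)
            return false;
    return true;
}

/* Reordered equality test: last byte first, then lengths, then the rest.
 * For single-byte characters only the first test does any work. */
static bool mb_char_eq(const char *x, int x_char_len,
                       const char *y, int y_char_len)
{
    return x[x_char_len - 1] == y[y_char_len - 1] &&
           x_char_len == y_char_len &&
           (x_char_len == 1 || char_same(x, y, x_char_len - 1));
}
```

For the UTF-8 characters 'д' (0xD0 0xB4) and 'о' (0xD0 0xBE), the first bytes match but the last bytes differ, so the reordered test rejects the pair on its very first comparison.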
Re: [HACKERS] multibyte charater set in levenshtein function
On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov aekorot...@gmail.com wrote: Now I think patch is as good as can be. :) OK, committed. I'm going to prepare less-or-equal function in same manner as this patch. Sounds good. Since we're now more than half-way through this CommitFest and this patch has undergone quite a bit of change from what we started out with, I'd like to ask that you submit the new patch for the next CommitFest, to begin September 15th. https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] multibyte charater set in levenshtein function
On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov aekorot...@gmail.com wrote: Ok, here is the patch for multi-byte characters. I changed the arguments of the levenshtein_internal function from text * to const char * and int. I think that it makes levenshtein_internal more reusable. For example, this function can be used for calculating the distance of two fragments of strings without the need to copy these fragments. Actually, I'm not fully satisfied with my solution for merging the single-byte and multi-byte versions of the function. There are ifdef's in the body of the function and non-context-free macros. This is because the multi-byte version needs to store the character lengths returned by pg_mblen while the single-byte version doesn't need that. I tried a multi-byte version without storing character lengths, but in this case the function is about 1.5 times slower (the function needs to call pg_mblen n*m times). I reviewed this code in a fair amount of detail today and ended up rewriting it. In general terms, it's best to avoid changing things that are not relevant to the central purpose of the patch. This patch randomly adds a whole bunch of whitespace and removes a whole bunch of comments which I find useful and do not want to see removed. It took me about an hour to reverse out all of those changes, which then let me get down to the business of actually analyzing the code. The biggest problem I found is that, as coded, this is more than 50% slower on UTF-8 databases, because the multi-byte path is really slow, even if there are actually no multi-byte characters present. The attached patch takes the approach of using a fast-path for just the innermost loop when neither string contains any multi-byte characters (regardless of whether or not the encoding allows multi-byte characters).
It's still slower than CVS HEAD, but for strings containing no multi-byte characters it's only marginally, if at all, slower than previous releases, because 9.0 doesn't contain your fix to avoid the extra string copy; and it's faster than your implementation even when multi-byte characters ARE present. It is slightly slower on single-byte encoding databases (I tested SQL_ASCII), but still faster than previous releases. It's also quite a bit less code duplication. Please review and let me know your thoughts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company mb_levenshtein_rmh.patch Description: Binary data
Re: [HACKERS] multibyte charater set in levenshtein function
I forgot attribution in levenshtein.c file. With best regards, Alexander Korotkov. fuzzystrmatch-0.5.1.tar.gz Description: GNU Zip compressed data
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, Jul 21, 2010 at 5:59 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote: *scratches head* Aren't you just moving the same call to a different place? So, where you can find this different place? :) In this patch null-terminated strings are not used at all. I can't. You win. :-) Actually, I wonder if there's enough performance improvement there that we might think about extracting that part of the patch and apply it separately. Then we could continue trying to figure out what to do with the rest. Sometimes it's simpler to deal with one change at a time. I tested this today and the answer was a resounding yes. I ran sum(levenshtein(t, 'foo')) over a dictionary file with about 2 million words and got a speedup of around 15% just by eliminating the text_to_cstring() calls. So I've committed that part of this patch. I'll try to look at the rest of the patch when I get a chance, but I'm wondering if it might make sense to split it into two patches - specifically, one patch to handle multi-byte characters correctly, and then a second patch for the less-than-or-equal-to functions. I think that might simplify reviewing a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] multibyte charater set in levenshtein function
Here is a new version of my patch. There are the following changes:

1) I've merged the single-byte and multi-byte versions of levenshtein_internal and levenshtein_less_equal_internal using macros and includes.

2) I found that levenshtein takes reasonable time even for long strings. Here is an example with strings which contain random combinations of words.

test=# select count(*) from words;
 count
-------
 98569
(1 row)

test=# select * from words limit 10;
  id   | word                                                                                | next_id
-------+-------------------------------------------------------------------------------------+---------
   200 | Agnew diodes drilled composts Wassermann nonprofit adjusting Chautauqua            |   17370
  4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither                     |   70983
  5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal punctuality              |   96685
  6103 | G prompt boron overthrew cricking begonia snuffers blazed                           |   34518
  4707 | Djibouti Mari pencilling approves pointing's pashas magpie rams                     |   71200
 10125 | Lyle Edgardo livers gruelled capable cancels gnawing's inhospitable                 |   28018
  9615 | Leos deputy evener yogis assault palatals Octobers surceased                        |   21878
 15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier Algeria               |   19316
 15776 | Sucre battle's misapprehension's predicating nutmegged electrification bowl planks  |   65739
 16878 | Updike Forster ragout keenly exhalation grayed switch guava's                       |   43013
(10 rows)
Time: 26,125 ms
Time: 137,061 ms

test=# select avg(length(word)) from words;
         avg
---------------------
 74.6052207083362923
(1 row)
Time: 96,415 ms

test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher') <= 10;
  id  | word                                                            | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)
Time: 2957,426 ms

test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher', 10) <= 10;
  id  | word                                                            | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)
Time: 84,755 ms

It takes O(max(n, m) * max_d / min(ins_c, del_c)) in the worst case. That's why I changed the limitation of this function to max_d <= 255 * min(ins_c, del_c).

3) I found that there is no need for storing character offsets in levenshtein_less_equal_internal_mb. Instead, the first position in the string where the value of distance wasn't greater than max_d can be stored.

4) I found that there is a theoretical maximum distance based on string lengths. It represents the case when no characters are matching. If the theoretical maximum distance is less than the given maximum distance we can use the theoretical maximum distance. It improves the behavior of levenshtein_less_equal with high max_d, but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher', 1000) <= 10;
  id  | word                                                            | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)
Time: 4102,731 ms

test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher') <= 10;
  id  | word                                                            | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)
Time: 2983,244 ms

With best regards, Alexander Korotkov. fuzzystrmatch-0.5.tar.gz Description: GNU Zip compressed data
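The core idea behind levenshtein_less_equal can be shown with a simplified, single-byte sketch. This is not the banded algorithm from the patch; it is a plain dynamic program with unit costs that bails out as soon as every cell in the current row exceeds max_d. With non-negative costs, values along any path through the matrix never decrease, so once the row minimum is above the limit the final distance must be too. The name lev_less_equal and the max_d < 0 "no limit" convention mirror the discussion here but are otherwise hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Unit-cost Levenshtein distance with an early exit: returns the exact
 * distance if max_d < 0 or the distance is <= max_d; otherwise returns
 * some value greater than max_d. */
static int lev_less_equal(const char *s, const char *t, int max_d)
{
    int m = (int) strlen(s), n = (int) strlen(t);
    int *prev = malloc((m + 1) * sizeof(int));
    int *curr = malloc((m + 1) * sizeof(int));
    int i, j, d;

    for (i = 0; i <= m; i++)
        prev[i] = i;                /* first row: i deletions */
    for (j = 1; j <= n; j++)
    {
        int row_min;

        curr[0] = j;                /* first column: j insertions */
        row_min = curr[0];
        for (i = 1; i <= m; i++)
        {
            int ins = prev[i] + 1;
            int del = curr[i - 1] + 1;
            int sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);

            curr[i] = ins < del ? ins : del;
            if (sub < curr[i])
                curr[i] = sub;
            if (curr[i] < row_min)
                row_min = curr[i];
        }
        /* Every later cell derives from this row by non-negative
         * increments, so if the whole row is above max_d, give up. */
        if (max_d >= 0 && row_min > max_d)
        {
            free(prev);
            free(curr);
            return max_d + 1;
        }
        memcpy(prev, curr, (m + 1) * sizeof(int));
    }
    d = prev[m];
    free(prev);
    free(curr);
    return d;
}
```

The early exit is why the function is much faster for long strings with a small max_d: only about max_d rows past the point of divergence are ever computed, matching the big speedup reported above.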
Re: [HACKERS] multibyte charater set in levenshtein function
Excerpts from Alexander Korotkov's message of Thu Jul 22 03:21:57 -0400 2010: On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote: Ah, I see. That's pretty compelling, I guess. Although it still seems like a lot of code... I think there is a way to merge single-byte and multi-byte versions of functions without loss in performance using macros and includes (like in 'backend/utils/adt/like.c'). Do you think it is acceptable in this case? Using the trick of a single source file similar to like_match.c that implements all the different behaviors, and include that file more than once with different macro definitions, is probably a good idea.
Re: [HACKERS] multibyte charater set in levenshtein function
On Thu, Jul 22, 2010 at 3:21 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote: Ah, I see. That's pretty compelling, I guess. Although it still seems like a lot of code... I think there is a way to merge single-byte and multi-byte versions of functions without loss in performance using macros and includes (like in 'backend/utils/adt/like.c'). Do you think it is acceptable in this case? I'm not sure. I'd like to get a second opinion from someone else on which way to go with this, but unfortunately 2 or 3 of the main committers are on vacation at the moment. Does anyone else have an opinion on what to do about this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] multibyte charater set in levenshtein function
Such a version with macros and includes can look like this:

#ifdef MULTIBYTE
#define NEXT_X (x += char_lens[i - 1])
#define NEXT_Y (y += y_char_len)
#define CMP (char_cmp(x, char_lens[i - 1], y, y_char_len))
#else
#define NEXT_X (x++)
#define NEXT_Y (y++)
#define CMP (*x == *y)
#endif

static int
levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)
{
	int			m, n, d;
	int		   *prev;
	int		   *curr;
	int			i, j;
	const char *x;
	const char *y;
#ifdef MULTIBYTE
	int		   *char_lens;
	int			y_char_len;
#endif

	/*
	 * We should calculate number of characters for multibyte encodings
	 */
	m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));
	n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));

	/*
	 * We can transform an empty s into t with n insertions, or a non-empty t
	 * into an empty s with m deletions.
	 */
	if (m == 0)
		return n * ins_c;
	if (n == 0)
		return m * del_c;

	/*
	 * For security concerns, restrict excessive CPU+RAM usage. (This
	 * implementation uses O(m) memory and has O(mn) complexity.)
	 */
	if (m > MAX_LEVENSHTEIN_STRLEN || n > MAX_LEVENSHTEIN_STRLEN)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("argument exceeds the maximum length of %d bytes",
						MAX_LEVENSHTEIN_STRLEN)));

	/* One more cell for initialization column and row. */
	++m;
	++n;

	/*
	 * Instead of building an (m+1)x(n+1) array, we'll use two different
	 * arrays of size m+1 for storing accumulated values. At each step one
	 * represents the previous row and one is the current row of the
	 * notional large array.
	 * For multibyte encoding we'll also store array of lengths of
	 * characters of first string in order to avoid great number of
	 * pg_mblen calls.
	 */
#ifdef MULTIBYTE
	prev = (int *) palloc(3 * m * sizeof(int));
	curr = prev + m;
	char_lens = prev + 2 * m;
	for (i = 0, x = VARDATA_ANY(s); i < m - 1; i++)
	{
		char_lens[i] = pg_mblen(x);
		x += char_lens[i];
	}
	char_lens[i] = 0;
#else
	prev = (int *) palloc(2 * m * sizeof(int));
	curr = prev + m;
#endif

	/* Initialize the previous row to 0..cols */
	for (i = 0; i < m; i++)
		prev[i] = i * del_c;

	/*
	 * For multibyte encoding we should increase x and y pointers by the
	 * results of pg_mblen function. Also we should use CHAR_CMP macros
	 * for character comparison.
	 */
	for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)
	{
		int		   *temp;

#ifdef MULTIBYTE
		y_char_len = pg_mblen(y);
#endif
		curr[0] = j * ins_c;

		for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)
		{
			int			ins;
			int			del;
			int			sub;

			ins = prev[i] + ins_c;
			del = curr[i - 1] + del_c;
			sub = prev[i - 1] + (CMP ? 0 : sub_c);
			curr[i] = Min(ins, del);
			curr[i] = Min(curr[i], sub);
		}

		temp = curr;
		curr = prev;
		prev = temp;
	}

	d = prev[m - 1];

	/*
	 * Because the final value was swapped from the previous row to the
	 * current row, that's where we'll find it.
	 */
	return d;
}

What do you think about it? With best regards, Alexander Korotkov.
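The include-twice idiom the sketch above relies on (as in like_match.c: one body, compiled once per macro configuration) can be approximated in a single file. This toy example is purely illustrative, with hypothetical names; the real approach puts the function body in a separate file and #includes it under different macro definitions, while here a function-like macro plays the role of that include file.

```c
#include <assert.h>

/* One "body", instantiated twice with different comparison macros --
 * the same shape as defining CMP differently before each #include. */
#define DEFINE_COUNT_EQUAL(NAME, CHAR_EQ)                        \
    static int NAME(const char *s, const char *t, int len)       \
    {                                                            \
        int i, count = 0;                                        \
        for (i = 0; i < len; i++)                                \
            if (CHAR_EQ(s[i], t[i]))                             \
                count++;                                         \
        return count;                                            \
    }

/* Two comparison behaviors, analogous to single-byte vs. multibyte CMP. */
#define EQ_EXACT(a, b)  ((a) == (b))
#define EQ_NOCASE(a, b) (((((a) >= 'A' && (a) <= 'Z') ? (a) + 32 : (a))) == \
                         ((((b) >= 'A' && (b) <= 'Z') ? (b) + 32 : (b))))

DEFINE_COUNT_EQUAL(count_equal_exact, EQ_EXACT)
DEFINE_COUNT_EQUAL(count_equal_nocase, EQ_NOCASE)
```

Each instantiation compiles to a separate specialized function, so neither variant pays a runtime branch for the behavior choice, which is the whole point of the trick.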
Re: [HACKERS] multibyte charater set in levenshtein function
On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote: Ah, I see. That's pretty compelling, I guess. Although it still seems like a lot of code... I think there is a way to merge single-byte and multi-byte versions of functions without loss in performance using macros and includes (like in 'backend/utils/adt/like.c'). Do you think it is acceptable in this case? With best regards, Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote: This patch still needs some work. It includes a bunch of stylistic changes that aren't relevant to the purpose of the patch. There's no reason that I can see to change the existing levenshtein_internal function to take text arguments instead of char *, or to change !m to m == 0 in existing code, or to change the whitespace in the comments of that function. All of those need to be reverted before we can consider committing this. I changed the arguments of the function from char * to text * in order to avoid a text_to_cstring call. The same benefit can be achieved by replacing char * with char * and a length. I changed !m to m == 0 because Itagaki asked me to make it conform to the coding style. Do you think there is no reason to fix coding style in existing code? There is a huge amount of duplicated code here. I think it makes sense to have the multibyte version of the function be separate, but perhaps we can merge the less-than-or-equal-to bits into the main code, so that we only have two copies instead of four. Perhaps we can just add a max_d argument (or perhaps max_distance) to levenshtein_internal; if this value is >= 0 then it represents the max allowable distance, but if it is < 0 then there is no limit. Sure, that might slow down the existing code a bit, but it might not be significant. I'd at least like to see some numbers showing that it is significant before we go to this much trouble. In this case we should add many checks of max_d to the levenshtein_internal function, which make the code more complex. Actually, we can merge all four functions into one function. But such a function will have many checks for multibyte encoding and max_d. So, I see four cases here: 1) one function with checks for multibyte encoding and max_d 2) two functions with checks for multibyte encoding 3) two functions with checks for max_d 4) four separate functions If you prefer case number 3 you should argue your position a little more.
The code doesn't follow the project coding style. Braces must be uncuddled. Comment paragraphs will be reflowed unless they begin and end with --. Function definitions should have the type declaration on one line and the function name at the start of the next. Freeing memory with pfree is likely a waste of time; is there any reason not to just rely on the memory context reset, as the original coding did? Ok, I'll fix these things. I think we might need to remove the acknowledgments section from this code. If everyone who touches this code adds their name, we're quickly going to have a mess. If we're not going to remove the acknowledgments section, then please add my name, too, because I've already patched this code once... In that case I think we can leave the original acknowledgments section. With best regards, Alexander Korotkov.
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote: This patch still needs some work. It includes a bunch of stylistic changes that aren't relevant to the purpose of the patch. There's no reason that I can see to change the existing levenshtein_internal function to take text arguments instead of char *, or to change !m to m == 0 in existing code, or to change the whitespace in the comments of that function. All of those need to be reverted before we can consider committing this. I changed the arguments of the function from char * to text * in order to avoid a text_to_cstring call. *scratches head* Aren't you just moving the same call to a different place? The same benefit can be achieved by replacing char * with char * and a length. I changed !m to m == 0 because Itagaki asked me to make it conform to the coding style. Do you think there is no reason to fix coding style in existing code? Yeah, we usually try to avoid changing that sort of thing in existing code, unless there's a very good reason. There is a huge amount of duplicated code here. I think it makes sense to have the multibyte version of the function be separate, but perhaps we can merge the less-than-or-equal-to bits into the main code, so that we only have two copies instead of four. Perhaps we can just add a max_d argument (or perhaps max_distance) to levenshtein_internal; if this value is >= 0 then it represents the max allowable distance, but if it is < 0 then there is no limit. Sure, that might slow down the existing code a bit, but it might not be significant. I'd at least like to see some numbers showing that it is significant before we go to this much trouble. In this case we should add many checks of max_d to the levenshtein_internal function, which make the code more complex. When you say many checks, how many? Actually, we can merge all four functions into one function.
But such a function will have many checks for multibyte encoding and max_d. So, I see four cases here: 1) one function with checks for multibyte encoding and max_d 2) two functions with checks for multibyte encoding 3) two functions with checks for max_d 4) four separate functions If you prefer case number 3 you should argue your position a little more. I'm somewhat convinced that separating the multibyte case out has a performance benefit, both by intuition and because you posted some numbers, but I haven't seen any argument for separating out the other case, so I'm asking whether you've checked, whether there is an effect, and whether it's significant. The default is always to try to avoid maintaining multiple copies of substantially identical code, due to the danger that a future patch might fail to update all of them and thus introduce a bug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote: *scratches head* Aren't you just moving the same call to a different place? So, where can you find this different place? :) In this patch null-terminated strings are not used at all. Yeah, we usually try to avoid changing that sort of thing in existing code, unless there's a very good reason. Ok. In this case we should add many checks of max_d to the levenshtein_internal function, which make the code more complex. When you say many checks, how many? Actually, we can merge all four functions into one function. But such a function will have many checks for multibyte encoding and max_d. So, I see four cases here: 1) one function with checks for multibyte encoding and max_d 2) two functions with checks for multibyte encoding 3) two functions with checks for max_d 4) four separate functions If you prefer case number 3 you should argue your position a little more. I'm somewhat convinced that separating the multibyte case out has a performance benefit both by intuition and because you posted some numbers, but I haven't seen any argument for separating out the other case, so I'm asking if you've checked and whether there is an effect and whether it's significant. The default is always to try to avoid maintaining multiple copies of substantially identical code, due to the danger that a future patch might fail to update all of them and thus introduce a bug. I've tested it with a big value of max_d and I thought it was evident that checking for a negative value of max_d would not produce a significant benefit. Anyway, I tried to add a check for negative max_d into the levenshtein_less_equal_mb function.
static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
								   int ins_c, int del_c, int sub_c, int max_d)
{
	int			m, n;
	int		   *prev;
	int		   *curr;
	int			i, j;
	const char *x;
	const char *y;
	CharLengthAndOffset *lengths_and_offsets;
	int			y_char_len;
	int			curr_left, curr_right, prev_left, prev_right, d;
	int			delta, min_d;

	/*
	 * We should calculate the number of characters for multibyte encodings.
	 */
	m = pg_mbstrlen_with_len(s, s_len);
	n = pg_mbstrlen_with_len(t, t_len);

	/*
	 * We can transform an empty s into t with n insertions, or a non-empty t
	 * into an empty s with m deletions.
	 */
	if (m == 0)
		return n * ins_c;
	if (n == 0)
		return m * del_c;

	/*
	 * We can find the minimal distance by the difference of lengths.
	 */
	delta = m - n;
	if (delta > 0)
		min_d = delta * del_c;
	else if (delta < 0)
		min_d = -delta * ins_c;
	else
		min_d = 0;
	if (max_d >= 0 && min_d > max_d)
		return max_d + 1;

	/*
	 * For security concerns, restrict excessive CPU+RAM usage. (This
	 * implementation uses O(m) memory and has O(mn) complexity.)
	 */
	if (m > MAX_LEVENSHTEIN_STRLEN ||
		n > MAX_LEVENSHTEIN_STRLEN)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("argument exceeds the maximum length of %d bytes",
						MAX_LEVENSHTEIN_STRLEN)));

	/* One more cell for initialization column and row. */
	++m;
	++n;

	/*
	 * Instead of building an (m+1)x(n+1) array, we'll use two different
	 * arrays of size m+1 for storing accumulated values. At each step one
	 * represents the previous row and one is the current row of the notional
	 * large array.  For multibyte encodings we'll also store an array of the
	 * byte lengths of the characters and an array of the character offsets
	 * in the first string, in order to avoid a great number of pg_mblen
	 * calls.
	 */
	prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m);
	curr = prev + m;
	lengths_and_offsets = (CharLengthAndOffset *) (prev + 2 * m);
	lengths_and_offsets[0].offset = 0;
	for (i = 0, x = s; i < m - 1; i++)
	{
		lengths_and_offsets[i].length = pg_mblen(x);
		lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
			lengths_and_offsets[i].length;
		x += lengths_and_offsets[i].length;
	}
	lengths_and_offsets[i].length = 0;

	/* Initialize the previous row to 0..cols */
	curr_left = 1;
	d = min_d;
	for (i = 0; i < delta; i++)
	{
		prev[i] = d;
	}
	curr_right = m;
	for (; i < m; i++)
	{
		prev[i] = d;
		d += (ins_c + del_c);
		if (max_d >= 0 && d > max_d)
		{
			curr_right = i;
			break;
		}
	}

	/*
	 * There are the following optimizations: 1) Actually the minimal
	 * possible value of the final distance (in the case of all possible
	 * matches) is stored in the cells of the matrix. In the case of movement
	 * towards the diagonal, which contains the last cell, the value
Re: [HACKERS] multibyte charater set in levenshtein function
Excerpts from Robert Haas's message of mié jul 21 14:25:47 -0400 2010: On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote: Same benefit can be achived by replacing char * with char * and length. I changed !m to m == 0 because Itagaki asked me to make it conforming coding style. Do you think there is no reason to fix coding style in existing code? Yeah, we usually try to avoid changing that sort of thing in existing code, unless there's a very good reason. I think fixing a stylistic issue in code that's being edited for other purposes is fine, and a good idea going forward. We wouldn't commit a patch that would *only* fix those, because that would cause a problem for backpatches for no benefit, but if the patch touches something else, then a backpatch of another patch is going to need manual intervention anyway.
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote: *scratches head* Aren't you just moving the same call to a different place? So, where can you find this different place? :) In this patch null-terminated strings are not used at all. I can't. You win. :-) Actually, I wonder if there's enough performance improvement there that we might think about extracting that part of the patch and applying it separately. Then we could continue trying to figure out what to do with the rest. Sometimes it's simpler to deal with one change at a time. I tested it with the american-english dictionary with 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
   sum
---------
 1074376
(1 row)
Time: 131,435 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer', 100)) from words;
   sum
---------
 1074376
(1 row)
Time: 221,078 ms

test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer', -1)) from words;
   sum
---------
 1074376
(1 row)
Time: 254,819 ms

The function with a negative value of max_d didn't become faster than with just a big value of max_d. Ah, I see. That's pretty compelling, I guess. Although it still seems like a lot of code... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] multibyte charater set in levenshtein function
2010/7/13 Alexander Korotkov aekorot...@gmail.com: Anyway I think that overhead is not ignorable. That's why I have split levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, and levenshtein_less_equal_internal into levenshtein_less_equal_internal and levenshtein_less_equal_internal_mb. Thank you for the good measurement! Then, it's reasonable to have multiple implementations. It also has documentation. I'll change the status of the patch to Ready for Committer. The patch is good enough except argument types for some functions. For example: - char* vs. const char* - text* vs. const char* + length I hope committers would check whether there are better types for them. -- Itagaki Takahiro
Re: [HACKERS] multibyte charater set in levenshtein function
On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro itagaki.takah...@gmail.com wrote: 2010/7/13 Alexander Korotkov aekorot...@gmail.com: Anyway I think that overhead is not ignorable. That's why I have split levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, and levenshtein_less_equal_internal into levenshtein_less_equal_internal and levenshtein_less_equal_internal_mb. Thank you for the good measurement! Then, it's reasonable to have multiple implementations. It also has documentation. I'll change the status of the patch to Ready for Committer. The patch is good enough except argument types for some functions. For example: - char* vs. const char* - text* vs. const char* + length I hope committers would check whether there are better types for them. This patch still needs some work. It includes a bunch of stylistic changes that aren't relevant to the purpose of the patch. There's no reason that I can see to change the existing levenshtein_internal function to take text arguments instead of char *, or to change !m to m == 0 in existing code, or to change the whitespace in the comments of that function. All of those need to be reverted before we can consider committing this. There is a huge amount of duplicated code here. I think it makes sense to have the multibyte version of the function be separate, but perhaps we can merge the less-than-or-equal-to bits into the main code, so that we only have two copies instead of four. Perhaps we can just add a max_d argument (or perhaps max_distance) to levenshtein_internal; if this value is >= 0 then it represents the max allowable distance, but if it is < 0 then there is no limit. Sure, that might slow down the existing code a bit, but it might not be significant. I'd at least like to see some numbers showing that it is significant before we go to this much trouble. The code doesn't follow the project coding style. Braces must be uncuddled. Comment paragraphs will be reflowed unless they begin and end with --.
Function definitions should have the type declaration on one line and the function name at the start of the next. Freeing memory with pfree is likely a waste of time; is there any reason not to just rely on the memory context reset, as the original coding did? I think we might need to remove the acknowledgments section from this code. If everyone who touches this code adds their name, we're quickly going to have a mess. If we're not going to remove the acknowledgments section, then please add my name, too, because I've already patched this code once... I'm going to set this back to Waiting on Author. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] multibyte charater set in levenshtein function
Hi! * levenshtein_internal() and levenshtein_less_equal_internal() are very similar. Can you merge the code? We can always use less_equal_internal() if the overhead is ignorable. Did you compare them? With a big value of max_d the overhead is significant. Here is an example on the american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum
---------
 1386456
(1 row)
Time: 195,083 ms

test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from words;
   sum
---------
 1386456
(1 row)
Time: 317,821 ms

* There are many if (!multibyte_encoding) in levenshtein_internal(). How about split the function into two funcs for single-byte chars and multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always use multi-byte version if the overhead is small. The multi-byte version was about 4 times slower. But I have rewritten my CHAR_CMP macro as an inline function, and now it's only about 1.5 times slower. In a database with a multi-byte encoding:

test=# select * from words where levenshtein('qweqweqwe',word) <= 5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)
Time: 136,742 ms

In a database with a single-byte encoding:

test2=# select * from words where levenshtein('qweqweqwe',word) <= 5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)
Time: 88,471 ms

Anyway I think that overhead is not ignorable. That's why I have split levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, and levenshtein_less_equal_internal into levenshtein_less_equal_internal and levenshtein_less_equal_internal_mb. * I prefer a struct rather than an array. 4 * m and 3 * m look like magic numbers for me. Could you name the entries with definition of a struct? /* * For multibyte encoding we'll also store array of lengths of * characters and array with character offsets in first string * in order to avoid great number of pg_mblen calls.
*/ prev = (int *) palloc(4 * m * sizeof(int)); In this line of code the memory is allocated for 4 arrays: prev, curr, offsets, char_lens. So I have joined offsets and char_lens into a struct. But I can't join prev and curr because of this trick: temp = curr; curr = prev; prev = temp; * There are some compiler warnings. Avoid them if possible. fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’: fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in this function fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in this function fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized in this function fuzzystrmatch.c: In function ‘levenshtein_internal’: fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in this function Fixed. * Coding style: Use if (m == 0) instead of if (!m) when the type of 'm' is an integer. Fixed. * Need to fix the caution in docs. http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html | Caution: At present, fuzzystrmatch does not work well with | multi-byte encodings (such as UTF-8). but now levenshtein supports multi-byte encodings! We should mention which function supports mbchars not to confuse users. I've updated this notice. Also I've added documentation for the levenshtein_less_equal function. * (Not an issue for the patch, but...) Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR? Unpacking versions make the core a bit faster. Fixed. With best regards, Alexander Korotkov. fuzzystrmatch-0.4.diff.gz Description: GNU Zip compressed data
Re: [HACKERS] multibyte charater set in levenshtein function
Hi, I'm reviewing Multibyte charater set in levenshtein function patch. https://commitfest.postgresql.org/action/patch_view?id=304 The main logic seems to be good, but I have some comments about the coding style and refactoring. * levenshtein_internal() and levenshtein_less_equal_internal() are very similar. Can you merge the code? We can always use less_equal_internal() if the overhead is ignorable. Did you compare them? * There are many if (!multibyte_encoding) in levenshtein_internal(). How about split the function into two funcs for single-byte chars and multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always use multi-byte version if the overhead is small. * I prefer a struct rather than an array. 4 * m and 3 * m look like magic numbers for me. Could you name the entries with definition of a struct? /* * For multibyte encoding we'll also store array of lengths of * characters and array with character offsets in first string * in order to avoid great number of pg_mblen calls. */ prev = (int *) palloc(4 * m * sizeof(int)); * There are some compiler warnings. Avoid them if possible. fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’: fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in this function fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in this function fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized in this function fuzzystrmatch.c: In function ‘levenshtein_internal’: fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in this function * Coding style: Use if (m == 0) instead of if (!m) when the type of 'm' is an integer. * Need to fix the caution in docs. http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html | Caution: At present, fuzzystrmatch does not work well with | multi-byte encodings (such as UTF-8). but now levenshtein supports multi-byte encoding! We should mention which function supports mbchars not to confuse users. 
* (Not an issue for the patch, but...) Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR? Unpacking versions make the core a bit faster. -- Itagaki Takahiro
Re: [HACKERS] multibyte charater set in levenshtein function
Hello Hackers! I have extended my patch by introducing the levenshtein_less_equal function. This function has an additional argument max_d and stops calculating when the distance exceeds max_d. With low values of max_d the function works much faster than the original one. An example of original levenshtein function usage:

test=# select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein(word, 'consistent') as dist from words where levenshtein(word, 'consistent') <= 2 order by dist;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=203.652..203.658 rows=5 loops=1)
   Sort Key: (levenshtein(word, 'consistent'::text))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=19.019..203.601 rows=5 loops=1)
         Filter: (levenshtein(word, 'consistent'::text) <= 2)
 Total runtime: 203.723 ms
(6 rows)

An example of levenshtein_less_equal usage in this case:

test=# select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
    word     | dist
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein_less_equal(word, 'consistent', 2) as dist from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by dist;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual time=42.198..42.203 rows=5 loops=1)
   Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual time=5.391..42.143 rows=5 loops=1)
         Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
 Total runtime: 42.292 ms
(6 rows)

In the example above levenshtein_less_equal works about 5 times faster. With best regards, Alexander Korotkov. fuzzystrmatch-0.3.diff.gz Description: GNU Zip compressed data
Re: [HACKERS] multibyte charater set in levenshtein function
On Thu, May 13, 2010 at 6:03 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Well, since it's only used in one place, why are you defining a macro at all? In order to structure the code better. My question was about something else: is the memcmp function a good choice for comparing very short sequences of bytes (from 1 to 4 bytes)?
Re: [HACKERS] multibyte charater set in levenshtein function
On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote: On a quick look, I didn't like the way you separated the pg_database_encoding_max_length() > 1 cases. There seem to be too much common code. Can that be refactored a bit better? I did a little refactoring in order to avoid some similar code. I'm not quite sure about my CHAR_CMP macro. Is it a good idea? fuzzystrmatch-0.2.diff.gz Description: GNU Zip compressed data
[HACKERS] multibyte charater set in levenshtein function
Hackers, The current version of the levenshtein function in the fuzzystrmatch contrib module doesn't work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           2
(1 row)

My patch makes this function work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           1
(1 row)

Also it avoids a text_to_cstring call. Regards, Alexander Korotkov. fuzzystrmatch.diff.gz Description: GNU Zip compressed data
Re: [HACKERS] multibyte charater set in levenshtein function
Excerpts from Alexander Korotkov's message of lun may 10 11:35:02 -0400 2010: Hackers, The current version of the levenshtein function in the fuzzystrmatch contrib module doesn't work properly with multibyte character sets. My patch makes this function work properly with multibyte character sets. Great. Please add it to the next commitfest: http://commitfest.postgresql.org On a quick look, I didn't like the way you separated the pg_database_encoding_max_length() > 1 cases. There seem to be too much common code. Can that be refactored a bit better?
Re: [HACKERS] multibyte charater set in levenshtein function
Alexander Korotkov escribió: On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote: On a quick look, I didn't like the way you separated the pg_database_encoding_max_length() > 1 cases. There seem to be too much common code. Can that be refactored a bit better? I did a little refactoring in order to avoid some similar code. I'm not quite sure about my CHAR_CMP macro. Is it a good idea? Well, since it's only used in one place, why are you defining a macro at all? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.