Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
Robert Haas <robertmh...@gmail.com> writes:
> I spent some time hacking on this. It doesn't appear to be too easy to
> get levenshtein_less_equal() working without slowing down plain old
> levenshtein() by about 6%.

Is that really enough slowdown to be worth contorting the code to avoid?
I've never heard of an application where the speed of this function was
the bottleneck.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
Excerpts from Tom Lane's message of Wed Oct 13 10:32:36 -0300 2010:
> Robert Haas <robertmh...@gmail.com> writes:
>> I spent some time hacking on this. It doesn't appear to be too easy to
>> get levenshtein_less_equal() working without slowing down plain old
>> levenshtein() by about 6%.
>
> Is that really enough slowdown to be worth contorting the code to avoid?
> I've never heard of an application where the speed of this function was
> the bottleneck.

What if it's used in an expression index on a large table?

--
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
Alvaro Herrera <alvhe...@commandprompt.com> writes:
> Excerpts from Tom Lane's message of Wed Oct 13 10:32:36 -0300 2010:
>> Robert Haas <robertmh...@gmail.com> writes:
>>> I spent some time hacking on this. It doesn't appear to be too easy to
>>> get levenshtein_less_equal() working without slowing down plain old
>>> levenshtein() by about 6%.
>>
>> Is that really enough slowdown to be worth contorting the code to
>> avoid? I've never heard of an application where the speed of this
>> function was the bottleneck.
>
> What if it's used in an expression index on a large table?

So? Expression indexes don't result in multiple evaluations of the
function. If anything, that context would probably be even less
sensitive to the function runtime than non-index use.

But the main point is that a 6% performance penalty in a non-core
function is well below my threshold of pain. If it were important enough
to care about that kind of performance difference, it'd be in core. I'd
rather see us keeping the code simple, short, and maintainable.

			regards, tom lane
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
On Wed, Oct 13, 2010 at 10:51 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvhe...@commandprompt.com> writes:
>> Excerpts from Tom Lane's message of Wed Oct 13 10:32:36 -0300 2010:
>>> Robert Haas <robertmh...@gmail.com> writes:
>>>> I spent some time hacking on this. It doesn't appear to be too easy
>>>> to get levenshtein_less_equal() working without slowing down plain
>>>> old levenshtein() by about 6%.
>>>
>>> Is that really enough slowdown to be worth contorting the code to
>>> avoid? I've never heard of an application where the speed of this
>>> function was the bottleneck.
>>
>> What if it's used in an expression index on a large table?
>
> So? Expression indexes don't result in multiple evaluations of the
> function. If anything, that context would probably be even less
> sensitive to the function runtime than non-index use.
>
> But the main point is that a 6% performance penalty in a non-core
> function is well below my threshold of pain. If it were important
> enough to care about that kind of performance difference, it'd be in
> core. I'd rather see us keeping the code simple, short, and
> maintainable.

Well, then you have to wonder whether it's worth having the
less-than-or-equal-to version around at all. That's only about 2x
faster on the same test case. I do think it's likely that people who
call this function will call it many times, however - e.g. trying to
find the closest matches from a dictionary for a given input string -
so the worry about performance doesn't seem totally out of place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
Robert Haas <robertmh...@gmail.com> writes:
> On Wed, Oct 13, 2010 at 10:51 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> But the main point is that a 6% performance penalty in a non-core
>> function is well below my threshold of pain.
>
> Well, then you have to wonder whether it's worth having the
> less-than-or-equal-to version around at all. That's only about 2x
> faster on the same test case.

Same test case? I thought they did different things?

> I do think it's likely that people who call this function will call it
> many times, however - e.g. trying to find the closest matches from a
> dictionary for a given input string - so the worry about performance
> doesn't seem totally out of place.

No doubt, but the actual function runtime is only one component of the
cost of applying it to a lot of dictionary entries --- I would think
that the table read costs are the larger component anyway.

			regards, tom lane
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
On Wednesday 13 October 2010 16:18:01 Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of Wed Oct 13 10:32:36 -0300 2010:
>> Robert Haas <robertmh...@gmail.com> writes:
>>> I spent some time hacking on this. It doesn't appear to be too easy
>>> to get levenshtein_less_equal() working without slowing down plain
>>> old levenshtein() by about 6%.
>>
>> Is that really enough slowdown to be worth contorting the code to
>> avoid? I've never heard of an application where the speed of this
>> function was the bottleneck.
>
> What if it's used in an expression index on a large table?

It's hard to use it in a sensible expression index, given that you use
it to calculate the difference between two strings. What's more
important is that it's used for sorting the results of a query - where
it's more important that it's fast.

Andres
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
On Wed, Oct 13, 2010 at 11:42 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Wed, Oct 13, 2010 at 10:51 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> But the main point is that a 6% performance penalty in a non-core
>>> function is well below my threshold of pain.
>>
>> Well, then you have to wonder whether it's worth having the
>> less-than-or-equal-to version around at all. That's only about 2x
>> faster on the same test case.
>
> Same test case? I thought they did different things?

levenshtein_less_equal(a, b, max_d) returns the same value as
levenshtein(a, b) if levenshtein(a, b) <= max_d. Otherwise it returns
max_d + 1. So it's the same test case with a small distance bound (2)
applied. As Alexander says, the value of levenshtein_less_equal
accelerates pretty rapidly when long strings are involved, so it seems
worth having, but I'm not sure I agree that the slowdown to the basic
function is negligible. It is not really all that much #ifdef hackery
to avoid it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
> No doubt, but the actual function runtime is only one component of the
> cost of applying it to a lot of dictionary entries --- I would think
> that the table read costs are the larger component anyway.

The data domain can be not only a dictionary but also something like
article titles, URLs and so on. On such relatively long strings (about
100 characters and more) this component will be significant (especially
if most of the table is lying in cache). In this case the search for
near strings can be accelerated by more than 10 times. I think that
this use case justifies the presence of a separate
levenshtein_less_equal function.

With best regards,
Alexander Korotkov.
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
> In this case the search for near strings can be accelerated by more
> than 10 times.

I mean that this component of the function runtime can be accelerated by
more than 10 times. Of course, the actual overall speedup depends on
many other factors, but I believe that it can also be significant.

With best regards,
Alexander Korotkov.
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
On Fri, Oct 8, 2010 at 12:51 PM, Alexander Korotkov
<aekorot...@gmail.com> wrote:
> Sorry, I'm pretty unconversant in git. Now, it should be ok.

I spent some time hacking on this. It doesn't appear to be too easy to
get levenshtein_less_equal() working without slowing down plain old
levenshtein() by about 6%. The slowdown was similar on the 0.2 version
of the patch, the 0.3.1 version of the patch, and another version I put
together that's a bit different from either. The test query was:

SELECT * FROM words WHERE levenshtein(a, 'extensize') <= 2;

In the version I was experimenting with, I had a large chunk of code
that looked like this:

if (max_d >= 0)
{
    /* Do stuff */
}

Removing that block of code buys back most of the performance loss that
occurs in the case where max_d < 0. I have to conclude that sticking
more stuff into the main loop defeats some optimization opportunities
that would otherwise be available, even in the case where the branch
isn't taken. So I'm tempted to employ the previously-discussed
sledgehammer of compiling levenshtein_internal twice, with and without
support for max_d. A patch taking this approach is attached. I'm not
totally sold on this approach, so constructive suggestions are welcome.
Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

levenshtein_less_equal-0.4.patch
Description: Binary data
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
Sorry, I'm pretty unconversant in git. Now, it should be ok.

With best regards,
Alexander Korotkov.

*** a/contrib/fuzzystrmatch/fuzzystrmatch.c
--- b/contrib/fuzzystrmatch/fuzzystrmatch.c
***************
*** 61,66 ****
--- 61,68 ----
  PG_MODULE_MAGIC;
   */
  extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
  extern Datum levenshtein(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
  extern Datum metaphone(PG_FUNCTION_ARGS);
  extern Datum soundex(PG_FUNCTION_ARGS);
  extern Datum difference(PG_FUNCTION_ARGS);
***************
*** 92,98 ****
  soundex_code(char letter)

  #define MAX_LEVENSHTEIN_STRLEN 255

  static int levenshtein_internal(text *s, text *t,
! 				int ins_c, int del_c, int sub_c);

  /*
--- 94,100 ----
  #define MAX_LEVENSHTEIN_STRLEN 255

  static int levenshtein_internal(text *s, text *t,
! 				int ins_c, int del_c, int sub_c, int max_d);

  /*
***************
*** 202,211 ****
  rest_of_char_same(const char *s1, const char *s2, int len)
   * between supplied strings. Generally
   * (1, 1, 1) penalty costs suffices common
   * cases, but your mileage may vary.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c)
  {
  	int		m,
  			n,
--- 204,217 ----
   * between supplied strings. Generally
   * (1, 1, 1) penalty costs suffices common
   * cases, but your mileage may vary.
+  * Returns accurate value if max_d < 0 or
+  * actual distance is less than or equal to
+  * max_d, otherwise returns a value greater
+  * than max_d.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 					 int ins_c, int del_c, int sub_c, int max_d)
  {
  	int		m,
  			n,
***************
*** 219,224 ****
--- 225,233 ----
  	const char *s_data;
  	const char *t_data;
  	const char *y;
+ 	const char *prev_x = NULL;
+ 	int		min_i = 0, max_i = 0, d;
+ 	int		delta = 0, min_d = 0, theor_max_d;

  	/* Extract a pointer to the actual character data. */
  	s_data = VARDATA_ANY(s);
***************
*** 240,245 ****
--- 249,268 ----
  		return m * del_c;

  	/*
+ 	 * There is a theoretical maximum distance based on the string lengths.
+ 	 * It represents the case when no characters are matching. If max_d
+ 	 * reaches this value then we can use the accurate calculation of the
+ 	 * distance, which is faster in this case.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		theor_max_d = Min(m * del_c + n * ins_c, (m > n) ?
+ 			(n * sub_c + (m - n) * del_c) : (m * sub_c + (n - m) * ins_c)) - 1;
+ 		if (max_d >= theor_max_d)
+ 			max_d = -1;
+ 	}
+
+ 	/*
  	 * For security concerns, restrict excessive CPU+RAM usage. (This
  	 * implementation uses O(m) memory and has O(mn) complexity.)
  	 */
***************
*** 251,256 ****
--- 274,296 ----
  				MAX_LEVENSHTEIN_STRLEN)));

  	/*
+ 	 * We can find the minimal distance by the difference of string lengths.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		delta = m - n;
+ 		if (delta > 0)
+ 			min_d = delta * del_c;
+ 		else if (delta < 0)
+ 			min_d = -delta * ins_c;
+ 		else
+ 			min_d = 0;
+
+ 		if (min_d > max_d)
+ 			return max_d + 1;
+ 	}
+
+ 	/*
  	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
  	 * we cache all the lengths before starting the main loop -- but if all the
  	 * characters in both strings are single byte, then we skip this and use
***************
*** 298,309 ****
  	 */
  	for (i = 0; i < m; i++)
  		prev[i] = i * del_c;

  	/* Loop through rows of the notional array */
  	for (y = t_data, j = 1; j < n; j++)
  	{
  		int	   *temp;
- 		const char *x = s_data;
  		int		y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;

  		/*
--- 338,372 ----
  	 */
  	for (i = 0; i < m; i++)
  		prev[i] = i * del_c;
+
+ 	/*
+ 	 * If we have a limitation of max_d, then we'll maintain the
+ 	 * [min_i; max_i] interval, which bounds the cells where the sum of the
+ 	 * cell value and the smallest possible residual cost is less than or
+ 	 * equal to max_d (we don't include index 0 in this interval). The
+ 	 * residual cost represents the cost of insertions or deletions for
+ 	 * moving to the diagonal which contains the bottom right cell.
+ 	 * The sum value preserves an important property of the original
+ 	 * matrix: this sum for a cell is always greater than or equal to such
+ 	 * sums for the cells which are used for its calculation. That's why
+ 	 * this sum can be used to bound the interval.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		min_i = 1;
+ 		max_i = 1;
+ 		while (max_i < m && prev[max_i] +
+ 			   ((delta - max_i > 0) ? (delta - max_i) * del_c :
+ 				(-delta + max_i) * ins_c) <= max_d)
+ 		{
+ 			max_i++;
+ 		}
+ 		max_i--;
+ 		prev_x = s_data;
+ 	}

  	/* Loop through rows of the notional array */
  	for (y = t_data, j = 1; j < n; j++)
  	{
  		int
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
2010/10/4 Alexander Korotkov <aekorot...@gmail.com>:
> I've reworked the patch with your suggestion. In this version I found a
> little slowdown in comparison with the previous version:
>
> SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2) <= 2;
> 48,069 ms => 57,875 ms
>
> SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3) <= 2;
> 100,073 ms => 113,975 ms
>
> select * from phrases where levenshtein_less_equal('nucklehead courtliest
> sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly',
> a, 10) <= 10;
> 22,876 ms => 24,721 ms
>
> test=# select * from phrases2 where levenshtein_less_equal('таяй
> раскупорившийся передислоцируется юлианович праздничный лачужка присыхать
> опппливший ффехтовальный добряющий', a, 10) <= 10;
> 55,405 ms => 57,760 ms
>
> I think it is caused by the multiplication operation for each bound
> movement. Probably this slowdown is ignorable, or there is some way to
> achieve the same performance.

This patch doesn't apply cleanly. It also seems to revert some recent
commits to fuzzystrmatch.c. Can you please send a corrected version?

[rhaas pgsql]$ patch -p1 < ~/Downloads/levenshtein_less_equal-0.3.patch
patching file contrib/fuzzystrmatch/fuzzystrmatch.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 5.
Hunk #8 FAILED at 317.
Hunk #9 succeeded at 543 (offset 10 lines).
Hunk #10 succeeded at 567 (offset 10 lines).
Hunk #11 succeeded at 578 (offset 10 lines).
2 out of 11 hunks FAILED -- saving rejects to file
contrib/fuzzystrmatch/fuzzystrmatch.c.rej
patching file contrib/fuzzystrmatch/fuzzystrmatch.sql.in
patching file doc/src/sgml/fuzzystrmatch.sgml

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
I've reworked the patch with your suggestion. In this version I found a
little slowdown in comparison with the previous version:

SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2) <= 2;
48,069 ms => 57,875 ms

SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3) <= 2;
100,073 ms => 113,975 ms

select * from phrases where levenshtein_less_equal('nucklehead courtliest
sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly',
a, 10) <= 10;
22,876 ms => 24,721 ms

test=# select * from phrases2 where levenshtein_less_equal('таяй
раскупорившийся передислоцируется юлианович праздничный лачужка присыхать
опппливший ффехтовальный добряющий', a, 10) <= 10;
55,405 ms => 57,760 ms

I think it is caused by the multiplication operation for each bound
movement. Probably this slowdown is ignorable, or there is some way to
achieve the same performance.

With best regards,
Alexander Korotkov.

diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index fe5fa88..7739431 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -5,7 +5,7 @@
 *
 * Joe Conway <m...@joeconway.com>
 *
- * $PostgreSQL$
+ * contrib/fuzzystrmatch/fuzzystrmatch.c
 * Copyright (c) 2001-2010, PostgreSQL Global Development Group
 * ALL RIGHTS RESERVED;
 *
@@ -61,6 +61,8 @@ PG_MODULE_MAGIC;
 */
 extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
 extern Datum levenshtein(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
 extern Datum metaphone(PG_FUNCTION_ARGS);
 extern Datum soundex(PG_FUNCTION_ARGS);
 extern Datum difference(PG_FUNCTION_ARGS);
@@ -92,7 +94,7 @@ soundex_code(char letter)
 #define MAX_LEVENSHTEIN_STRLEN 255

 static int levenshtein_internal(text *s, text *t,
-				int ins_c, int del_c, int sub_c);
+				int ins_c, int del_c, int sub_c, int max_d);

 /*
@@ -202,10 +204,14 @@ rest_of_char_same(const char *s1, const char *s2, int len)
 * between supplied strings. Generally
 * (1, 1, 1) penalty costs suffices common
 * cases, but your mileage may vary.
+ * Returns accurate value if max_d < 0 or
+ * actual distance is less than or equal to
+ * max_d, otherwise returns a value greater
+ * than max_d.
 */
 static int
 levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c)
+					 int ins_c, int del_c, int sub_c, int max_d)
 {
 	int		m,
 			n,
@@ -219,6 +225,9 @@ levenshtein_internal(text *s, text *t,
 	const char *s_data;
 	const char *t_data;
 	const char *y;
+	const char *prev_x = NULL;
+	int		min_i = 0, max_i = 0, d;
+	int		delta = 0, min_d = 0, theor_max_d;

 	/* Extract a pointer to the actual character data. */
 	s_data = VARDATA_ANY(s);
@@ -240,6 +249,20 @@ levenshtein_internal(text *s, text *t,
 		return m * del_c;

 	/*
+	 * There is a theoretical maximum distance based on the string lengths.
+	 * It represents the case when no characters are matching. If max_d
+	 * reaches this value then we can use the accurate calculation of the
+	 * distance, which is faster in this case.
+	 */
+	if (max_d >= 0)
+	{
+		theor_max_d = Min(m * del_c + n * ins_c, (m > n) ?
+			(n * sub_c + (m - n) * del_c) : (m * sub_c + (n - m) * ins_c)) - 1;
+		if (max_d >= theor_max_d)
+			max_d = -1;
+	}
+
+	/*
 	 * For security concerns, restrict excessive CPU+RAM usage. (This
 	 * implementation uses O(m) memory and has O(mn) complexity.)
 	 */
@@ -251,6 +274,23 @@ levenshtein_internal(text *s, text *t,
 				MAX_LEVENSHTEIN_STRLEN)));

 	/*
+	 * We can find the minimal distance by the difference of string lengths.
+	 */
+	if (max_d >= 0)
+	{
+		delta = m - n;
+		if (delta > 0)
+			min_d = delta * del_c;
+		else if (delta < 0)
+			min_d = -delta * ins_c;
+		else
+			min_d = 0;
+
+		if (min_d > max_d)
+			return max_d + 1;
+	}
+
+	/*
 	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
 	 * we cache all the lengths before starting the main loop -- but if all the
 	 * characters in both strings are single byte, then we skip this and use
@@ -277,92 +317,209 @@ levenshtein_internal(text *s, text *t,
 		++n;

 	/*
-	 * Instead of building an (m+1)x(n+1) array, we'll use two different
-	 * arrays of size m+1 for storing accumulated values. At each step one
-	 * represents the previous row and one is the current row of the
-	 * notional large array.
+	 * One way to compute Levenshtein distance is to incrementally construct
+	 * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+	 * of operations required to transform the first i characters of s into
+	 * the first j characters of t. The last column of the final row is the
+	 * answer.
+	 *
+	 * We use that algorithm here with some modification. In lieu of holding
+	 * the entire array in memory at once, we'll just use two arrays of size
+	 * m+1 for storing accumulated values. At each step
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
I'll try to illustrate the different approaches with examples.

     k  i  t  t  e  n
  0  1  2  3  4  5  6
s 1  1  2  3  4  5  6
i 2  2  1  2  3  4  5
t 3  3  2  1  2  3  4
t 4  4  3  2  1  2  3
i 5  5  4  3  2  2  3
n 6  6  5  4  3  3  2
g 7  7  6  5  4  4  3

The approach mentioned in Wikipedia limits the filling of the matrix to
the diagonals which are within distance k of the main diagonal (the
diagonal which contains the top left cell). All other cells are
guaranteed to have values greater than k. This approach can be extended
to the case of costs different from 1; in that case the limit diagonals
will be closer to the main diagonal, in proportion to the costs. Here is
an example of such a limited matrix with k = 3.

     k  i  t  t  e  n
  0  1  2  3
s 1  1  2  3  4
i 2  2  1  2  3  4
t 3  3  2  1  2  3  4
t    4  3  2  1  2  3
i       4  3  2  2  3
n          4  3  3  2
g             4  4  3

The first idea of my approach is to use the actual cell values to limit
the filling of the matrix. For each row, the range of columns where the
cell values are not greater than k is stored, and in the next row only
those cells are calculated which use values of cells from the previous
row in the stored range. Here is an example of this approach with k = 3.
There are slightly fewer filled cells, but the calculations are more
complicated than in the previous approach.

     k  i  t  t  e  n
  0  1  2  3
s 1  1  2  3
i 2  2  1  2  3
t 3  3  2  1  2  3
t       3  2  1  2  3
i          3  2  2  3
n             3  3  2
g                   3

The second idea is to make the values in the matrix as large as
possible. I analyzed what exactly the matrix is in this case. It is the
sum of the original matrix, which represents the distances between
prefixes of the strings, and a matrix which represents the cost of the
insertions or deletions needed to move to the diagonal containing the
bottom right cell.
Here is an example of the second matrix summand:

     k  i  t  t  e  n
  1  2  3  4  5  6  7
s 0  1  2  3  4  5  6
i 1  0  1  2  3  4  5
t 2  1  0  1  2  3  4
t 3  2  1  0  1  2  3
i 4  3  2  1  0  1  2
n 5  4  3  2  1  0  1
g 6  5  4  3  2  1  0

And an example of the resulting matrix:

        k   i   t   t   e   n
    1   3   5   7   9  11  13
s   1   2   4   6   8  10  12
i   3   2   2   4   6   8  10
t   5   4   2   2   4   6   8
t   7   6   4   2   2   4   6
i   9   8   6   4   2   3   5
n  11  10   8   6   4   3   3
g  13  12  10   8   6   5   3

The resulting matrix preserves an important property of the original
matrix: a cell's value is always greater than or equal to the values
which are used for its calculation. That's why we can use the idea of
limiting the matrix filling for this matrix too; but the values in this
matrix are greater, and that allows us to avoid filling a bigger part of
the matrix. Here is an example of limiting the matrix filling for this
matrix:

     k  i  t  t  e  n
  1  3
s 1  2
i    2  2
t       2  2
t          2  2
i             2  3
n                3  3
g                   3

With best regards,
Alexander Korotkov.
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
On Mon, Sep 27, 2010 at 8:42 AM, Alexander Korotkov
<aekorot...@gmail.com> wrote:
> The second idea is to make the values in the matrix as large as
> possible. I analyzed what exactly the matrix is in this case. It is the
> sum of the original matrix, which represents the distances between
> prefixes of the strings, and a matrix which represents the cost of the
> insertions or deletions needed to move to the diagonal containing the
> bottom right cell.
>
> Here is an example of the second matrix summand:
>
>      k  i  t  t  e  n
>   1  2  3  4  5  6  7
> s 0  1  2  3  4  5  6
> i 1  0  1  2  3  4  5
> t 2  1  0  1  2  3  4
> t 3  2  1  0  1  2  3
> i 4  3  2  1  0  1  2
> n 5  4  3  2  1  0  1
> g 6  5  4  3  2  1  0
>
> And an example of the resulting matrix:
>
>         k   i   t   t   e   n
>     1   3   5   7   9  11  13
> s   1   2   4   6   8  10  12
> i   3   2   2   4   6   8  10
> t   5   4   2   2   4   6   8
> t   7   6   4   2   2   4   6
> i   9   8   6   4   2   3   5
> n  11  10   8   6   4   3   3
> g  13  12  10   8   6   5   3
>
> The resulting matrix preserves an important property of the original
> matrix: a cell's value is always greater than or equal to the values
> which are used for its calculation.

Hmm. So if I understand correctly (and it's possible that I don't), the
idea here is that for each cell in the matrix, we store the sum of the
following two quantities:

1. The cost of transforming an initial substring of s into an initial
substring of t (as before), and

2. The smallest imaginable cost of transforming the remaining portion
of s into the remaining portion of t, based on the difference in the
string lengths.

Thus, any cell whose value is already > max_d can't be relevant to the
final result.

The code seems a bit complex, though. In particular, I'm wondering if
we couldn't simplify things by leaving the meaning of the cells in the
matrix unchanged and just using this calculation to adjust the lower
and upper bounds for each scan. Perhaps instead of curr/prev_left/right
we could call them min_i and max_i, and after each inner loop from
min_i to max_i we could try to increment min_i and decrement max_i
based on whether cur[min/max_i] + the smallest possible residual cost
is > max_d. Then you've got to also increment max_i once for each value
of j.

Does that make any sense or am I all wet?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] levenshtein_less_equal (was: multibyte character set in levenshtein function)
On Sat, Aug 28, 2010 at 8:34 AM, Alexander Korotkov
<aekorot...@gmail.com> wrote:
> Here is the patch which adds the levenshtein_less_equal function. I'm
> going to add it to the current commitfest.

There are some minor stylistic issues with this patch - e.g. lines
ending in whitespace, cuddled elses - but those don't look too terribly
difficult to fix. I'm more concerned about the fact that I don't really
understand the algorithm you're using. Actually, I didn't really
understand the original algorithm either until I went and read up on
it, and I just adjusted the comments to make it a bit more clear what
it's doing. That caused some minor merge conflicts with your patch, so
I'm attaching a rebased version that applies cleanly over my changes.

Can you explain a bit more what algorithm this is using? It seems like
in the max_d >= 0 case the cells of the notional array have a meaning
which is completely different from what they mean otherwise, and it's
not clear to me from reading the comments what that meaning is. I took
a look at that font of all human knowledge, Wikipedia:

http://en.wikipedia.org/wiki/Levenshtein_distance

Their suggestion for handling this case is:

"If we are only interested in the distance if it is smaller than a
threshold k, then it suffices to compute a diagonal stripe of width
2k+1 in the matrix. In this way, the algorithm can be run in O(kl)
time, where l is the length of the shortest string."

It seems like that may be similar to what you're doing here but I don't
think that's exactly it. I don't think that exact thing would work in
our case anyhow because we've got configurable costs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

levenshtein_less_equal-0.2.patch
Description: Binary data