Re: [HACKERS] multibyte character set in levenshtein function

2010-09-01 Thread Robert Haas
2010/8/28 Alexander Korotkov aekorot...@gmail.com:
 Now test for levenshtein_less_equal performance.

Nice results.  I'll try to find time to look at this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] multibyte character set in levenshtein function

2010-08-28 Thread Robert Haas
On Aug 28, 2010, at 8:34 AM, Alexander Korotkov aekorot...@gmail.com wrote:
 Here is the patch which adds the levenshtein_less_equal function. I'm going to 
 add it to the current commitfest.

Cool. Please submit some performance results comparing levenshtein in HEAD vs. 
levenshtein with this patch vs. levenshtein_less_equal. Perhaps the test cases 
we used previously would be a good place to start.

...Robert

Re: [HACKERS] multibyte character set in levenshtein function

2010-08-28 Thread Alexander Korotkov
Here is the patch which adds the levenshtein_less_equal function. I'm going to
add it to the current commitfest.


With best regards,
Alexander Korotkov.


On Tue, Aug 3, 2010 at 3:23 AM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov aekorot...@gmail.com
 wrote:
  Now I think the patch is as good as it can be. :)

 OK, committed.

  I'm going to prepare the less-or-equal function in the same manner as this patch.

 Sounds good.  Since we're now more than half-way through this
 CommitFest and this patch has undergone quite a bit of change from
 what we started out with, I'd like to ask that you submit the new
 patch for the next CommitFest, to begin September 15th.

 https://commitfest.postgresql.org/action/commitfest_view/open

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise Postgres Company

*** a/contrib/fuzzystrmatch/fuzzystrmatch.c
--- b/contrib/fuzzystrmatch/fuzzystrmatch.c
*************** PG_MODULE_MAGIC;
*** 61,66 ****
--- 61,68 ----
   */
  extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
  extern Datum levenshtein(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+ extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
  extern Datum metaphone(PG_FUNCTION_ARGS);
  extern Datum soundex(PG_FUNCTION_ARGS);
  extern Datum difference(PG_FUNCTION_ARGS);
*************** soundex_code(char letter)
*** 92,98 ****
  #define MAX_LEVENSHTEIN_STRLEN		255
  
  static int levenshtein_internal(text *s, text *t,
! 	 int ins_c, int del_c, int sub_c);
  
  
  /*
--- 94,100 ----
  #define MAX_LEVENSHTEIN_STRLEN		255
  
  static int levenshtein_internal(text *s, text *t,
! 	 int ins_c, int del_c, int sub_c, int max_d);
  
  
  /*
*************** rest_of_char_same(const char *s1, const char *s2, int len)
*** 202,211 ****
   *		  between supplied strings. Generally
   *		  (1, 1, 1) penalty costs suffices common
   *		  cases, but your mileage may vary.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 	 int ins_c, int del_c, int sub_c)
  {
  	int			m,
  n,
--- 204,217 ----
   *		  between supplied strings. Generally
   *		  (1, 1, 1) penalty costs suffices common
   *		  cases, but your mileage may vary.
+  *		  Returns an accurate value if max_d < 0 or
+  *		  the actual distance is less than or equal
+  *		  to max_d; otherwise returns a value greater
+  *		  than max_d.
   */
  static int
  levenshtein_internal(text *s, text *t,
! 	 int ins_c, int del_c, int sub_c, int max_d)
  {
  	int			m,
  n,
*************** levenshtein_internal(text *s, text *t,
*** 219,224 ****
--- 225,233 ----
  	const char *s_data;
  	const char *t_data;
  	const char *y;
+ 	const char *prev_x = NULL;
+ 	int			curr_left = 0, curr_right = 0, prev_left, prev_right, d;
+ 	int			delta = 0, min_d = 0, theor_max_d;
  
  	/* Extract a pointer to the actual character data. */
  	s_data = VARDATA_ANY(s);
*************** levenshtein_internal(text *s, text *t,
*** 238,243 ****
--- 247,266 ----
  		return n * ins_c;
  	if (!n)
  		return m * del_c;
+ 	
+ 	/*
+ 	 * There is a theoretical maximum distance based on the string lengths.
+ 	 * It represents the case when no characters match. If max_d reaches
+ 	 * this value then we can use the accurate calculation of the distance,
+ 	 * which is faster in this case.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		theor_max_d = Min(m*del_c + n*ins_c, (m > n) ?
+ 			(n * sub_c + (m - n) * del_c):(m * sub_c + (n - m) * ins_c)) - 1;
+ 		if (max_d >= theor_max_d)
+ 			max_d = -1;
+ 	}
  
  	/*
  	 * For security concerns, restrict excessive CPU+RAM usage. (This
*************** levenshtein_internal(text *s, text *t,
*** 251,256 ****
--- 274,296 ----
  		MAX_LEVENSHTEIN_STRLEN)));
  
  	/*
+ 	 * We can find the minimal distance by the difference of string lengths.
+ 	 */
+ 	if (max_d >= 0)
+ 	{
+ 		delta = m - n;
+ 		if (delta > 0)
+ 			min_d = delta * del_c;
+ 		else if (delta < 0)
+ 			min_d = - delta * ins_c;
+ 		else
+ 			min_d = 0;
+ 
+ 		if (min_d > max_d)
+ 			return max_d + 1;
+ 	}
+ 
+ 	/*
  	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
  	 * we cache all the lengths before starting the main loop -- but if all the
  	 * characters in both strings are single byte, then we skip this and use
*************** levenshtein_internal(text *s, text *t,
*** 286,369 ****
  	curr = prev + m;
  
  	/* Initialize the previous row to 0..cols */
! 	for (i = 0; i < m; i++)
! 		prev[i] = i * del_c;
  
  	/* Loop through rows of the notional array */
  	for (y = t_data, j = 1; j < n; j++)
  	{
  		int		   *temp;
- 		const char *x = s_data;
  		int			y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
  
  		/*
! 		 * First cell must increment sequentially, as we're on the j'th row of
! 		 * the (m+1)x(n+1) array.
! 		 */
! 		curr[0] = j * ins_c;
! 
! 		/*
! 		 * This inner loop is critical to performance, so we include a
  		 * fast-path to handle the 

Re: [HACKERS] multibyte character set in levenshtein function

2010-08-28 Thread Alexander Korotkov
SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;
SELECT SUM(levenshtein(a, 'foo')) from words2;
SELECT SUM(levenshtein(a, 'дом')) FROM words2;
SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;

Before this patch the results were:
67 98 63 102 108 177
And after the patch:
65 100 65 104 111 182
It is a bit slower, but I think that is the price for having less code
duplication.

Now test for levenshtein_less_equal performance.

test=# SELECT * FROM words WHERE levenshtein(a, 'extensize') <= 2;
 a
---
 expensive
 extension
 extensive
(3 rows)

Time: 98,083 ms

test=# SELECT * FROM words WHERE levenshtein_less_equal(a, 'extensize', 2)
<= 2;
 a
---
 expensive
 extension
 extensive
(3 rows)

Time: 48,069 ms

test=# SELECT * FROM words2 WHERE levenshtein(a, 'клубничный') <= 2;
 a

 кузничный
 кубичный
 клубочный
 клубничный
(4 rows)

Time: 197,499 ms

test=# SELECT * FROM words2 WHERE levenshtein_less_equal(a, 'клубничный', 3)
<= 2;
 a

 кузничный
 кубичный
 клубочный
 клубничный
(4 rows)

Time: 100,073 ms

It gives some speedup for searches on words with small distances.

test=# SELECT sum(levenshtein(a, 'extensize')) from words;
  sum

 835545
(1 row)

Time: 96,253 ms

test=# SELECT sum(levenshtein_less_equal(a, 'extensize', 20)) from words;
  sum

 835545
(1 row)

Time: 98,438 ms

With big distances it gives similar performance because in this case a similar
branch of code is used.
In order to test it with longer words I created a table with random
combinations of words.

test=# create table phrases (a text primary key);
test=# insert into phrases select string_agg(y.a, ' ') from (select x.a,
row_number() over () from (select a from words order by random()) x) y group
by (y.row_number / 10);

test=# select * from phrases limit 10;

a
--
 hosiery's spermatozoa Haleakala disco greenness beehive paranoids atrophy
unfair
 Weinberg's all profanity's individualized bellowed perishables Romanov's
communicant's landlady's pauperized
 Ito seasoned jawbreakers you'll hurling's showcasing anecdota cauliflower
intellectualism facilitate
 infantryman's busheled designing lucid nutritionists Aesculapius's rifle
clefting exult Milosevic
 foresight lurking capitulation's pantsuits traumatizes moonlighting
lancer's Longfellow rising unclasps
 permutes centralest cavalryman Dwight granddaughter knight tallyho's sober
graduate locket's
 knucklehead courtliest sapphires baize coniferous emolument antarctic
Laocoon's deadens unseemly
 Tartars utopians Pekingese's Cartwright badge's Dollie's spryness Ajax's
Amber's bookkeeper
 spouting concordant Indus carbonation cinnamon's stockbrokers Evita's
Canaries Waldorf's rampage
 Amparo exertions Kuznetsk's divots humongous wolverine chugged laurels
Goliaths hazelnut
(10 rows)

test=# select * from phrases where levenshtein('nucklehead courtliest
sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a)
<= 10;

a
--
 knucklehead courtliest sapphires baize coniferous emolument antarctic
Laocoon's deadens unseemly
(1 row)

Time: 581,850 ms

test=# select * from phrases where levenshtein_less_equal('nucklehead
courtliest   sapphires be coniferous emolument antarctic Laocoon''s deadens
unseemly', a, 10) <= 10;

a
--
 knucklehead courtliest sapphires baize coniferous emolument antarctic
Laocoon's deadens unseemly
(1 row)

Time: 22,876 ms

It gives a great speedup for searches with small distances on relatively long
strings.

test=# select sum(levenshtein('nucklehead courtliest   sapphires be
coniferous emolument antarctic Laocoon''s deadens unseemly', a)) from
phrases;
  sum

 794601
(1 row)

Time: 571,138 ms

test=# select sum(levenshtein_less_equal('nucklehead courtliest
sapphires be coniferous emolument antarctic Laocoon''s deadens unseemly', a,
100)) from phrases;
  sum

 794601
(1 row)

Time: 562,895 ms

Similar results appear for multi-byte strings.

test=# create table phrases2 (a text primary key);
test=# insert into phrases2 select string_agg(y.a, ' ') from (select x.a,
row_number() over () from (select a from words2 order by random()) x) y
group by (y.row_number / 10);
INSERT 0 14584


test=# select * from phrases2 limit 10;

a

---
 борис спиннинг растопочный цифра выдохнуть иридий гнёздышко омоновский
базовый
 пожжёте закосить насыщающий паратый продрых обеспложивание милливатт
бамбуковый отпекающий книжница
 приложиться разный устремляющийся отучающийся абрикосовка 

Re: [HACKERS] multibyte character set in levenshtein function

2010-08-04 Thread Alexander Korotkov
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote:

 I reviewed this code in a fair amount of detail today and ended up
 rewriting it.  In general terms, it's best to avoid changing things
 that are not relevant to the central purpose of the patch.  This patch
 randomly adds a whole bunch of whitespace and removes a whole bunch of
 comments which I find useful and do not want to see removed.  It took
 me about an hour to reverse out all of those changes, which then let
 me get down to the business of actually analyzing the code.

I'm sorry. This is my first patch; I will be more careful next time. But I
think there is no unified opinion about some changes, like the replacement of
!m with m == 0.

I think this line is not correct:
if (m != s_bytes && n != t_bytes)
The correct negation of the assumption that both strings are single-byte is
the assumption that at least one string is not single-byte. With this patch
the levenshtein function can calculate the distance incorrectly:
test=# select levenshtein('фw', 'ww');
 levenshtein
-
   2
(1 row)
This line should be rewritten as:
if (m != s_bytes || n != t_bytes)

The attached patch takes the approach of using a fast-path for just
 the innermost loop when neither string contains any multi-byte
 characters (regardless of whether or not the encoding allows
 multi-byte characters).  It's still slower than CVS HEAD, but for
 strings containing no multi-byte characters it's only marginally, if
 at all, slower than previous releases, because 9.0 doesn't contain
 your fix to avoid the extra string copy; and it's faster than your
 implementation even when multi-byte characters ARE present.  It is
 slightly slower on single-byte encoding databases (I tested
 SQL_ASCII), but still faster than previous releases.  It's also quite
 a bit less code duplication.


Can I look at the test which shows that your implementation is faster than
mine when multi-byte characters are present? My tests show the reverse. For
example, with the Russian dictionary of OpenOffice:

With my version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
-
 1675281
(1 row)

Time: 277,190 ms

With your version of patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
-
 1675281
(1 row)

Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This
function is very fast for long regions of memory, but for a few bytes an
inline function is faster, because the compiler has more freedom for
optimization. After replacing memcmp with an inline function like the
following in your implementation:

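/* Return true if the first len bytes of s and d are identical. */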
static inline bool char_cmp(const char *s, const char *d, int len)
{
while (len--)
{
if (*s++ != *d++)
return false;
}
return true;
}

The code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
-
 1675281
(1 row)

Time: 241,272 ms


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-08-04 Thread Alexander Korotkov
Now I think the patch is as good as it can be. :)
I'm going to prepare the less-or-equal function in the same manner as this patch.


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-08-02 Thread Robert Haas
2010/8/2 Alexander Korotkov aekorot...@gmail.com:
 On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote:
 I reviewed this code in a fair amount of detail today and ended up
 rewriting it.  In general terms, it's best to avoid changing things
 that are not relevant to the central purpose of the patch.  This patch
 randomly adds a whole bunch of whitespace and removes a whole bunch of
 comments which I find useful and do not want to see removed.  It took
 me about an hour to reverse out all of those changes, which then let
 me get down to the business of actually analyzing the code.
 I'm sorry. This is my first patch; I will be more careful next time. But I
 think there is no unified opinion about some changes, like the replacement of
 !m with m == 0.

Sure; if that were the only such change, I wouldn't have mentioned it.

 I think this line is not correct:
 if (m != s_bytes && n != t_bytes)

Good catch, fixed in the attached version.

 The attached patch takes the approach of using a fast-path for just
 the innermost loop when neither string contains any multi-byte
 characters (regardless of whether or not the encoding allows
 multi-byte characters).  It's still slower than CVS HEAD, but for
 strings containing no multi-byte characters it's only marginally, if
 at all, slower than previous releases, because 9.0 doesn't contain
 your fix to avoid the extra string copy; and it's faster than your
 implementation even when multi-byte characters ARE present.  It is
 slightly slower on single-byte encoding databases (I tested
 SQL_ASCII), but still faster than previous releases.  It's also quite
 a bit less code duplication.

 Can I look at the test which shows that your implementation is faster than
 mine when multi-byte characters are present? My tests show the reverse. For
 example, with the Russian dictionary of OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters).  I did this:

CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';

SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran about 102 ms, 222 ms,
129 ms.  With your patch, I get about 134 ms, 222 ms, 130 ms.  Perhaps
this is because I only have multi-byte characters in one of the
inputs, not both.  I don't have a dictionary handy with multi-byte
words in it.  Can you send me a file?

 I think that the cause of the slowdown is the memcmp function. This
 function is very fast for long regions of memory, but for a few bytes an inline
 function is faster, because the compiler has more freedom for optimization.
 After replacing memcmp with inline function like following in your
 implementation:

Yeah, that does seem to be slightly better.  I've done a version of
this in the attached patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


mb_levenshtein_rmh-v2.patch
Description: Binary data



Re: [HACKERS] multibyte character set in levenshtein function

2010-08-02 Thread Robert Haas
2010/8/2 Alexander Korotkov aekorot...@gmail.com:
 The dump of the table with the Russian dictionary is attached.

 I use the following tests:
 SELECT SUM(levenshtein(a, 'foo')) from words;
 SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
 SELECT SUM(levenshtein(a, 'ańs')) FROM words;
 SELECT SUM(levenshtein(a, 'foo')) from words2;
 SELECT SUM(levenshtein(a, 'дом')) FROM words2;
 SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;

 With your last version of the patch the results are:
 63ms 94ms 61ms 100ms 121ms 217ms.

 But I found some other optimization. With this condition:
 if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len &&
     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
 results become:
 58ms 96ms 63ms 102ms 107ms 171ms

 On single-byte characters the results almost didn't change, but they get better
 with multi-byte characters. Generally, this improvement is because the first
 bytes of different characters are frequently the same (for example, almost all
 Cyrillic characters start with the same byte, and I think there is a similar
 situation in other alphabets), but a match of the last bytes is just a
 coincidence.
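
A sketch of how the combined test could be packaged as a helper (the name
char_eq is hypothetical, and char_same stands for the byte-prefix comparison,
in the spirit of the inline char_cmp shown earlier in the thread):

static inline bool char_eq(const char *x, int x_char_len,
                           const char *y, int y_char_len)
{
    /* multibyte characters can only match if they have the same length */
    if (x_char_len != y_char_len)
        return false;
    /* compare the last byte first: characters of one alphabet often share
     * their leading bytes, so a mismatch surfaces fastest at the end */
    if (x[x_char_len - 1] != y[y_char_len - 1])
        return false;
    return x_char_len == 1 || char_same(x, y, x_char_len - 1);
}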

Hey, that's really cool.  It helps a lot here, too:

My previous patch, with your 6 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same 6 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last character
first, same 6 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


mb_levenshtein_rmh-v3.patch
Description: Binary data



Re: [HACKERS] multibyte character set in levenshtein function

2010-08-02 Thread Robert Haas
On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov aekorot...@gmail.com wrote:
 Now I think the patch is as good as it can be. :)

OK, committed.

 I'm going to prepare the less-or-equal function in the same manner as this patch.

Sounds good.  Since we're now more than half-way through this
CommitFest and this patch has undergone quite a bit of change from
what we started out with, I'd like to ask that you submit the new
patch for the next CommitFest, to begin September 15th.

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] multibyte character set in levenshtein function

2010-08-01 Thread Robert Haas
On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov
aekorot...@gmail.com wrote:
 Ok, here is the patch for multi-byte characters.
 I changed the arguments of the levenshtein_internal function from text * to const
 char * and int. I think that it makes levenshtein_internal more reusable.
 For example, this function can be used for calculating the distance of two
 fragments of strings without the need to copy those fragments.
 Actually, I'm not fully satisfied with my solution for merging the single-byte
 and multi-byte versions of the function. There are ifdef's in the body of the
 function and non context-free macros. This is because the multi-byte version
 needs to store the character lengths returned by pg_mblen, while the single-byte
 version doesn't need that. I tried the multi-byte version without storing
 character lengths, but in that case the function is about 1.5 times slower
 (the function needs to call pg_mblen n*m times).
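
A minimal sketch of that reusability point, assuming the (const char *, int)
signature described above (buf, off1/len1, and off2/len2 are hypothetical):

static int levenshtein_internal(const char *s, int s_len,
                                const char *t, int t_len,
                                int ins_c, int del_c, int sub_c);

/* distance between two fragments of the same buffer, with no copying: */
int d = levenshtein_internal(buf + off1, len1, buf + off2, len2, 1, 1, 1);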

I reviewed this code in a fair amount of detail today and ended up
rewriting it.  In general terms, it's best to avoid changing things
that are not relevant to the central purpose of the patch.  This patch
randomly adds a whole bunch of whitespace and removes a whole bunch of
comments which I find useful and do not want to see removed.  It took
me about an hour to reverse out all of those changes, which then let
me get down to the business of actually analyzing the code.   The
biggest problem I found is that, as coded, this is more than 50%
slower on UTF-8 databases, because the multi-byte path is really slow,
even if there are actually no multi-byte characters present.

The attached patch takes the approach of using a fast-path for just
the innermost loop when neither string contains any multi-byte
characters (regardless of whether or not the encoding allows
multi-byte characters).  It's still slower than CVS HEAD, but for
strings containing no multi-byte characters it's only marginally, if
at all, slower than previous releases, because 9.0 doesn't contain
your fix to avoid the extra string copy; and it's faster than your
implementation even when multi-byte characters ARE present.  It is
slightly slower on single-byte encoding databases (I tested
SQL_ASCII), but still faster than previous releases.  It's also quite
a bit less code duplication.

Please review and let me know your thoughts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


mb_levenshtein_rmh.patch
Description: Binary data



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-29 Thread Alexander Korotkov
I forgot the attribution in the levenshtein.c file.


With best regards,
Alexander Korotkov.


fuzzystrmatch-0.5.1.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-29 Thread Robert Haas
On Wed, Jul 21, 2010 at 5:59 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
 On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote:

 *scratches head*  Aren't you just moving the same call to a different
 place?

 So, where can you find this different place? :) In this patch
 null-terminated strings are not used at all.

 I can't.  You win.  :-)

 Actually, I wonder if there's enough performance improvement there
 that we might think about extracting that part of the patch and applying
 it separately.  Then we could continue trying to figure out what to do
 with the rest.  Sometimes it's simpler to deal with one change at a
 time.

I tested this today and the answer was a resounding yes.  I ran
sum(levenshtein(t, 'foo')) over a dictionary file with about 2 million
words and got a speedup of around 15% just by eliminating the
text_to_cstring() calls.  So I've committed that part of this patch.
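
For reference, a sketch of the change in question, using the standard varlena
accessors (variable names are illustrative):

/* before: flatten each argument into a NUL-terminated copy */
char       *s = text_to_cstring(PG_GETARG_TEXT_P(0));

/* after: work on the varlena bytes in place, with no copy */
text       *t = PG_GETARG_TEXT_PP(0);
const char *s_data = VARDATA_ANY(t);        /* raw bytes, not NUL-terminated */
int         s_bytes = VARSIZE_ANY_EXHDR(t); /* length in bytes */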

I'll try to look at the rest of the patch when I get a chance, but I'm
wondering if it might make sense to split it into two patches -
specifically, one patch to handle multi-byte characters correctly, and
then a second patch for the less-than-or-equal-to functions.  I think
that might simplify reviewing a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-27 Thread Alexander Korotkov
Here is a new version of my patch. There are the following changes:

1) I've merged the single-byte and multi-byte versions of levenshtein_internal and
levenshtein_less_equal_internal using macros and includes.
2) I found that levenshtein takes reasonable time even for long strings.
Here is an example with strings which contain random combinations of words.

test=# select count(*) from words;
 count
---
 98569
(1 row)

test=# select * from words limit 10;
  id   |
word| next_id
---++-
   200 | Agnew diodes drilled composts Wassermann nonprofit adjusting
Chautauqua|   17370
  4589 | Dhaka craziness savvies teenager ploughs Barents's unwed
zither|   70983
  5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal
punctuality |   96685
  6103 | G prompt boron overthrew cricking begonia snuffers
blazed  |   34518
  4707 | Djibouti Mari pencilling approves pointing's pashas magpie
rams|   71200
 10125 | Lyle Edgardo livers gruelled capable cancels gnawing's
inhospitable|   28018
  9615 | Leos deputy evener yogis assault palatals Octobers
surceased   |   21878
 15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier
Algeria  |   19316
 15776 | Sucre battle's misapprehension's predicating nutmegged
electrification bowl planks |   65739
 16878 | Updike Forster ragout keenly exhalation grayed switch
guava's  |   43013
(10 rows)

Time: 26,125 ms

Time: 137,061 ms
test=# select avg(length(word)) from words;
 avg
-
 74.6052207083362923
(1 row)

Time: 96,415 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher')<=10;
  id  |  word   |
next_id
--+-+-
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 2957,426 ms
test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher',10)<=10;
  id  |  word   |
next_id
--+-+-
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 84,755 ms

It takes O(max(n, m) * max_d / min(ins_c, del_c)) time in the worst case. That's
why I changed the limitation of this function to max_d <= 255 * min(ins_c, del_c).

3) I found that there is no need to store character offsets in
levenshtein_less_equal_internal_mb. Instead, the first position in the
string where the value of the distance wasn't greater than max_d can be stored.

4) I found that there is a theoretical maximum distance based on the string
lengths. It represents the case when no characters match. If the
theoretical maximum distance is less than the given maximum distance, we can use
the theoretical maximum distance. It improves the behavior of levenshtein_less_equal
with high max_d, but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka
craziness savvies teeae luhs Barentss unwe zher',1000)<=10;
  id  |  word   |
next_id
--+-+-
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 4102,731 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies
teeae luhs Barentss unwe zher')<=10;
  id  |  word   |
next_id
--+-+-
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |
70983
(1 row)

Time: 2983,244 ms
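
A worked sketch of point 4, with unit costs assumed (Min is the macro from
PostgreSQL's c.h; the function name is illustrative):

static int theor_max_distance(int m, int n, int ins_c, int del_c, int sub_c)
{
    /* cost of transforming s into t when no characters match at all */
    int no_match = (m > n) ? n * sub_c + (m - n) * del_c
                           : m * sub_c + (n - m) * ins_c;

    return Min(m * del_c + n * ins_c, no_match) - 1;
}

/* e.g. m = 10, n = 6, costs (1, 1, 1): Min(16, 10) - 1 = 9, so any
 * max_d >= 9 can simply fall back to the plain calculation. */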


With best regards,
Alexander Korotkov.


fuzzystrmatch-0.5.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-23 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of Thu Jul 22 03:21:57 -0400 2010:
 On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote:
 
  Ah, I see.  That's pretty compelling, I guess.  Although it still
  seems like a lot of code...
 
 I think there is a way to merge single-byte and multi-byte versions of
 functions without loss in performance using macros and includes (like in
 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

Using the trick of a single source file similar to like_match.c that
implements all the different behaviors, and including that file more than
once with different macro definitions, is probably a good idea.
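
A minimal sketch of that trick, with illustrative file names (the real layout
is up to the patch):

/* levenshtein_body.c: one copy of the algorithm, compiled twice */
#ifdef MULTIBYTE
#define LEV_FUNC levenshtein_internal_mb
#else
#define LEV_FUNC levenshtein_internal
#endif

static int LEV_FUNC(const char *s, int s_len, const char *t, int t_len,
                    int ins_c, int del_c, int sub_c)
{
    /* shared body; #ifdef MULTIBYTE only where character lengths matter */
    return 0;               /* placeholder for the real algorithm */
}

#undef LEV_FUNC

/* fuzzystrmatch.c: */
#include "levenshtein_body.c"   /* defines levenshtein_internal()    */
#define MULTIBYTE
#include "levenshtein_body.c"   /* defines levenshtein_internal_mb() */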



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-22 Thread Robert Haas
On Thu, Jul 22, 2010 at 3:21 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote:

 Ah, I see.  That's pretty compelling, I guess.  Although it still
 seems like a lot of code...

 I think there is a way to merge single-byte and multi-byte versions of
 functions without loss in performance using macros and includes (like in
 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

I'm not sure.  I'd like to get a second opinion from someone else on
which way to go with this, but unfortunately 2 or 3 of the main
committers are on vacation at the moment.

Does anyone else have an opinion on what to do about this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-22 Thread Alexander Korotkov
Such a version with macros and includes could look like this:

#ifdef MULTIBYTE
#define NEXT_X (x+= char_lens[i-1])
#define NEXT_Y (y+= y_char_len)
#define CMP (char_cmp(x, char_lens[i-1], y, y_char_len))
#else
#define NEXT_X (x++)
#define NEXT_Y (y++)
#define CMP (*x == *y)
#endif

static int
levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)
{
int m,
n,
d;
int   *prev;
int   *curr;
int i,
j;
const char *x;
const char *y;
#ifdef MULTIBYTE
int   *char_lens;
int y_char_len;
#endif

/*
 * We should calculate number of characters for multibyte encodings
 */
m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));
n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));

/*
 * We can transform an empty s into t with n insertions, or a non-empty t
 * into an empty s with m deletions.
 */
if (m == 0)
return n * ins_c;
if (n == 0)
return m * del_c;

/*
 * For security concerns, restrict excessive CPU+RAM usage. (This
 * implementation uses O(m) memory and has O(mn) complexity.)
 */
if (m > MAX_LEVENSHTEIN_STRLEN ||
n > MAX_LEVENSHTEIN_STRLEN)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("argument exceeds the maximum length of %d bytes",
MAX_LEVENSHTEIN_STRLEN)));

/* One more cell for initialization column and row. */
++m;
++n;

/*
 * Instead of building an (m+1)x(n+1) array, we'll use two different
 * arrays of size m+1 for storing accumulated values. At each step one
 * represents the previous row and one is the current row of the
 * notional large array.
 * For multibyte encoding we'll also store array of lengths of
 * characters of first string in order to avoid great number of
 * pg_mblen calls.
 */
#ifdef MULTIBYTE
prev = (int *) palloc(3 * m * sizeof(int));
curr = prev + m;
char_lens = prev + 2 * m;
for (i = 0, x = VARDATA_ANY(s); i < m - 1; i++)
{
char_lens[i] = pg_mblen(x);
x += char_lens[i];
}
char_lens[i] = 0;
#else
prev = (int *) palloc(2 * m * sizeof(int));
curr = prev + m;
#endif

/* Initialize the previous row to 0..cols */
for (i = 0; i < m; i++)
prev[i] = i * del_c;

/*
 * For multibyte encoding we should increase x and y pointers by the
 * results of pg_mblen function. Also we should use CHAR_CMP macros
 * for character comparison.
 */
for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)
{
int   *temp;
#ifdef MULTIBYTE
y_char_len = pg_mblen(y);
#endif

curr[0] = j * ins_c;

for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)
{
intins;
intdel;
intsub;

ins = prev[i] + ins_c;
del = curr[i - 1] + del_c;
sub = prev[i - 1] + (CMP ? 0 : sub_c);
curr[i] = Min(ins, del);
curr[i] = Min(curr[i], sub);
}

temp = curr;
curr = prev;
prev = temp;
}

d = prev[m - 1];

/*
 * Because the final value was swapped from the previous row to the
 * current row, that's where we'll find it.
 */
return d;
}

What do you think about it?


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-22 Thread Alexander Korotkov
On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas robertmh...@gmail.com wrote:

 Ah, I see.  That's pretty compelling, I guess.  Although it still
 seems like a lot of code...

I think there is a way to merge single-byte and multi-byte versions of
functions without loss in performance using macros and includes (like in
'backend/utils/adt/like.c'). Do you think it is acceptable in this case?


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-21 Thread Alexander Korotkov
On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote:

 This patch still needs some work.  It includes a bunch of stylistic
 changes that aren't relevant to the purpose of the patch.  There's no
 reason that I can see to change the existing levenshtein_internal
 function to take text arguments instead of char *, or to change !m to
 m == 0 in existing code, or to change the whitespace in the comments
 of that function.  All of those need to be reverted before we can
 consider committing this.

I changed the arguments of the function from char * to text * in order to avoid
the text_to_cstring call. The same benefit can be achieved by replacing char *
with char * and a length.
I changed !m to m == 0 because Itagaki asked me to make it conform to the coding
style. Do you think there is no reason to fix the coding style in existing
code?

 There is a huge amount of duplicated code here.  I think it makes
 sense to have the multibyte version of the function be separate, but
 perhaps we can merge the less-than-or-equal-to bits into the main
 code, so that we only have two copies instead of four.  Perhaps we
 can just add a max_d argument, max_distance, to levenshtein_internal;
 and if this value is >= 0 then it represents the max allowable
 distance, but if it is < 0 then there is no limit.  Sure, that might
 slow down the existing code a bit, but it might not be significant.
 I'd at least like to see some numbers showing that it is significant
 before we go to this much trouble.

In that case we would have to add many checks of max_d to the levenshtein_internal
function, which would make the code more complex.
Actually, we could merge all four functions into one function. But such a
function would have many checks for multibyte encoding and max_d. So, I see
four cases here:
1) one function with checks for multibyte encoding and max_d
2) two functions with checks for multibyte encoding
3) two functions with checks for max_d
4) four separate functions
If you prefer case number 3 you should argue your position a little more.

 The code doesn't follow the project coding style.  Braces must be
 uncuddled.  Comment paragraphs will be reflowed unless they begin and
 end with --.  Function definitions should have the type
 declaration on one line and the function name at the start of the
 next.

 Freeing memory with pfree is likely a waste of time; is there any
 reason not to just rely on the memory context reset, as the original
 coding did?

Ok, I'll fix these things.


 I think we might need to remove the acknowledgments section from this
 code.  If everyone who touches this code adds their name, we're
 quickly going to have a mess.  If we're not going to remove the
 acknowledgments section, then please add my name, too, because I've
 already patched this code once...

In that case I think we can leave the original acknowledgments section.


With best regards,
Alexander Korotkov.


Re: [HACKERS] multibyte character set in levenshtein function

2010-07-21 Thread Robert Haas
On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote:
 This patch still needs some work.  It includes a bunch of stylistic
 changes that aren't relevant to the purpose of the patch.  There's no
 reason that I can see to change the existing levenshtein_internal
 function to take text arguments instead of char *, or to change !m to
 m == 0 in existing code, or to change the whitespace in the comments
 of that function.  All of those need to be reverted before we can
 consider committing this.

 I changed the arguments of the function from char * to text * in order to avoid
 the text_to_cstring call.

*scratches head*  Aren't you just moving the same call to a different place?

 The same benefit can be achieved by replacing char * with
 char * and a length.
 I changed !m to m == 0 because Itagaki asked me to make it conform to the coding
 style. Do you think there is no reason to fix the coding style in existing
 code?

Yeah, we usually try to avoid changing that sort of thing in existing
code, unless there's a very good reason.

 There is a huge amount of duplicated code here.  I think it makes
 sense to have the multibyte version of the function be separate, but
 perhaps we can merge the less-than-or-equal-to bits into the main
 code, so that we only have two copies instead of four.  Perhaps we
 can just add a max_d argument, max_distance, to levenshtein_internal;
 and if this value is >= 0 then it represents the max allowable
 distance, but if it is < 0 then there is no limit.  Sure, that might
 slow down the existing code a bit, but it might not be significant.
 I'd at least like to see some numbers showing that it is significant
 before we go to this much trouble.

 In that case we would have to add many checks of max_d to the levenshtein_internal
 function, which would make the code more complex.

When you say many checks, how many?

 Actually, we could merge all four functions into one function. But such a
 function would have many checks for multibyte encoding and max_d. So, I see
 four cases here:
 1) one function with checks for multibyte encoding and max_d
 2) two functions with checks for multibyte encoding
 3) two functions with checks for max_d
 4) four separate functions
 If you prefer case number 3 you should argue your position a little more.

I'm somewhat convinced that separating the multibyte case out has a
performance benefit both by intuition and because you posted some
numbers, but I haven't seen any argument for separating out the other
case, so I'm asking if you've checked and whether there is an effect
and whether it's significant.  The default is always to try to avoid
maintaining multiple copies of substantially identical code, due to
the danger that a future patch might fail to update all of them and
thus introduce a bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-21 Thread Alexander Korotkov
On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote:

 *scratches head*  Aren't you just moving the same call to a different
 place?

So, where can you find this different place? :) In this patch
null-terminated strings are not used at all.


 Yeah, we usually try to avoid changing that sort of thing in existing
  code, unless there's a very good reason.

Ok.

 In that case we would have to add many checks of max_d to the levenshtein_internal
  function, which would make the code more complex.

 When you say many checks, how many?

  Actually, we could merge all four functions into one function. But such a
  function would have many checks for multibyte encoding and max_d. So, I
 see
  four cases here:
  1) one function with checks for multibyte encoding and max_d
  2) two functions with checks for multibyte encoding
  3) two functions with checks for max_d
  4) four separate functions
  If you prefer case number 3 you should argue your position a little more.

 I'm somewhat convinced that separating the multibyte case out has a
 performance benefit both by intuition and because you posted some
 numbers, but I haven't seen any argument for separating out the other
 case, so I'm asking if you've checked and whether there is an effect
 and whether it's significant.  The default is always to try to avoid
 maintaining multiple copies of substantially identical code, due to
 the danger that a future patch might fail to update all of them and
 thus introduce a bug.


I've tested it with a big value of max_d and I thought it was evident that
checking for a negative value of max_d would not produce a significant benefit.
Anyway, I tried adding a check for negative max_d to the
levenshtein_less_equal_mb function.

static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
int ins_c, int del_c, int sub_c, int max_d)
{
int m,
n;
int   *prev;
int   *curr;
int i,
j;
const char *x;
const char *y;
CharLengthAndOffset *lengths_and_offsets;
int y_char_len;
int curr_left, curr_right, prev_left, prev_right, d;
int delta, min_d;


/*
 * We should calculate number of characters for multibyte encodings
 */
m = pg_mbstrlen_with_len(s, s_len);
n = pg_mbstrlen_with_len(t, t_len);

/*
 * We can transform an empty s into t with n insertions, or a non-empty t
 * into an empty s with m deletions.
 */
if (m == 0)
return n * ins_c;
if (n == 0)
return m * del_c;

/*
 * We can find the minimal distance by the difference of lengths
 */
delta = m - n;
if (delta > 0)
min_d = delta * del_c;
else if (delta < 0)
min_d = - delta * ins_c;
else
min_d = 0;
if (max_d >= 0 && min_d > max_d)
return max_d + 1;

/*
 * For security concerns, restrict excessive CPU+RAM usage. (This
 * implementation uses O(m) memory and has O(mn) complexity.)
 */
if (m > MAX_LEVENSHTEIN_STRLEN ||
n > MAX_LEVENSHTEIN_STRLEN)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("argument exceeds the maximum length of %d bytes",
MAX_LEVENSHTEIN_STRLEN)));

/* One more cell for initialization column and row. */
++m;
++n;


/*
 * Instead of building an (m+1)x(n+1) array, we'll use two different
 * arrays of size m+1 for storing accumulated values. At each step one
 * represents the previous row and one is the current row of the
 * notional large array.
 * For multibyte encoding we'll also store array of lengths of
 * characters and array with character offsets in first string
 * in order to avoid great number of
 * pg_mblen calls.
 */
prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) *
m );
curr = prev + m;
lengths_and_offsets = (CharLengthAndOffset *)(prev + 2 * m);
lengths_and_offsets[0].offset = 0;
for (i = 0, x = s; i < m - 1; i++)
{
lengths_and_offsets[i].length = pg_mblen(x);
lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
lengths_and_offsets[i].length;
x += lengths_and_offsets[i].length;
}
lengths_and_offsets[i].length = 0;


/* Initialize the previous row to 0..cols */
curr_left = 1;
d = min_d;
for (i = 0; i < delta; i++)
{
prev[i] = d;
}
curr_right = m;
for (; i < m; i++)
{
prev[i] = d;
d += (ins_c + del_c);
if (max_d >= 0 && d > max_d)
{
curr_right = i;
break;
}
}

/*
 * There are the following optimizations:
 * 1) Actually the minimal possible value of the final distance (in the
 * case of all possible matches) is stored in the cells of the matrix.
 * In the case of movement towards the diagonal, which contains the last
 * cell, the value 

Re: [HACKERS] multibyte character set in levenshtein function

2010-07-21 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Wed Jul 21 14:25:47 -0400 2010:
 On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas robertmh...@gmail.com wrote:

  The same benefit can be achieved by replacing char * with
  char * and a length.
  I changed !m to m == 0 because Itagaki asked me to make it conform to the coding
  style. Do you think there is no reason to fix the coding style in existing
  code?
 
 Yeah, we usually try to avoid changing that sort of thing in existing
 code, unless there's a very good reason.

I think fixing a stylistic issue in code that's being edited for other
purposes is fine, and a good idea going forward.  We wouldn't commit a
patch that would *only* fix those, because that would cause a problem
for backpatches for no benefit, but if the patch touches something else,
then a backpatch of another patch is going to need manual intervention
anyway.



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-21 Thread Robert Haas
On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
aekorot...@gmail.com wrote:
 On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote:

 *scratches head*  Aren't you just moving the same call to a different
 place?

 So, where can you find this different place? :) In this patch
 null-terminated strings are not used at all.

I can't.  You win.  :-)

Actually, I wonder if there's enough performance improvement there
that we might think about extracting that part of the patch and applying
it separately.  Then we could continue trying to figure out what to do
with the rest.  Sometimes it's simpler to deal with one change at a
time.

 I tested it with the american-english dictionary with 98569 words.

 test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
    sum
 -
  1074376
 (1 row)

 Time: 131,435 ms
 test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from
 words;
    sum
 -
  1074376
 (1 row)

 Time: 221,078 ms
 test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from
 words;
    sum
 -
  1074376
 (1 row)

 Time: 254,819 ms

 The function with a negative value of max_d didn't become faster than with
 just a big value of max_d.

Ah, I see.  That's pretty compelling, I guess.  Although it still
seems like a lot of code...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-20 Thread Itagaki Takahiro
2010/7/13 Alexander Korotkov aekorot...@gmail.com:
 Anyway I think that the overhead is not ignorable. That's why I have split
 levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
 and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
 levenshtein_less_equal_internal_mb.

Thank you for the good measurements! Then, it's reasonable to have multiple
implementations. It also has documentation. I'll change the status of the
patch to Ready for Committer.

The patch is good enough except for the argument types of some functions.
For example:
  - char* vs. const char*
  - text* vs. const char* + length
I hope committers will check whether there are better types for them.

-- 
Itagaki Takahiro



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-20 Thread Robert Haas
On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 2010/7/13 Alexander Korotkov aekorot...@gmail.com:
 Anyway I think that the overhead is not ignorable. That's why I have split
 levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
 and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
 levenshtein_less_equal_internal_mb.

 Thank you for the good measurements! Then, it's reasonable to have multiple
 implementations. It also has documentation. I'll change the status of the
 patch to Ready for Committer.

 The patch is good enough except for the argument types of some functions.
 For example:
  - char* vs. const char*
  - text* vs. const char* + length
 I hope committers will check whether there are better types for them.

This patch still needs some work.  It includes a bunch of stylistic
changes that aren't relevant to the purpose of the patch.  There's no
reason that I can see to change the existing levenshtein_internal
function to take text arguments instead of char *, or to change !m to
m == 0 in existing code, or to change the whitespace in the comments
of that function.  All of those need to be reverted before we can
consider committing this.

There is a huge amount of duplicated code here.  I think it makes
sense to have the multibyte version of the function be separate, but
perhaps we can merge the less-than-or-equal-to bits into the main
code, so that we only have two copies instead of four.  Perhaps we
can just add a max_d argument, max_distance, to levenshtein_internal;
and if this value is >= 0 then it represents the max allowable
distance, but if it is < 0 then there is no limit.  Sure, that might
slow down the existing code a bit, but it might not be significant.
I'd at least like to see some numbers showing that it is significant
before we go to this much trouble.
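
For concreteness, a sketch of the merged interface proposed here; it matches
the signature that appears in the levenshtein_less_equal patch near the top
of this archive:

static int levenshtein_internal(text *s, text *t,
                                int ins_c, int del_c, int sub_c, int max_d);

/* max_d >= 0: may return early once the distance is known to exceed it;
 * max_d < 0:  no limit, always compute the exact distance. */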

The code doesn't follow the project coding style.  Braces must be
uncuddled.  Comment paragraphs will be reflowed unless they begin and
end with --.  Function definitions should have the type
declaration on one line and the function name at the start of the
next.

Freeing memory with pfree is likely a waste of time; is there any
reason not to just rely on the memory context reset, as the original
coding did?

I think we might need to remove the acknowledgments section from this
code.  If everyone who touches this code adds their name, we're
quickly going to have a mess.  If we're not going to remove the
acknowledgments section, then please add my name, too, because I've
already patched this code once...

I'm going to set this back to Waiting on Author.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-13 Thread Alexander Korotkov
Hi!

* levenshtein_internal() and levenshtein_less_equal_internal() are very
  similar. Can you merge the code? We can always use less_equal_internal()
  if the overhead is ignorable. Did you compare them?

With a big value of max_d the overhead is significant. Here is an example on
the american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum
-
 1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from
words;
   sum
-
 1386456
(1 row)

Time: 317,821 ms


 * There are many if (!multibyte_encoding) in levenshtein_internal().
  How about split the function into two funcs for single-byte chars and
  multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
  use multi-byte version if the overhead is small.

The multi-byte version was about 4 times slower. But I have
rewritten my CHAR_CMP macro as an inline function, and now it's only about
1.5 times slower.

In a database with multi-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
---+--
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
---+--
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway I think that the overhead is not ignorable. That's why I have split
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.


 * I prefer a struct rather than an array.  4 * m and 3 * m look like magic
  numbers for me. Could you name the entries with definition of a struct?
/*
 * For multibyte encoding we'll also store array of lengths of
 * characters and array with character offsets in first string
 * in order to avoid great number of pg_mblen calls.
 */
prev = (int *) palloc(4 * m * sizeof(int));

In this line of code the memory is allocated for 4 arrays: prev, curr,
offsets, and char_lens. So I have joined offsets and char_lens into a struct. But
I can't join prev and curr because of this trick:
temp = curr;
curr = prev;
prev = temp;
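
A sketch of the joined struct, matching the fields used in the later version
of the patch (quoted earlier in this archive):

typedef struct
{
    int length;     /* byte length of the character, from pg_mblen() */
    int offset;     /* byte offset of the character within the string */
} CharLengthAndOffset;

/* one palloc serves prev, curr, and the per-character metadata: */
prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m);
curr = prev + m;
lengths_and_offsets = (CharLengthAndOffset *) (prev + 2 * m);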

* There are some compiler warnings. Avoid them if possible.
 fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
 fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
 this function
 fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
 this function
 fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
 in this function
 fuzzystrmatch.c: In function ‘levenshtein_internal’:
 fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
 this function

Fixed.

* Coding style: Use if (m == 0) instead of if (!m) when the type
 of 'm' is an integer.

Fixed.


 * Need to fix the caution in docs.
 http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
 | Caution: At present, fuzzystrmatch does not work well with
 | multi-byte encodings (such as UTF-8).
 but now levenshtein supports multi-byte encoding!  We should
 mention which function supports mbchars not to confuse users.

I've updated this notice. Also I've added documentation for
the levenshtein_less_equal function.

* (Not an issue for the patch, but...)
  Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
  PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
  Unpacking versions make the core a bit faster.

 Fixed.

With best regards,
Alexander Korotkov.


fuzzystrmatch-0.4.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-07-11 Thread Itagaki Takahiro
Hi, I'm reviewing Multibyte charater set in levenshtein function patch.
https://commitfest.postgresql.org/action/patch_view?id=304

The main logic seems to be good, but I have some comments about
the coding style and refactoring.

* levenshtein_internal() and levenshtein_less_equal_internal() are very
  similar. Can you merge the code? We can always use less_equal_internal()
  if the overhead is ignorable. Did you compare them?

* There are many if (!multibyte_encoding) in levenshtein_internal().
  How about split the function into two funcs for single-byte chars and
  multi-byte chars? (ex. levenshtein_internal_mb() ) Or, we can always
  use multi-byte version if the overhead is small.

* I prefer a struct rather than an array.  4 * m and 3 * m look like magic
  numbers for me. Could you name the entries with definition of a struct?
/*
 * For multibyte encoding we'll also store array of lengths of
 * characters and array with character offsets in first string
 * in order to avoid great number of pg_mblen calls.
 */
prev = (int *) palloc(4 * m * sizeof(int));

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
in this function
fuzzystrmatch.c: In function ‘levenshtein_internal’:
fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
this function

* Coding style: Use if (m == 0) instead of if (!m) when the type
of 'm' is an integer.

* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encoding!  We should
mention which function supports mbchars not to confuse users.

* (Not an issue for the patch, but...)
  Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
  PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
  Unpacking versions make the core a bit faster.

-- 
Itagaki Takahiro



Re: [HACKERS] multibyte character set in levenshtein function

2010-06-07 Thread Alexander Korotkov
Hello Hackers!

I have extended my patch by introducing levenshtein_less_equal function.
This function have additional argument max_d and stops calculating when
distance exceeds max_d. With low values of max_d function works much faster
than original one.

An example of the original levenshtein function usage:

test=# select word, levenshtein(word, 'consistent') as dist from words where
levenshtein(word, 'consistent') <= 2 order by dist;
    word     | dist 
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2
(5 rows)

test=# explain analyze select word, levenshtein(word, 'consistent') as dist
from words where levenshtein(word, 'consistent') <= 2 order by dist;
  QUERY PLAN

---
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual
time=203.652..203.658 rows=5 loops=1)
   Sort Key: (levenshtein(word, 'consistent'::text))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual
time=19.019..203.601 rows=5 loops=1)
 Filter: (levenshtein(word, 'consistent'::text) <= 2)
 Total runtime: 203.723 ms
(6 rows)

An example of levenshtein_less_equal usage in this case:

test=# select word, levenshtein_less_equal(word, 'consistent', 2) as dist
from words where levenshtein_less_equal(word, 'consistent', 2) <= 2 order by
dist;
    word     | dist 
-------------+------
 consistent  |    0
 insistent   |    2
 consistency |    2
 coexistent  |    2
 consistence |    2

test=# explain analyze select word, levenshtein_less_equal(word,
'consistent', 2) as dist from words where levenshtein_less_equal(word,
'consistent', 2) <= 2 order by dist;
 QUERY PLAN

-
 Sort  (cost=2779.13..2830.38 rows=20502 width=8) (actual
time=42.198..42.203 rows=5 loops=1)
   Sort Key: (levenshtein_less_equal(word, 'consistent'::text, 2))
   Sort Method:  quicksort  Memory: 25kB
   ->  Seq Scan on words  (cost=0.00..1310.83 rows=20502 width=8) (actual
time=5.391..42.143 rows=5 loops=1)
 Filter: (levenshtein_less_equal(word, 'consistent'::text, 2) <= 2)
 Total runtime: 42.292 ms
(6 rows)

In the example above levenshtein_less_equal works about 5 times faster.

With best regards,
Alexander Korotkov.


fuzzystrmatch-0.3.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-05-13 Thread Alexander Korotkov
 On Thu, May 13, 2010 at 6:03 AM, Alvaro Herrera alvhe...@commandprompt.com
 wrote:

 Well, since it's only used in one place, why are you defining a macro at
 all?

In order to structure the code better. My question was about something else:
is memcmp a good choice for comparing very short sequences of bytes (from 1 to 4
bytes)?


Re: [HACKERS] multibyte character set in levenshtein function

2010-05-13 Thread Alexander Korotkov
On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote:

 On a quick look, I didn't like the way you separated the
 pg_database_encoding_max_length() > 1 cases.  There seems to be too
 much common code.  Can that be refactored a bit better?

I did a little refactoring in order to avoid some similar code.
I'm not quite sure about my CHAR_CMP macro. Is it a good idea?


fuzzystrmatch-0.2.diff.gz
Description: GNU Zip compressed data



[HACKERS] multibyte character set in levenshtein function

2010-05-12 Thread Alexander Korotkov
Hackers,

The current version of the levenshtein function in the fuzzystrmatch contrib
module doesn't work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-
   2
(1 row)

My patch makes this function work properly with multibyte character sets.

test=# select levenshtein('фыва','аыва');
 levenshtein
-
   1
(1 row)

It also avoids a text_to_cstring call.

Regards,
Alexander Korotkov.


fuzzystrmatch.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] multibyte character set in levenshtein function

2010-05-12 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of Mon May 10 11:35:02 -0400 2010:
 Hackers,
 
 The current version of the levenshtein function in the fuzzystrmatch contrib
 module doesn't work properly with multibyte character sets.

 My patch makes this function work properly with multibyte character sets.

Great.  Please add it to the next commitfest:
http://commitfest.postgresql.org

On a quick look, I didn't like the way you separated the
pg_database_encoding_max_length() > 1 cases.  There seems to be too
much common code.  Can that be refactored a bit better?


Re: [HACKERS] multibyte character set in levenshtein function

2010-05-12 Thread Alvaro Herrera
Alexander Korotkov wrote:
 On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera 
 alvhe...@alvh.no-ip.orgwrote:
 
  On a quick look, I didn't like the way you separated the
  pg_database_encoding_max_length() > 1 cases.  There seems to be too
  much common code.  Can that be refactored a bit better?
 
 I did a little refactoring in order to avoid some similar code.
 I'm not quite sure about my CHAR_CMP macro. Is it a good idea?

Well, since it's only used in one place, why are you defining a macro at
all?

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
