Re: [HACKERS] Generalized edit function?

2011-02-26 Thread Josh Berkus
Fork,

 1.  Does anybody else care? I would love to see this in contrib, but if the
 chances are slim, then I would like to know that too.

That really depends on how well it works, and how much code it is.  It's
way too early for anyone to have a viewpoint on this.  For example, a
few years ago I'd have said that trigrams were hopelessly specialized
for mainstream PostgreSQL, but now they're not.

The path for this is to create an external project (on pgfoundry,
github, sourceforge, code.google.com, etc.) and then an Extension for
this.  Once it's production-quality, and if contrib is even relevant at
that point, you can propose it.

 3.  I will probably implement this for ascii characters -- if anyone has any
 thoughts on other encodings, please share.

Why would the implementation for ASCII be different from Unicode?
Surely any distance algorithm is encoding-neutral.

Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
isn't taken seriously.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generalized edit function?

2011-02-26 Thread Robert Haas
On Sat, Feb 26, 2011 at 4:19 PM, Josh Berkus j...@agliodbs.com wrote:
 Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
 isn't taken seriously.

Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
in contrib/fuzzystrmatch still is.  We had to work pretty hard to
avoid a massive performance loss when we made it multi-byte aware, and
IIRC there is still a pretty significant performance loss if any
multibyte characters are actually present.  But at least now it
doesn't return totally bogus answers.
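The fast path Robert mentions hinges on detecting that a string contains only single-byte characters.  This is not PostgreSQL's actual code, just a minimal sketch of that kind of check, assuming an ASCII-compatible encoding such as UTF-8 (the function name is mine):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when every byte of the string is
 * below 0x80.  In UTF-8 (and any ASCII-compatible encoding) that means
 * every character occupies exactly one byte, so the cheap x[i]
 * per-character indexing path is safe. */
static bool
all_single_byte(const char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((unsigned char) s[i] >= 0x80)
            return false;
    return true;
}
```

A caller would run this once over both input strings and fall back to the slower offset-table approach only when it returns false, so pure-ASCII inputs pay almost nothing even in a multi-byte database encoding.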

So I have some sympathy with the OP's desire not to burden himself
with the non-ASCII case if he doesn't need it for his application, but
I also agree with your point that we probably wouldn't accept code
into contrib that doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Generalized edit function?

2011-02-26 Thread fork
Robert Haas robertmhaas at gmail.com writes:

 
 On Sat, Feb 26, 2011 at 4:19 PM, Josh Berkus josh at agliodbs.com wrote:
  Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
  isn't taken seriously.
 
 Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
 in contrib/fuzzystrmatch still is.

I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts on
multi-byteness yet.  I will have to figure out the multibyte character work
once I get the basic algorithm working -- any thoughts on that?  Any pitfalls in
porting?

 So I have some sympathy with the OP's desire not to burden himself
 with the non-ASCII case if he doesn't need it for his application, but
 I also agree with your point that we probably wouldn't accept code
 into contrib that doesn't.

Good to know.  I will try to avoid backing myself into an ASCII corner.



Re: [HACKERS] Generalized edit function?

2011-02-26 Thread Robert Haas
On Sat, Feb 26, 2011 at 7:40 PM, fork forkandw...@gmail.com wrote:
 Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
 in contrib/fuzzystrmatch still is.

 I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts yet on
 multi-byteness so far.   I will have to figure out the multibyte character work
 once I get the basic algorithm working -- any thoughts on that?  Any pitfalls in
 porting?

The main thing with levenshtein() is that if you're working with
single byte characters then you can reference the i'th character as
x[i], whereas if you have multi-byte characters then you need to build
an offset table and look at length[i] bytes beginning at
x[offset[i]].  That turns out to be significantly more expensive.  As
initially proposed, the patch to add multi-byte awareness built this
lookup table for any multi-byte encoding and used the faster technique
for single-byte encodings, but that wasn't actually so hot, because
the most widely used encoding these days is probably UTF-8, which of
course is a multi-byte encoding.  What we ended up with is a fast-path
for the case where both strings contain only single-byte characters,
which will always be true in a single-byte encoding but might easily
also be true in a multi-byte encoding, especially for English
speakers.  I don't know if that's exactly right for what you're trying
to do - you'll probably need to try some different things and
benchmark.  I would however recommend that you look at the
master-branch implementation of levenshtein() rather than the old
9.0.x one, because it's significantly different, and forward-porting
your changes will probably be hard.
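The offset table Robert describes can be sketched as follows.  This is not the contrib/fuzzystrmatch code, just an illustrative version assuming valid UTF-8 input (real code must validate, and PostgreSQL uses its own encoding routines such as pg_mblen rather than hand-rolled byte tests):

```c
#include <stddef.h>

/* Hypothetical sketch: fill per-character byte offsets and lengths for
 * a UTF-8 string, so that character i can later be read as length[i]
 * bytes starting at s + offset[i].  Returns the character count.
 * The caller must size offset[] and length[] for the worst case
 * (one entry per byte).  Assumes well-formed UTF-8. */
static size_t
build_offsets(const char *s, size_t nbytes,
              size_t *offset, size_t *length)
{
    size_t n = 0;

    for (size_t i = 0; i < nbytes; )
    {
        unsigned char c = (unsigned char) s[i];

        /* Decode the sequence length from the UTF-8 lead byte. */
        size_t clen = (c < 0x80) ? 1 :
                      (c < 0xE0) ? 2 :
                      (c < 0xF0) ? 3 : 4;

        offset[n] = i;
        length[n] = clen;
        n++;
        i += clen;
    }
    return n;
}
```

The extra pass and the indirect lookups on every cell of the Levenshtein matrix are where the multi-byte slowdown comes from, which is why keeping a plain x[i] path for all-single-byte inputs pays off.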

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
