Re: [sqlite] Fuzzy Matching

2008-07-07 Thread Alberto Simões
Hi, Folks.

While I do not have any expertise to compare levenshtein with other
approaches, my current approach for fuzzy matching is, taking a word
w, compute all possible near words with distance n. This is known to
generate a big number of near misses, but it is quicker to then search
for these words on the database, than to compare each entry in the
databse with the word being searched.

Well, not sure if I made it clear.

With the soundex option that DRH explained, it is possible to compute
before hand the soundex for each entry, and store it in a column.
Then, it should be easy to grab all words with the same soundex
quickly.

Cheers
ambs


On Mon, Jul 7, 2008 at 5:46 AM, Stephen Woodbridge
<[EMAIL PROTECTED]> wrote:
> OK, thanks for that. I had seen the references to the n-gram stuff and
> just started reading about them and not my head is ready to explode!
> Between n-grams, bit vectors, Bloom filters, perfect hashes, lots of
> academic papers spouting fancy equations and statistics, I'm not sure
> this is gelling into any course of action for me. While this technology
> seems very cool, I still get the sense that you have to scan each
> documents n-gram vector against the query n-gram vector to get a set of
> potential documents that might have what I want in them. Or am I missing
> something? What? While this type of search is probably faster then some
> alternatives, it does not lend itself to the fundamentals of RDBMS or
> indexed searches.
>
> I would appreciate any thoughts on how to proceed with this.
>
> Levenshtein edit distance is a good way to score the results of a fuzzy
> match. I have also used metaphone and double metaphone and soundex in
> the past for fuzzy searching.
>
> -Steve
>
> Harold Wood wrote:
>> I cant go into too much detail because of my current job, but for
>> fuzzy matching levenstien isnt very good, you need to try looking
>> into ngram matching techniques, it is absolutely awesome in reducing
>> over/under matches.
>>
>> Woody
>>
>> --- On Sat, 7/5/08, Stephen Woodbridge <[EMAIL PROTECTED]>
>> wrote:
>>
>> From: Stephen Woodbridge <[EMAIL PROTECTED]> Subject: Re:
>> [sqlite] Fuzzy Matching To: "General Discussion of SQLite Database"
>> <sqlite-users@sqlite.org> Date: Saturday, July 5, 2008, 11:24 PM
>>
>> Stephen Woodbridge wrote:
>>> I would be interested in having something like this also.
>>>
>>> What I don't understand in your approach is how you compute the
>>> (Levenstein) distance during a search. It seems like you have a
>>> fixed set of tokens from your document text and these are indexed.
>>> Then you have a query token the you want to compare to the index
>>> based on some fuzzy distance. Since every query can be different I
>>> think you have to compute the distance for every key in the index?
>>> that would require doing a full index scan.
>>>
>>> If there ware a function that you could run a token through that
>>> would given you that tokens "location" in some space then you could
>>>
>> generate a
>>> similar "location" for the query token and then use the rtree
>> and
>>> distance. I'm not aware of any such functions, but my expertise is
>> more
>>> in GIS the search searching.
>>
>> Hmmm, that was supposed to say text searching.
>>
>>> Thoughts?
>>>
>>> Best, -Steve
>>>
>>> Martin Pfeifle wrote:
>>>> Hi, I think there is nothing available except FTS. Doing a full
>>>> table scan and computing for each string the (Levenstein)
>>>> distance to the query object is too time consuming. So what I
>>>> would like to see is the implementation of a generic metric index
>>>> which needs as one parameter a metric distance function. Based on
>>>> such a distance function you could then do similarity search on
>>>> any objects , e.g. images, strings, etc. One possible index would
>>>> be the M-tree (which you can also organize relational as it was
>>>> done with the R*-tree). The idea is that you have a hierarchical
>>>> index and each node is represented by a database  object o and a
>>>> covering radius r reflecting the maximal distance of all objects
>>>> in that subtree to the object o. If you do a range query now, you
>>>> compute the distance of your query object to the object o. If
>>>> this distance minus the coverage radius r is bigger than your
>>>> query range you can prune that subtree. You can either implement
>>>> such a similarity mod

Re: [sqlite] Fuzzy Matching

2008-07-06 Thread Stephen Woodbridge
OK, thanks for that. I had seen the references to the n-gram stuff and 
just started reading about them and not my head is ready to explode! 
Between n-grams, bit vectors, Bloom filters, perfect hashes, lots of 
academic papers spouting fancy equations and statistics, I'm not sure 
this is gelling into any course of action for me. While this technology 
seems very cool, I still get the sense that you have to scan each 
documents n-gram vector against the query n-gram vector to get a set of 
potential documents that might have what I want in them. Or am I missing 
something? What? While this type of search is probably faster then some 
alternatives, it does not lend itself to the fundamentals of RDBMS or 
indexed searches.

I would appreciate any thoughts on how to proceed with this.

Levenshtein edit distance is a good way to score the results of a fuzzy 
match. I have also used metaphone and double metaphone and soundex in 
the past for fuzzy searching.

-Steve

Harold Wood wrote:
> I cant go into too much detail because of my current job, but for
> fuzzy matching levenstien isnt very good, you need to try looking
> into ngram matching techniques, it is absolutely awesome in reducing
> over/under matches.
> 
> Woody
> 
> --- On Sat, 7/5/08, Stephen Woodbridge <[EMAIL PROTECTED]>
> wrote:
> 
> From: Stephen Woodbridge <[EMAIL PROTECTED]> Subject: Re:
> [sqlite] Fuzzy Matching To: "General Discussion of SQLite Database"
> <sqlite-users@sqlite.org> Date: Saturday, July 5, 2008, 11:24 PM
> 
> Stephen Woodbridge wrote:
>> I would be interested in having something like this also.
>> 
>> What I don't understand in your approach is how you compute the 
>> (Levenstein) distance during a search. It seems like you have a
>> fixed set of tokens from your document text and these are indexed.
>> Then you have a query token the you want to compare to the index
>> based on some fuzzy distance. Since every query can be different I
>> think you have to compute the distance for every key in the index?
>> that would require doing a full index scan.
>> 
>> If there ware a function that you could run a token through that
>> would given you that tokens "location" in some space then you could
>> 
> generate a
>> similar "location" for the query token and then use the rtree
> and
>> distance. I'm not aware of any such functions, but my expertise is
> more
>> in GIS the search searching.
> 
> Hmmm, that was supposed to say text searching.
> 
>> Thoughts?
>> 
>> Best, -Steve
>> 
>> Martin Pfeifle wrote:
>>> Hi, I think there is nothing available except FTS. Doing a full
>>> table scan and computing for each string the (Levenstein)
>>> distance to the query object is too time consuming. So what I
>>> would like to see is the implementation of a generic metric index
>>> which needs as one parameter a metric distance function. Based on
>>> such a distance function you could then do similarity search on
>>> any objects , e.g. images, strings, etc. One possible index would
>>> be the M-tree (which you can also organize relational as it was
>>> done with the R*-tree). The idea is that you have a hierarchical
>>> index and each node is represented by a database  object o and a
>>> covering radius r reflecting the maximal distance of all objects
>>> in that subtree to the object o. If you do a range query now, you
>>> compute the distance of your query object to the object o. If
>>> this distance minus the coverage radius r is bigger than your
>>> query range you can prune that subtree. You can either implement
>>> such a similarity module as an own extension similar toFTS or the
>>> Spatial module, or integrate it into FTS and use it only for
>>> strings. Personally, I need the second solution because I'd like
>>> to do full and fuzzy text search. Are
> there
>>> any plans to implement something like this, if yes, I would like
>>> to take part in such a development. . Best Martin
>>> 
>>> 
>>> 
>>> 
>>> - Ursprüngliche Mail  Von: Alberto Simões 
>>> <[EMAIL PROTECTED]> An: General Discussion of SQLite Database 
>>> <sqlite-users@sqlite.org> Gesendet: Donnerstag, den 3. Juli
> 2008,
>>> 21:52:05 Uhr Betreff: [sqlite] Fuzzy Matching
>>> 
>>> Hello
>>> 
>>> Although I am quite certain that the answer is that SQLite does
>>> not provide any mechanism to help me on this, it doesn't hurt to
>>> ask.
> Who
>>> know if a

Re: [sqlite] Fuzzy Matching

2008-07-05 Thread Harold Wood
I cant go into too much detail because of my current job, but for fuzzy 
matching levenstien isnt very good, you need to try looking into ngram matching 
techniques, it is absolutely awesome in reducing over/under matches.
 
Woody

--- On Sat, 7/5/08, Stephen Woodbridge <[EMAIL PROTECTED]> wrote:

From: Stephen Woodbridge <[EMAIL PROTECTED]>
Subject: Re: [sqlite] Fuzzy Matching
To: "General Discussion of SQLite Database" <sqlite-users@sqlite.org>
Date: Saturday, July 5, 2008, 11:24 PM

Stephen Woodbridge wrote:
> I would be interested in having something like this also.
> 
> What I don't understand in your approach is how you compute the 
> (Levenstein) distance during a search. It seems like you have a fixed 
> set of tokens from your document text and these are indexed. Then you 
> have a query token the you want to compare to the index based on some 
> fuzzy distance. Since every query can be different I think you have to 
> compute the distance for every key in the index? that would require 
> doing a full index scan.
> 
> If there ware a function that you could run a token through that would 
> given you that tokens "location" in some space then you could
generate a 
> similar "location" for the query token and then use the rtree
and 
> distance. I'm not aware of any such functions, but my expertise is
more 
> in GIS the search searching.

Hmmm, that was supposed to say text searching.

> Thoughts?
> 
> Best,
>-Steve
> 
> Martin Pfeifle wrote:
>> Hi, I think there is nothing available except FTS. Doing a full table
>> scan and computing for each string the (Levenstein) distance to the
>> query object is too time consuming. So what I would like to see is
>> the implementation of a generic metric index which needs as one
>> parameter a metric distance function. Based on such a distance
>> function you could then do similarity search on any objects , e.g.
>> images, strings, etc. One possible index would be the M-tree (which
>> you can also organize relational as it was done with the R*-tree).
>> The idea is that you have a hierarchical index and each node is
>> represented by a database  object o and a covering radius r
>> reflecting the maximal distance of all objects in that subtree to the
>> object o. If you do a range query now, you compute the distance of
>> your query object to the object o. If this distance minus the
>> coverage radius r is bigger than your query range you can prune that
>> subtree. You can either implement such a similarity module as an own
>> extension similar toFTS or the Spatial module, or integrate it into
>> FTS and use it only for strings. Personally, I need the second
>> solution because I'd like to do full and fuzzy text search. Are
there
>> any plans to implement something like this, if yes, I would like to
>> take part in such a development. . Best Martin
>>
>>
>>
>>
>> - Ursprüngliche Mail  Von: Alberto Simões
>> <[EMAIL PROTECTED]> An: General Discussion of SQLite Database
>> <sqlite-users@sqlite.org> Gesendet: Donnerstag, den 3. Juli
2008,
>> 21:52:05 Uhr Betreff: [sqlite] Fuzzy Matching
>>
>> Hello
>>
>> Although I am quite certain that the answer is that SQLite does not 
>> provide any mechanism to help me on this, it doesn't hurt to ask.
Who
>>  know if anybody have any suggestion.
>>
>> Basically, I am using SQLite for a dictionary, and I want to let the 
>> user do fuzzy searches. OK, some simple Levenshtein distance of one
>> or two would do the trick, probably.
>>
>> I imagine that SQLite (given the lite), does not provide any kind of 
>> nearmisses search. But probably, somebody here did anything similar
>> in any language?
>>
>> Cheers Alberto
> 
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Fuzzy Matching

2008-07-05 Thread Stephen Woodbridge
Stephen Woodbridge wrote:
> I would be interested in having something like this also.
> 
> What I don't understand in your approach is how you compute the 
> (Levenstein) distance during a search. It seems like you have a fixed 
> set of tokens from your document text and these are indexed. Then you 
> have a query token the you want to compare to the index based on some 
> fuzzy distance. Since every query can be different I think you have to 
> compute the distance for every key in the index? that would require 
> doing a full index scan.
> 
> If there ware a function that you could run a token through that would 
> given you that tokens "location" in some space then you could generate a 
> similar "location" for the query token and then use the rtree and 
> distance. I'm not aware of any such functions, but my expertise is more 
> in GIS the search searching.

Hmmm, that was supposed to say text searching.

> Thoughts?
> 
> Best,
>-Steve
> 
> Martin Pfeifle wrote:
>> Hi, I think there is nothing available except FTS. Doing a full table
>> scan and computing for each string the (Levenstein) distance to the
>> query object is too time consuming. So what I would like to see is
>> the implementation of a generic metric index which needs as one
>> parameter a metric distance function. Based on such a distance
>> function you could then do similarity search on any objects , e.g.
>> images, strings, etc. One possible index would be the M-tree (which
>> you can also organize relational as it was done with the R*-tree).
>> The idea is that you have a hierarchical index and each node is
>> represented by a database  object o and a covering radius r
>> reflecting the maximal distance of all objects in that subtree to the
>> object o. If you do a range query now, you compute the distance of
>> your query object to the object o. If this distance minus the
>> coverage radius r is bigger than your query range you can prune that
>> subtree. You can either implement such a similarity module as an own
>> extension similar toFTS or the Spatial module, or integrate it into
>> FTS and use it only for strings. Personally, I need the second
>> solution because I'd like to do full and fuzzy text search. Are there
>> any plans to implement something like this, if yes, I would like to
>> take part in such a development. . Best Martin
>>
>>
>>
>>
>> - Ursprüngliche Mail  Von: Alberto Simões
>> <[EMAIL PROTECTED]> An: General Discussion of SQLite Database
>> <sqlite-users@sqlite.org> Gesendet: Donnerstag, den 3. Juli 2008,
>> 21:52:05 Uhr Betreff: [sqlite] Fuzzy Matching
>>
>> Hello
>>
>> Although I am quite certain that the answer is that SQLite does not 
>> provide any mechanism to help me on this, it doesn't hurt to ask. Who
>>  know if anybody have any suggestion.
>>
>> Basically, I am using SQLite for a dictionary, and I want to let the 
>> user do fuzzy searches. OK, some simple Levenshtein distance of one
>> or two would do the trick, probably.
>>
>> I imagine that SQLite (given the lite), does not provide any kind of 
>> nearmisses search. But probably, somebody here did anything similar
>> in any language?
>>
>> Cheers Alberto
> 
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Fuzzy Matching

2008-07-05 Thread Stephen Woodbridge
I would be interested in having something like this also.

What I don't understand in your approach is how you compute the 
(Levenstein) distance during a search. It seems like you have a fixed 
set of tokens from your document text and these are indexed. Then you 
have a query token the you want to compare to the index based on some 
fuzzy distance. Since every query can be different I think you have to 
compute the distance for every key in the index? that would require 
doing a full index scan.

If there ware a function that you could run a token through that would 
given you that tokens "location" in some space then you could generate a 
similar "location" for the query token and then use the rtree and 
distance. I'm not aware of any such functions, but my expertise is more 
in GIS the search searching.

Thoughts?

Best,
   -Steve

Martin Pfeifle wrote:
> Hi, I think there is nothing available except FTS. Doing a full table
> scan and computing for each string the (Levenstein) distance to the
> query object is too time consuming. So what I would like to see is
> the implementation of a generic metric index which needs as one
> parameter a metric distance function. Based on such a distance
> function you could then do similarity search on any objects , e.g.
> images, strings, etc. One possible index would be the M-tree (which
> you can also organize relational as it was done with the R*-tree).
> The idea is that you have a hierarchical index and each node is
> represented by a database  object o and a covering radius r
> reflecting the maximal distance of all objects in that subtree to the
> object o. If you do a range query now, you compute the distance of
> your query object to the object o. If this distance minus the
> coverage radius r is bigger than your query range you can prune that
> subtree. You can either implement such a similarity module as an own
> extension similar toFTS or the Spatial module, or integrate it into
> FTS and use it only for strings. Personally, I need the second
> solution because I'd like to do full and fuzzy text search. Are there
> any plans to implement something like this, if yes, I would like to
> take part in such a development. . Best Martin
> 
> 
> 
> 
> - Ursprüngliche Mail  Von: Alberto Simões
> <[EMAIL PROTECTED]> An: General Discussion of SQLite Database
> <sqlite-users@sqlite.org> Gesendet: Donnerstag, den 3. Juli 2008,
> 21:52:05 Uhr Betreff: [sqlite] Fuzzy Matching
> 
> Hello
> 
> Although I am quite certain that the answer is that SQLite does not 
> provide any mechanism to help me on this, it doesn't hurt to ask. Who
>  know if anybody have any suggestion.
> 
> Basically, I am using SQLite for a dictionary, and I want to let the 
> user do fuzzy searches. OK, some simple Levenshtein distance of one
> or two would do the trick, probably.
> 
> I imagine that SQLite (given the lite), does not provide any kind of 
> nearmisses search. But probably, somebody here did anything similar
> in any language?
> 
> Cheers Alberto

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Fuzzy Matching

2008-07-04 Thread D. Richard Hipp

On Jul 3, 2008, at 4:09 PM, John Stanton wrote:

> I believe Sqlite implemens Soundex as standard.  Thet might work for  
> you.

If you compile with -DSQLITE_SOUNDEX=1 then there is an SQL function  
named soundex() built-in.  That function takes a single argument and  
returns the soundex encoding of the text representation of that  
argument.  The soundex() function is undocumented and is not  
officially supported, but it does seem to work.

>
>
> Alberto Simões wrote:
>> Hello
>>
>> Although I am quite certain that the answer is that SQLite does not
>> provide any mechanism to help me on this, it doesn't hurt to ask. Who
>> know if anybody have any suggestion.
>>
>> Basically, I am using SQLite for a dictionary, and I want to let the
>> user do fuzzy searches. OK, some simple Levenshtein distance of one  
>> or
>> two would do the trick, probably.
>>
>> I imagine that SQLite (given the lite), does not provide any kind of
>> nearmisses search.
>> But probably, somebody here did anything similar in any language?
>>
>> Cheers
>> Alberto
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>

D. Richard Hipp
[EMAIL PROTECTED]



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Fuzzy Matching

2008-07-03 Thread Martin Pfeifle
Hi, I think there is nothing available except FTS.
Doing a full table scan and computing for each string the (Levenstein) distance
to the query object is too time consuming. So what I would like to see is the 
implementation
of a generic metric index which needs as one parameter a metric distance 
function. Based on such a 
distance function you could then do similarity search on any objects , e.g. 
images, strings, etc.
One possible index would be the M-tree (which you can also organize relational 
as it was done
with the R*-tree). The idea is that you have a hierarchical index and each node 
is represented by
a database  object o and a covering radius r reflecting the maximal distance of 
all objects in that subtree to
the object o. If you do a range query now, you compute the distance of your 
query object to the
 object o. If this distance minus the coverage radius r is bigger than your 
query range
you can prune that subtree.
You can either implement such a similarity module as an own extension similar 
toFTS or the Spatial module,
or integrate it into FTS and use it only for strings.
Personally, I need the second solution because I'd like to do full and fuzzy 
text search.
Are there any plans to implement something like this, if yes, I would like to 
take part in such a
development. 
.
Best Martin
 
 


- Ursprüngliche Mail 
Von: Alberto Simões <[EMAIL PROTECTED]>
An: General Discussion of SQLite Database <sqlite-users@sqlite.org>
Gesendet: Donnerstag, den 3. Juli 2008, 21:52:05 Uhr
Betreff: [sqlite] Fuzzy Matching

Hello

Although I am quite certain that the answer is that SQLite does not
provide any mechanism to help me on this, it doesn't hurt to ask. Who
know if anybody have any suggestion.

Basically, I am using SQLite for a dictionary, and I want to let the
user do fuzzy searches. OK, some simple Levenshtein distance of one or
two would do the trick, probably.

I imagine that SQLite (given the lite), does not provide any kind of
nearmisses search.
But probably, somebody here did anything similar in any language?

Cheers
Alberto
-- 
Alberto Simões
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users



  __
Gesendet von Yahoo! Mail.
Dem pfiffigeren Posteingang.
http://de.overview.mail.yahoo.com
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Fuzzy Matching

2008-07-03 Thread John Stanton
I believe Sqlite implemens Soundex as standard.  Thet might work for you.

Alberto Simões wrote:
> Hello
> 
> Although I am quite certain that the answer is that SQLite does not
> provide any mechanism to help me on this, it doesn't hurt to ask. Who
> know if anybody have any suggestion.
> 
> Basically, I am using SQLite for a dictionary, and I want to let the
> user do fuzzy searches. OK, some simple Levenshtein distance of one or
> two would do the trick, probably.
> 
> I imagine that SQLite (given the lite), does not provide any kind of
> nearmisses search.
> But probably, somebody here did anything similar in any language?
> 
> Cheers
> Alberto

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Fuzzy Matching

2008-07-03 Thread Alberto Simões
Hello

Although I am quite certain that the answer is that SQLite does not
provide any mechanism to help me on this, it doesn't hurt to ask. Who
know if anybody have any suggestion.

Basically, I am using SQLite for a dictionary, and I want to let the
user do fuzzy searches. OK, some simple Levenshtein distance of one or
two would do the trick, probably.

I imagine that SQLite (given the lite), does not provide any kind of
nearmisses search.
But probably, somebody here did anything similar in any language?

Cheers
Alberto
-- 
Alberto Simões
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users