Re: [sqlite] Fuzzy Matching

Alberto Simões Mon, 07 Jul 2008 11:18:25 -0700

Hi, Folks.

While I do not have any expertise to compare levenshtein with other
approaches, my current approach for fuzzy matching is, taking a word
w, compute all possible near words with distance n. This is known to
generate a big number of near misses, but it is quicker to then search
for these words on the database, than to compare each entry in the
databse with the word being searched.


Well, not sure if I made it clear.

With the soundex option that DRH explained, it is possible to compute
before hand the soundex for each entry, and store it in a column.
Then, it should be easy to grab all words with the same soundex
quickly.

Cheers
ambs


On Mon, Jul 7, 2008 at 5:46 AM, Stephen Woodbridge
<[EMAIL PROTECTED]> wrote:
> OK, thanks for that. I had seen the references to the n-gram stuff and
> just started reading about them and not my head is ready to explode!
> Between n-grams, bit vectors, Bloom filters, perfect hashes, lots of
> academic papers spouting fancy equations and statistics, I'm not sure
> this is gelling into any course of action for me. While this technology
> seems very cool, I still get the sense that you have to scan each
> documents n-gram vector against the query n-gram vector to get a set of
> potential documents that might have what I want in them. Or am I missing
> something? What? While this type of search is probably faster then some
> alternatives, it does not lend itself to the fundamentals of RDBMS or
> indexed searches.
>
> I would appreciate any thoughts on how to proceed with this.
>
> Levenshtein edit distance is a good way to score the results of a fuzzy
> match. I have also used metaphone and double metaphone and soundex in
> the past for fuzzy searching.
>
> -Steve
>
> Harold Wood wrote:
>> I cant go into too much detail because of my current job, but for
>> fuzzy matching levenstien isnt very good, you need to try looking
>> into ngram matching techniques, it is absolutely awesome in reducing
>> over/under matches.
>>
>> Woody
>>
>> --- On Sat, 7/5/08, Stephen Woodbridge <[EMAIL PROTECTED]>
>> wrote:
>>
>> From: Stephen Woodbridge <[EMAIL PROTECTED]> Subject: Re:
>> [sqlite] Fuzzy Matching To: "General Discussion of SQLite Database"
>> <[email protected]> Date: Saturday, July 5, 2008, 11:24 PM
>>
>> Stephen Woodbridge wrote:
>>> I would be interested in having something like this also.
>>>
>>> What I don't understand in your approach is how you compute the
>>> (Levenstein) distance during a search. It seems like you have a
>>> fixed set of tokens from your document text and these are indexed.
>>> Then you have a query token the you want to compare to the index
>>> based on some fuzzy distance. Since every query can be different I
>>> think you have to compute the distance for every key in the index?
>>> that would require doing a full index scan.
>>>
>>> If there ware a function that you could run a token through that
>>> would given you that tokens "location" in some space then you could
>>>
>> generate a
>>> similar "location" for the query token and then use the rtree
>> and
>>> distance. I'm not aware of any such functions, but my expertise is
>> more
>>> in GIS the search searching.
>>
>> Hmmm, that was supposed to say text searching.
>>
>>> Thoughts?
>>>
>>> Best, -Steve
>>>
>>> Martin Pfeifle wrote:
>>>> Hi, I think there is nothing available except FTS. Doing a full
>>>> table scan and computing for each string the (Levenstein)
>>>> distance to the query object is too time consuming. So what I
>>>> would like to see is the implementation of a generic metric index
>>>> which needs as one parameter a metric distance function. Based on
>>>> such a distance function you could then do similarity search on
>>>> any objects , e.g. images, strings, etc. One possible index would
>>>> be the M-tree (which you can also organize relational as it was
>>>> done with the R*-tree). The idea is that you have a hierarchical
>>>> index and each node is represented by a database  object o and a
>>>> covering radius r reflecting the maximal distance of all objects
>>>> in that subtree to the object o. If you do a range query now, you
>>>> compute the distance of your query object to the object o. If
>>>> this distance minus the coverage radius r is bigger than your
>>>> query range you can prune that subtree. You can either implement
>>>> such a similarity module as an own extension similar toFTS or the
>>>> Spatial module, or integrate it into FTS and use it only for
>>>> strings. Personally, I need the second solution because I'd like
>>>> to do full and fuzzy text search. Are
>> there
>>>> any plans to implement something like this, if yes, I would like
>>>> to take part in such a development. . Best Martin
>>>>
>>>>
>>>>
>>>>
>>>> ----- Ursprüngliche Mail ---- Von: Alberto Simões
>>>> <[EMAIL PROTECTED]> An: General Discussion of SQLite Database
>>>> <[email protected]> Gesendet: Donnerstag, den 3. Juli
>> 2008,
>>>> 21:52:05 Uhr Betreff: [sqlite] Fuzzy Matching
>>>>
>>>> Hello
>>>>
>>>> Although I am quite certain that the answer is that SQLite does
>>>> not provide any mechanism to help me on this, it doesn't hurt to
>>>> ask.
>> Who
>>>> know if anybody have any suggestion.
>>>>
>>>> Basically, I am using SQLite for a dictionary, and I want to let
>>>> the user do fuzzy searches. OK, some simple Levenshtein distance
>>>> of one or two would do the trick, probably.
>>>>
>>>> I imagine that SQLite (given the lite), does not provide any kind
>>>> of nearmisses search. But probably, somebody here did anything
>>>> similar in any language?
>>>>
>>>> Cheers Alberto
>>> _______________________________________________ sqlite-users
>>> mailing list [email protected]
>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
>> _______________________________________________ sqlite-users mailing
>> list [email protected]
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>> _______________________________________________ sqlite-users mailing
>> list [email protected]
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
> _______________________________________________
> sqlite-users mailing list
> [email protected]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
Alberto Simões
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Fuzzy Matching

Reply via email to