Re: [sqlite] Fuzzy Matching

Stephen Woodbridge Sun, 06 Jul 2008 20:46:20 -0700

OK, thanks for that. I had seen the references to the n-gram stuff and 
just started reading about them and not my head is ready to explode! 
Between n-grams, bit vectors, Bloom filters, perfect hashes, lots of 
academic papers spouting fancy equations and statistics, I'm not sure 
this is gelling into any course of action for me. While this technology 
seems very cool, I still get the sense that you have to scan each 
documents n-gram vector against the query n-gram vector to get a set of 
potential documents that might have what I want in them. Or am I missing 
something? What? While this type of search is probably faster then some 
alternatives, it does not lend itself to the fundamentals of RDBMS or 
indexed searches.


I would appreciate any thoughts on how to proceed with this.

Levenshtein edit distance is a good way to score the results of a fuzzy 
match. I have also used metaphone and double metaphone and soundex in 
the past for fuzzy searching.

-Steve

Harold Wood wrote:
> I cant go into too much detail because of my current job, but for
> fuzzy matching levenstien isnt very good, you need to try looking
> into ngram matching techniques, it is absolutely awesome in reducing
> over/under matches.
> 
> Woody
> 
> --- On Sat, 7/5/08, Stephen Woodbridge <[EMAIL PROTECTED]>
> wrote:
> 
> From: Stephen Woodbridge <[EMAIL PROTECTED]> Subject: Re:
> [sqlite] Fuzzy Matching To: "General Discussion of SQLite Database"
> <[email protected]> Date: Saturday, July 5, 2008, 11:24 PM
> 
> Stephen Woodbridge wrote:
>> I would be interested in having something like this also.
>> 
>> What I don't understand in your approach is how you compute the 
>> (Levenstein) distance during a search. It seems like you have a
>> fixed set of tokens from your document text and these are indexed.
>> Then you have a query token the you want to compare to the index
>> based on some fuzzy distance. Since every query can be different I
>> think you have to compute the distance for every key in the index?
>> that would require doing a full index scan.
>> 
>> If there ware a function that you could run a token through that
>> would given you that tokens "location" in some space then you could
>> 
> generate a
>> similar "location" for the query token and then use the rtree
> and
>> distance. I'm not aware of any such functions, but my expertise is
> more
>> in GIS the search searching.
> 
> Hmmm, that was supposed to say text searching.
> 
>> Thoughts?
>> 
>> Best, -Steve
>> 
>> Martin Pfeifle wrote:
>>> Hi, I think there is nothing available except FTS. Doing a full
>>> table scan and computing for each string the (Levenstein)
>>> distance to the query object is too time consuming. So what I
>>> would like to see is the implementation of a generic metric index
>>> which needs as one parameter a metric distance function. Based on
>>> such a distance function you could then do similarity search on
>>> any objects , e.g. images, strings, etc. One possible index would
>>> be the M-tree (which you can also organize relational as it was
>>> done with the R*-tree). The idea is that you have a hierarchical
>>> index and each node is represented by a database  object o and a
>>> covering radius r reflecting the maximal distance of all objects
>>> in that subtree to the object o. If you do a range query now, you
>>> compute the distance of your query object to the object o. If
>>> this distance minus the coverage radius r is bigger than your
>>> query range you can prune that subtree. You can either implement
>>> such a similarity module as an own extension similar toFTS or the
>>> Spatial module, or integrate it into FTS and use it only for
>>> strings. Personally, I need the second solution because I'd like
>>> to do full and fuzzy text search. Are
> there
>>> any plans to implement something like this, if yes, I would like
>>> to take part in such a development. . Best Martin
>>> 
>>> 
>>> 
>>> 
>>> ----- Ursprüngliche Mail ---- Von: Alberto Simões 
>>> <[EMAIL PROTECTED]> An: General Discussion of SQLite Database 
>>> <[email protected]> Gesendet: Donnerstag, den 3. Juli
> 2008,
>>> 21:52:05 Uhr Betreff: [sqlite] Fuzzy Matching
>>> 
>>> Hello
>>> 
>>> Although I am quite certain that the answer is that SQLite does
>>> not provide any mechanism to help me on this, it doesn't hurt to
>>> ask.
> Who
>>> know if anybody have any suggestion.
>>> 
>>> Basically, I am using SQLite for a dictionary, and I want to let
>>> the user do fuzzy searches. OK, some simple Levenshtein distance
>>> of one or two would do the trick, probably.
>>> 
>>> I imagine that SQLite (given the lite), does not provide any kind
>>> of nearmisses search. But probably, somebody here did anything
>>> similar in any language?
>>> 
>>> Cheers Alberto
>> _______________________________________________ sqlite-users
>> mailing list [email protected] 
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> 
> _______________________________________________ sqlite-users mailing
> list [email protected] 
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users 
> _______________________________________________ sqlite-users mailing
> list [email protected] 
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Fuzzy Matching

Reply via email to