Re: [sqlite] Near misses

2009-07-01 Thread Alberto Simões
Hello

2009/6/26 Alberto Simões :
> I am trying to find words in a dictionary stored in sqlite, and trying
> a near miss approach.
> For that I tried an algorithm to create patterns corresponding to
> Levenshtein distance of 1 (edit distance of 1).
> That means, one adition, one remotion or one substitution.
>
> For that, my script receives a word (say, 'car') and generated all
> possible additions and remotions, and substitutions:
>
> Additions: _car c_ar ca_r car_
> Substitutions: _ar c_r ca_
> remotions: ar cr ca
>
> Then, the script constructs an SQL query:
>
> SELECT DISTINCT(word) FROM dict WHERE word = "ar" OR word = "ca" OR
> word LIKE "_car" OR word LIKE "c_r" OR word = "cr" OR word LIKE "_ar"
> OR word LIKE "ca_r" OR word LIKE "c_ar" OR word LIKE "ca_" OR word
> LIKE "car_";
>
> And this SQL quer works... but not as quickly as I need (specially
> because the speed is proportional to the word size).

My current solution is to make all combinations of words: having a
list of letters, cycle them and substitute the underscore by that
letter.
The resulting list is being searched with   SELECT word FROM dict
WHERE word IN ('worda','wordb',wordc') and so on.
While for big words this list can be big (it gets above 1000 easily)
the query executes very fast.

Hope this might be helpful for somebody.
Cheers
Alberto
-- 
Alberto Simões
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-28 Thread Eric Bohlman
Simon Slavin wrote:
> On 26 Jun 2009, at 12:25pm, Alberto Simões wrote:
> 
>> one adition, one remotion or one substitution
> 
> I am always amazed at how well people use English.  For your word  
> 'remotion' you probably mean 'removal' or 'omission'.  You have joined  
> the two possibilities together !

Although Alberto has explained the etymology of the term, in general the 
condensation of two or more words into one is called a "portmanteau." My 
favorite portmanteau arose when about 30 years ago a co-worker reported 
that software problems on an embedded device were caused by two routines 
"interfecting with each other." Interacting, interfering, affecting, 
infecting and probably more, all packed with a remarkable economy of 
expression.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-28 Thread John Machin
On 27/06/2009 7:00 AM, Jean-Christophe Deschamps wrote:
> At 13:25 26/06/2009, you wrote:
> ´¯¯¯
>> I am trying to find words in a dictionary stored in sqlite, and trying
>> a near miss approach.
>> For that I tried an algorithm to create patterns corresponding to
>> Levenshtein distance of 1 (edit distance of 1).
>> That means, one adition, one remotion or one substitution.
>>
>> Any hint on how to speed up this thing?
> `---
> 
> Hi,
> 
> I'm currently finishing an C extension offering, among other functions, 
> a "TYPOS" scalar operator which is meant to perform just that, and a 
> bit more.

There's a strong presumption that it doesn't handle CJK text, but what 
about alphabets other than Latin-based e.g. Arabic, Cyrillic, Greek, 
Hebrew, ...?

> Internally, it applies a Unicode fold() function,

What does fold() do? Strip off accents/umlauts/etc?

> a Unicode lower() 

upper() might be more suitable; consider the German eszett (U+00DF).

> function and then computes the Damerau-Levenshtein distance between the 
> strings.  It returns the number of insertions, omissions, change and 
> transposition (of adjacent letters only).

Consider an additional API which returns a scaled similarity score e.g 
1.0 - float(distance) / max(length(string1), length(string2))

> If the reference string is 'abcdef', it will return 1 (one typo) for
> 'abdef' missing c
> 'abcudef'   u inserted
> 'abzef' c changed into z
> 'abdcef'c & d exchanged
> 
> It will also accept a trailing '%' in string2 acting as in LIKE.
> 
> You can use it this way:
> 
>select * from t where typos(col, 'levencht%') <= 2;
> 
> or this way
> 
>select typos(str1, str2)
> 
> The code currently makes use of a couple of Win32 functions, which 
> should have Un*x equivalent.  It runs at really decent speed even if I 
> didn't fight for optimization.  It will obviously outperform any SQL 
> solution by a large factor.

Does it use the icu library? What is the memory footprint?

Cheers,
John
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-28 Thread Alberto Simões
On Fri, Jun 26, 2009 at 10:00 PM, Jean-Christophe
Deschamps wrote:
> Hi,
>
> I'm currently finishing an C extension offering, among other functions,
> a "TYPOS" scalar operator which is meant to perform just that, and a
> bit more.
>
> Internally, it applies a Unicode fold() function, a Unicode lower()
> function and then computes the Damerau-Levenshtein distance between the
> strings.  It returns the number of insertions, omissions, change and
> transposition (of adjacent letters only).
>
> If the reference string is 'abcdef', it will return 1 (one typo) for
> 'abdef'     missing c
> 'abcudef'   u inserted
> 'abzef'     c changed into z
> 'abdcef'    c & d exchanged
>
> It will also accept a trailing '%' in string2 acting as in LIKE.
>
> You can use it this way:
>
>   select * from t where typos(col, 'levencht%') <= 2;
>
> or this way
>
>   select typos(str1, str2)
>
> The code currently makes use of a couple of Win32 functions, which
> should have Un*x equivalent.  It runs at really decent speed even if I
> didn't fight for optimization.  It will obviously outperform any SQL
> solution by a large factor.
>
> I can't promise a very clean version tomorrow but just mail if you're
> interested in the C source. You could tailor it to your precise needs
> easily.

I can't help and test it in the next few days. But I would be happy to
test and give some results about it
Cheers

-- 
Alberto Simões
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-26 Thread Igor Tandetnik
Alberto Simoes wrote:
> On Fri, Jun 26, 2009 at 3:00 PM, Igor Tandetnik
> wrote:
>> Alberto Simoes wrote:
>>> SELECT DISTINCT(word) FROM dict WHERE word = "ar" OR word = "ca" OR
>>> word LIKE "_car" OR word LIKE "c_r" OR word = "cr" OR word LIKE
>>> "_ar" OR word LIKE "ca_r" OR word LIKE "c_ar" OR word LIKE "ca_" OR
>>> word LIKE "car_";
>>
>> I'd try writing a custom function that figures out whether two words
>> are "close enough" (most of the time, you should be able to declare a
>> negative by looking at just two first characters), then do
>>
>> select word from dict where closeEnough(word, 'car');
>
> Hmms, need to check how to do that. But that would mean call the
> function to all words in the database (110K atm).

Well, your current statement evaluates a complicated condition against 
every word in the database. I don't quite see how you can avoid checking 
every word - you can only try and make the check itself faster.

Igor Tandetnik 



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-26 Thread Jean-Christophe Deschamps
At 13:25 26/06/2009, you wrote:
´¯¯¯
>I am trying to find words in a dictionary stored in sqlite, and trying
>a near miss approach.
>For that I tried an algorithm to create patterns corresponding to
>Levenshtein distance of 1 (edit distance of 1).
>That means, one adition, one remotion or one substitution.
>
>Any hint on how to speed up this thing?
`---

Hi,

I'm currently finishing an C extension offering, among other functions, 
a "TYPOS" scalar operator which is meant to perform just that, and a 
bit more.

Internally, it applies a Unicode fold() function, a Unicode lower() 
function and then computes the Damerau-Levenshtein distance between the 
strings.  It returns the number of insertions, omissions, change and 
transposition (of adjacent letters only).

If the reference string is 'abcdef', it will return 1 (one typo) for
'abdef' missing c
'abcudef'   u inserted
'abzef' c changed into z
'abdcef'c & d exchanged

It will also accept a trailing '%' in string2 acting as in LIKE.

You can use it this way:

   select * from t where typos(col, 'levencht%') <= 2;

or this way

   select typos(str1, str2)

The code currently makes use of a couple of Win32 functions, which 
should have Un*x equivalent.  It runs at really decent speed even if I 
didn't fight for optimization.  It will obviously outperform any SQL 
solution by a large factor.

I can't promise a very clean version tomorrow but just mail if you're 
interested in the C source. You could tailor it to your precise needs 
easily.

Hope it can help.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-26 Thread Alberto Simões
On Fri, Jun 26, 2009 at 3:43 PM, Simon
Slavin wrote:
>
> On 26 Jun 2009, at 12:25pm, Alberto Simões wrote:
>
>> one adition, one remotion or one substitution
>
> I am always amazed at how well people use English.  For your word
> 'remotion' you probably mean 'removal' or 'omission'.  You have joined
> the two possibilities together !

Probably I just tried a word similar to the Portuguese word: remoção ;)

> You could write a program to prepare another table in the same
> database with your near-misses in.  In other words, to take each word
> in the dictionary (like 'car') and put entries in this other table for
> each near miss you wish to accept:

Yep. That is one of my current options.
Was just wondering (and thus my mail) about any optimization I could
do in my query.

Thanks ;)
Alberto
-- 
Alberto Simões
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-26 Thread Alberto Simões
Hello

On Fri, Jun 26, 2009 at 3:00 PM, Igor Tandetnik wrote:
> Alberto Simoes wrote:
>> For that, my script receives a word (say, 'car') and generated all
>> possible additions and remotions, and substitutions:
>>
>> Additions: _car c_ar ca_r car_
>> Substitutions: _ar c_r ca_
>> remotions: ar cr ca
>>
>> Then, the script constructs an SQL query:
>>
>> SELECT DISTINCT(word) FROM dict WHERE word = "ar" OR word = "ca" OR
>> word LIKE "_car" OR word LIKE "c_r" OR word = "cr" OR word LIKE "_ar"
>> OR word LIKE "ca_r" OR word LIKE "c_ar" OR word LIKE "ca_" OR word
>> LIKE "car_";
>>
>> And this SQL quer works... but not as quickly as I need (specially
>> because the speed is proportional to the word size).
>
> I'd try writing a custom function that figures out whether two words are
> "close enough" (most of the time, you should be able to declare a
> negative by looking at just two first characters), then do
>
> select word from dict where closeEnough(word, 'car');

Hmms, need to check how to do that. But that would mean call the
function to all words in the database (110K atm).

> I also don't see why you need DISTINCT. Do you have duplicate words in
> dict?

Yes, I have. Forgot to explain ;)

Thanks



-- 
Alberto Simões
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-26 Thread Simon Slavin

On 26 Jun 2009, at 12:25pm, Alberto Simões wrote:

> one adition, one remotion or one substitution

I am always amazed at how well people use English.  For your word  
'remotion' you probably mean 'removal' or 'omission'.  You have joined  
the two possibilities together !

> Then, the script constructs an SQL query:
>
> SELECT DISTINCT(word) FROM dict WHERE word = "ar" OR word = "ca" OR
> word LIKE "_car" OR word LIKE "c_r" OR word = "cr" OR word LIKE "_ar"
> OR word LIKE "ca_r" OR word LIKE "c_ar" OR word LIKE "ca_" OR word
> LIKE "car_";
>
> And this SQL quer works... but not as quickly as I need (specially
> because the speed is proportional to the word size).

You could write a program to prepare another table in the same  
database with your near-misses in.  In other words, to take each word  
in the dictionary (like 'car') and put entries in this other table for  
each near miss you wish to accept:

nearMissrealWord

car car
ca  car
cr  car
ar  car
ca_ car
c_r car
_ar car
_carcar
c_arcar
ca_rcar
car_car
cat cat
ca  cat
ct  cat
at  cat
ca_ cat
c_t cat
_at cat
_catcat
c_atcat
ca_tcat
cat_cat

Then, in your search phase you just consult the near-miss table

SELECT realWord FROM nearMisses WHERE [whatever] LIKE  
nearMisses.nearMiss;

and find all the applicable entries: a single lookup against one index  
should be extremely fast.  Look up the word 'ca' and you get the both  
'car' and 'cat' realWords.  You could even include a JOIN to find the  
entries in your dict table too.

It should be easy to write software which goes through every  
permutation of missing letter, extra letter, etc..  It will lead to  
one very big table, but it will give you instant lookup.  You can  
shrink the table by using the LIKE operator both ways around, at the  
penalty of doubling the time taken.  The choice of whether to bother  
consulting the nearMiss table if the user typed a real word to start  
with is up to you.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-26 Thread Igor Tandetnik
Alberto Simoes wrote:
> For that, my script receives a word (say, 'car') and generated all
> possible additions and remotions, and substitutions:
>
> Additions: _car c_ar ca_r car_
> Substitutions: _ar c_r ca_
> remotions: ar cr ca
>
> Then, the script constructs an SQL query:
>
> SELECT DISTINCT(word) FROM dict WHERE word = "ar" OR word = "ca" OR
> word LIKE "_car" OR word LIKE "c_r" OR word = "cr" OR word LIKE "_ar"
> OR word LIKE "ca_r" OR word LIKE "c_ar" OR word LIKE "ca_" OR word
> LIKE "car_";
>
> And this SQL quer works... but not as quickly as I need (specially
> because the speed is proportional to the word size).

I'd try writing a custom function that figures out whether two words are 
"close enough" (most of the time, you should be able to declare a 
negative by looking at just two first characters), then do

select word from dict where closeEnough(word, 'car');

I also don't see why you need DISTINCT. Do you have duplicate words in 
dict?

Igor Tandetnik 



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Near misses

2009-06-26 Thread Martin Pfeifle
Hi, 

I guess the speed could significantly be improved,
if you leave out _car and _ar.
The inverted index which is basically (term, 
blob_containing_all_document_ids_of_this_term),
cannot skip any of the alphabetically ordered terms if the first character is 
variable.
At least that's my understanding.

Thank you for your idea, because I am also thinking of putting some fuzzy 
search on top of FTS.

Best Martin




Von: Alberto Simões 
An: General Discussion of SQLite Database 
Gesendet: Freitag, den 26. Juni 2009, 13:25:57 Uhr
Betreff: [sqlite] Near misses

Hello.

I am trying to find words in a dictionary stored in sqlite, and trying
a near miss approach.
For that I tried an algorithm to create patterns corresponding to
Levenshtein distance of 1 (edit distance of 1).
That means, one adition, one remotion or one substitution.

For that, my script receives a word (say, 'car') and generated all
possible additions and remotions, and substitutions:

Additions: _car c_ar ca_r car_
Substitutions: _ar c_r ca_
remotions: ar cr ca

Then, the script constructs an SQL query:

SELECT DISTINCT(word) FROM dict WHERE word = "ar" OR word = "ca" OR
word LIKE "_car" OR word LIKE "c_r" OR word = "cr" OR word LIKE "_ar"
OR word LIKE "ca_r" OR word LIKE "c_ar" OR word LIKE "ca_" OR word
LIKE "car_";

And this SQL quer works... but not as quickly as I need (specially
because the speed is proportional to the word size).

Any hint on how to speed up this thing?

THank you
Alberto

--
Alberto Simões
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users



  
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users