You can lazy-init the table of substrings and substring matches when 
you save words, so you only use space for substrings that appear in 
words that have been saved in your database. That avoids entering 
substrings that never occur in English, and substrings that occur in 
English but not in any words you are searching.

----- Original Message ----- 
From: "Jim Showalter" <j...@jimandlisa.com>
To: "General Discussion of SQLite Database" <sqlite-users@sqlite.org>
Sent: Thursday, August 06, 2009 8:59 AM
Subject: Re: [sqlite] FTS and postfix search


>I hadn't thought about it until now, but there's no reason you can't 
>treat the table of substrings as substrings appearing anywhere in a 
>word, and join each substring of a word to the substrings table.
>
> For 26 letters, you have > 17k three-letter substrings and almost 
> 500k four-letter substrings, so this technique probably is limited 
> to three-letter substrings. But that's not bad--it means that if 
> character sequences are evenly distributed in words (they're not, 
> but assume they are for a moment), then the set of all possible 
> words is cut into 1/>17,000 on the first search. There are only 
> about 600,000 words in the English language, so with even 
> distribution your first search cuts the results to ~34 words. But 
> uneven character-sequence distributions will skew that--some 
> sequences will result in more words than others. Sill, even if you 
> get back 500 words to search for the full match, that's better than 
> searching all of your words. (You also have to store all one-letter 
> and two-letter substrings, but those don't take up much additional 
> room.)
>
> You would have to have a separate table that has the multiple joins 
> (one per substring) in your words, because now each word can 
> potentially have M substrings in it, so M foreign keys.
>
> For your example, Motor/motor would match mot, oto, and tor. You 
> would use a rule to arbitrarily match the first N letters in a the 
> search term, so if someone searched for "%motor%", that would be a 
> search in the substrings table for "mot". This would return the 
> primary key for "mot". Then you search the substring-matches table 
> for all words that have that substring, and this gives you back a 
> list of words containing "mot". Then you search that result set for 
> "%motor%". Similarly, if someone searchs for "%oto%", you get the 
> substring key for "oto" and find words containing that substring.
>
> For all I know, full-text-search engines may do something like this.
>
> ----- Original Message ----- 
> From: "Lukas Haase" <lukasha...@gmx.at>
> To: <sqlite-users@sqlite.org>
> Sent: Thursday, August 06, 2009 6:54 AM
> Subject: Re: [sqlite] FTS and postfix search
>
>
>> Hi Jim,
>>
>> and thank you for the great idea. But I thought it would be 
>> possible to
>> search '*word*' - but this is not possible with this method either.
>>
>> Is there any chance for searching '*word*' quickly?
>>
>> Regards,
>> Luke
>>
>> Jim Showalter schrieb:
>>> You could store the words reversed (in addition to storing them in
>>> forward order). Then like 'xxx%' would be fast.
>>>
>>> This would double your disk footprint, but could give you the 
>>> search
>>> performance you're looking for.
>>>
>>> If that's too goofy, you could create a table of all one, two, and
>>> three-character word endings, and join to it from all of your 
>>> words
>>> (stored in forward order). Then search first for the primary key 
>>> of
>>> the word ending you want to search for, then search your words for
>>> that key.
>>>
>>> Index the join.
>>>
>>> ----- Original Message ----- 
>>> From: "Lukas Haase" <lukasha...@gmx.at>
>>> To: <sqlite-users@sqlite.org>
>>> Sent: Wednesday, August 05, 2009 6:16 PM
>>> Subject: Re: [sqlite] FTS and postfix search
>>>
>>>
>>>> Wes Freeman schrieb:
>>>>> I clearly am not in the right mindset to be answering list 
>>>>> emails.
>>>>> Please ignore my response (it's too late now)--back to my 
>>>>> stressful
>>>>> deadline.
>>>> :-)
>>>>
>>>>> Strange that it's implemented for prefix and not postfix?
>>>> Well, an explanation is easy: Same as with LIKE, LIKE 'xxx' or 
>>>> LIKE
>>>> 'xxx%' can be performed easy because only the beginning of words
>>>> need to
>>>> be compared.
>>>>
>>>> However, there /is/ a way to also do postfix searches. I have the
>>>> *same*
>>>> database in *.hlp format and with WinHelp it's possible to search
>>>> '*otor' (and others) with almost zero CPU and time consumption. 
>>>> I'd
>>>> be
>>>> curious how they did this.
>>>>
>>>> For a solution for SQLite I would accept a small performance 
>>>> penalty
>>>> in
>>>> that case (but very few secs max); additionally I would also 
>>>> accept
>>>> the
>>>> index being bigger.
>>>>
>>>> Regards,
>>>> Luke
>>>>
>>>>> Wes
>>>>>
>>>>> On Wed, Aug 5, 2009 at 8:58 PM, Lukas Haase<lukasha...@gmx.at>
>>>>> wrote:
>>>>>> Wes Freeman schrieb:
>>>>>>> Why not LIKE '%otor'?
>>>>>> SELECT topic_title FROM topics
>>>>>> WHERE topic LIKE '%otor%'
>>>>>> ORDER BY topic_title ASC;
>>>>>>
>>>>>> This is very, very slow, especially on my > 100 MB database.
>>>>>> "Realtime"
>>>>>> search in the GUI is a requirement. This is exactly the reason 
>>>>>> why
>>>>>> I
>>>>>> want to use FTS instead of LIKE...
>>>>>>
>>>>>> Regards,
>>>>>> Luke
>>>>>>
>>>>>>> Wes
>>>>>>>
>>>>>>> On Wed, Aug 5, 2009 at 7:47 PM, Lukas Haase<lukasha...@gmx.at>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> It's me again, sorry. The next big problem concerning FTS. I
>>>>>>>> have the
>>>>>>>> requirement to do postfix searches, like:
>>>>>>>>
>>>>>>>> SELECT topic_title FROM topics
>>>>>>>> WHERE topic MATCH '*otor'
>>>>>>>> ORDER BY topic_title ASC;
>>>>>>>>
>>>>>>>> should find Motor, motor, Monotor etc. But this does not seem 
>>>>>>>> to
>>>>>>>> work.
>>>>>>>> Is there any chance to get this working?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Luke
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> sqlite-users mailing list
>>>>>>>> sqlite-users@sqlite.org
>>>>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> sqlite-users mailing list
>>>>>>> sqlite-users@sqlite.org
>>>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>>>>>
>>>>>> _______________________________________________
>>>>>> sqlite-users mailing list
>>>>>> sqlite-users@sqlite.org
>>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>>>>
>>>>> _______________________________________________
>>>>> sqlite-users mailing list
>>>>> sqlite-users@sqlite.org
>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>>>
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> sqlite-users@sqlite.org
>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>
>>> _______________________________________________
>>> sqlite-users mailing list
>>> sqlite-users@sqlite.org
>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>
>>
>>
>> _______________________________________________
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> 

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to