I wonder if the idea of suffix arrays would belp.
http://www.cs.dartmouth.edu/~doug/sarray/
Also covered nicely in Programming Pearls, by Jon Bentley

But I've joined this thread in the middle, so may have missed soemthing
important.
        Hugh


On Thu, 6 Aug 2009, Jim Showalter wrote:

> I hadn't thought about it until now, but there's no reason you can't 
> treat the table of substrings as substrings appearing anywhere in a 
> word, and join each substring of a word to the substrings table.
> 
> For 26 letters, you have > 17k three-letter substrings and almost 500k 
> four-letter substrings, so this technique probably is limited to 
> three-letter substrings. But that's not bad--it means that if 
> character sequences are evenly distributed in words (they're not, but 
> assume they are for a moment), then the set of all possible words is 
> cut into 1/>17,000 on the first search. There are only about 600,000 
> words in the English language, so with even distribution your first 
> search cuts the results to ~34 words. But uneven character-sequence 
> distributions will skew that--some sequences will result in more words 
> than others. Sill, even if you get back 500 words to search for the 
> full match, that's better than searching all of your words. (You also 
> have to store all one-letter and two-letter substrings, but those 
> don't take up much additional room.)
> 
> You would have to have a separate table that has the multiple joins 
> (one per substring) in your words, because now each word can 
> potentially have M substrings in it, so M foreign keys.
> 
> For your example, Motor/motor would match mot, oto, and tor. You would 
> use a rule to arbitrarily match the first N letters in a the search 
> term, so if someone searched for "%motor%", that would be a search in 
> the substrings table for "mot". This would return the primary key for 
> "mot". Then you search the substring-matches table for all words that 
> have that substring, and this gives you back a list of words 
> containing "mot". Then you search that result set for "%motor%". 
> Similarly, if someone searchs for "%oto%", you get the substring key 
> for "oto" and find words containing that substring.
> 
> For all I know, full-text-search engines may do something like this.
> 
> ----- Original Message ----- 
> From: "Lukas Haase" <lukasha...@gmx.at>
> To: <sqlite-users@sqlite.org>
> Sent: Thursday, August 06, 2009 6:54 AM
> Subject: Re: [sqlite] FTS and postfix search
> 
> 
> > Hi Jim,
> >
> > and thank you for the great idea. But I thought it would be possible 
> > to
> > search '*word*' - but this is not possible with this method either.
> >
> > Is there any chance for searching '*word*' quickly?
> >
> > Regards,
> > Luke
> >
> > Jim Showalter schrieb:
> >> You could store the words reversed (in addition to storing them in
> >> forward order). Then like 'xxx%' would be fast.
> >>
> >> This would double your disk footprint, but could give you the 
> >> search
> >> performance you're looking for.
> >>
> >> If that's too goofy, you could create a table of all one, two, and
> >> three-character word endings, and join to it from all of your words
> >> (stored in forward order). Then search first for the primary key of
> >> the word ending you want to search for, then search your words for
> >> that key.
> >>
> >> Index the join.
> >>
> >> ----- Original Message ----- 
> >> From: "Lukas Haase" <lukasha...@gmx.at>
> >> To: <sqlite-users@sqlite.org>
> >> Sent: Wednesday, August 05, 2009 6:16 PM
> >> Subject: Re: [sqlite] FTS and postfix search
> >>
> >>
> >>> Wes Freeman schrieb:
> >>>> I clearly am not in the right mindset to be answering list 
> >>>> emails.
> >>>> Please ignore my response (it's too late now)--back to my 
> >>>> stressful
> >>>> deadline.
> >>> :-)
> >>>
> >>>> Strange that it's implemented for prefix and not postfix?
> >>> Well, an explanation is easy: Same as with LIKE, LIKE 'xxx' or 
> >>> LIKE
> >>> 'xxx%' can be performed easy because only the beginning of words
> >>> need to
> >>> be compared.
> >>>
> >>> However, there /is/ a way to also do postfix searches. I have the
> >>> *same*
> >>> database in *.hlp format and with WinHelp it's possible to search
> >>> '*otor' (and others) with almost zero CPU and time consumption. 
> >>> I'd
> >>> be
> >>> curious how they did this.
> >>>
> >>> For a solution for SQLite I would accept a small performance 
> >>> penalty
> >>> in
> >>> that case (but very few secs max); additionally I would also 
> >>> accept
> >>> the
> >>> index being bigger.
> >>>
> >>> Regards,
> >>> Luke
> >>>
> >>>> Wes
> >>>>
> >>>> On Wed, Aug 5, 2009 at 8:58 PM, Lukas Haase<lukasha...@gmx.at>
> >>>> wrote:
> >>>>> Wes Freeman schrieb:
> >>>>>> Why not LIKE '%otor'?
> >>>>> SELECT topic_title FROM topics
> >>>>> WHERE topic LIKE '%otor%'
> >>>>> ORDER BY topic_title ASC;
> >>>>>
> >>>>> This is very, very slow, especially on my > 100 MB database.
> >>>>> "Realtime"
> >>>>> search in the GUI is a requirement. This is exactly the reason 
> >>>>> why
> >>>>> I
> >>>>> want to use FTS instead of LIKE...
> >>>>>
> >>>>> Regards,
> >>>>> Luke
> >>>>>
> >>>>>> Wes
> >>>>>>
> >>>>>> On Wed, Aug 5, 2009 at 7:47 PM, Lukas Haase<lukasha...@gmx.at>
> >>>>>> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> It's me again, sorry. The next big problem concerning FTS. I
> >>>>>>> have the
> >>>>>>> requirement to do postfix searches, like:
> >>>>>>>
> >>>>>>> SELECT topic_title FROM topics
> >>>>>>> WHERE topic MATCH '*otor'
> >>>>>>> ORDER BY topic_title ASC;
> >>>>>>>
> >>>>>>> should find Motor, motor, Monotor etc. But this does not seem 
> >>>>>>> to
> >>>>>>> work.
> >>>>>>> Is there any chance to get this working?
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Luke
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> sqlite-users mailing list
> >>>>>>> sqlite-users@sqlite.org
> >>>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> sqlite-users mailing list
> >>>>>> sqlite-users@sqlite.org
> >>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>>>>>
> >>>>> _______________________________________________
> >>>>> sqlite-users mailing list
> >>>>> sqlite-users@sqlite.org
> >>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>>>>
> >>>> _______________________________________________
> >>>> sqlite-users mailing list
> >>>> sqlite-users@sqlite.org
> >>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>>>
> >>> _______________________________________________
> >>> sqlite-users mailing list
> >>> sqlite-users@sqlite.org
> >>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>
> >> _______________________________________________
> >> sqlite-users mailing list
> >> sqlite-users@sqlite.org
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>
> >
> >
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@sqlite.org
> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users 
> 
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> 
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to