I wonder if the idea of suffix arrays would belp. http://www.cs.dartmouth.edu/~doug/sarray/ Also covered nicely in Programming Pearls, by Jon Bentley
But I've joined this thread in the middle, so may have missed soemthing important. Hugh On Thu, 6 Aug 2009, Jim Showalter wrote: > I hadn't thought about it until now, but there's no reason you can't > treat the table of substrings as substrings appearing anywhere in a > word, and join each substring of a word to the substrings table. > > For 26 letters, you have > 17k three-letter substrings and almost 500k > four-letter substrings, so this technique probably is limited to > three-letter substrings. But that's not bad--it means that if > character sequences are evenly distributed in words (they're not, but > assume they are for a moment), then the set of all possible words is > cut into 1/>17,000 on the first search. There are only about 600,000 > words in the English language, so with even distribution your first > search cuts the results to ~34 words. But uneven character-sequence > distributions will skew that--some sequences will result in more words > than others. Sill, even if you get back 500 words to search for the > full match, that's better than searching all of your words. (You also > have to store all one-letter and two-letter substrings, but those > don't take up much additional room.) > > You would have to have a separate table that has the multiple joins > (one per substring) in your words, because now each word can > potentially have M substrings in it, so M foreign keys. > > For your example, Motor/motor would match mot, oto, and tor. You would > use a rule to arbitrarily match the first N letters in a the search > term, so if someone searched for "%motor%", that would be a search in > the substrings table for "mot". This would return the primary key for > "mot". Then you search the substring-matches table for all words that > have that substring, and this gives you back a list of words > containing "mot". Then you search that result set for "%motor%". > Similarly, if someone searchs for "%oto%", you get the substring key > for "oto" and find words containing that substring. > > For all I know, full-text-search engines may do something like this. > > ----- Original Message ----- > From: "Lukas Haase" <lukasha...@gmx.at> > To: <sqlite-users@sqlite.org> > Sent: Thursday, August 06, 2009 6:54 AM > Subject: Re: [sqlite] FTS and postfix search > > > > Hi Jim, > > > > and thank you for the great idea. But I thought it would be possible > > to > > search '*word*' - but this is not possible with this method either. > > > > Is there any chance for searching '*word*' quickly? > > > > Regards, > > Luke > > > > Jim Showalter schrieb: > >> You could store the words reversed (in addition to storing them in > >> forward order). Then like 'xxx%' would be fast. > >> > >> This would double your disk footprint, but could give you the > >> search > >> performance you're looking for. > >> > >> If that's too goofy, you could create a table of all one, two, and > >> three-character word endings, and join to it from all of your words > >> (stored in forward order). Then search first for the primary key of > >> the word ending you want to search for, then search your words for > >> that key. > >> > >> Index the join. > >> > >> ----- Original Message ----- > >> From: "Lukas Haase" <lukasha...@gmx.at> > >> To: <sqlite-users@sqlite.org> > >> Sent: Wednesday, August 05, 2009 6:16 PM > >> Subject: Re: [sqlite] FTS and postfix search > >> > >> > >>> Wes Freeman schrieb: > >>>> I clearly am not in the right mindset to be answering list > >>>> emails. > >>>> Please ignore my response (it's too late now)--back to my > >>>> stressful > >>>> deadline. > >>> :-) > >>> > >>>> Strange that it's implemented for prefix and not postfix? > >>> Well, an explanation is easy: Same as with LIKE, LIKE 'xxx' or > >>> LIKE > >>> 'xxx%' can be performed easy because only the beginning of words > >>> need to > >>> be compared. > >>> > >>> However, there /is/ a way to also do postfix searches. I have the > >>> *same* > >>> database in *.hlp format and with WinHelp it's possible to search > >>> '*otor' (and others) with almost zero CPU and time consumption. > >>> I'd > >>> be > >>> curious how they did this. > >>> > >>> For a solution for SQLite I would accept a small performance > >>> penalty > >>> in > >>> that case (but very few secs max); additionally I would also > >>> accept > >>> the > >>> index being bigger. > >>> > >>> Regards, > >>> Luke > >>> > >>>> Wes > >>>> > >>>> On Wed, Aug 5, 2009 at 8:58 PM, Lukas Haase<lukasha...@gmx.at> > >>>> wrote: > >>>>> Wes Freeman schrieb: > >>>>>> Why not LIKE '%otor'? > >>>>> SELECT topic_title FROM topics > >>>>> WHERE topic LIKE '%otor%' > >>>>> ORDER BY topic_title ASC; > >>>>> > >>>>> This is very, very slow, especially on my > 100 MB database. > >>>>> "Realtime" > >>>>> search in the GUI is a requirement. This is exactly the reason > >>>>> why > >>>>> I > >>>>> want to use FTS instead of LIKE... > >>>>> > >>>>> Regards, > >>>>> Luke > >>>>> > >>>>>> Wes > >>>>>> > >>>>>> On Wed, Aug 5, 2009 at 7:47 PM, Lukas Haase<lukasha...@gmx.at> > >>>>>> wrote: > >>>>>>> Hi, > >>>>>>> > >>>>>>> It's me again, sorry. The next big problem concerning FTS. I > >>>>>>> have the > >>>>>>> requirement to do postfix searches, like: > >>>>>>> > >>>>>>> SELECT topic_title FROM topics > >>>>>>> WHERE topic MATCH '*otor' > >>>>>>> ORDER BY topic_title ASC; > >>>>>>> > >>>>>>> should find Motor, motor, Monotor etc. But this does not seem > >>>>>>> to > >>>>>>> work. > >>>>>>> Is there any chance to get this working? > >>>>>>> > >>>>>>> Best regards, > >>>>>>> Luke > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> sqlite-users mailing list > >>>>>>> sqlite-users@sqlite.org > >>>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> sqlite-users mailing list > >>>>>> sqlite-users@sqlite.org > >>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >>>>>> > >>>>> _______________________________________________ > >>>>> sqlite-users mailing list > >>>>> sqlite-users@sqlite.org > >>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >>>>> > >>>> _______________________________________________ > >>>> sqlite-users mailing list > >>>> sqlite-users@sqlite.org > >>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >>>> > >>> _______________________________________________ > >>> sqlite-users mailing list > >>> sqlite-users@sqlite.org > >>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >> > >> _______________________________________________ > >> sqlite-users mailing list > >> sqlite-users@sqlite.org > >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > >> > > > > > > _______________________________________________ > > sqlite-users mailing list > > sqlite-users@sqlite.org > > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > > _______________________________________________ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users