Forgot to add: I got the idea for the last method from this[1] post.
[1]
http://sqlite.1065341.n5.nabble.com/fast-string-prefix-matching-td10777.html
On Thu, Aug 2, 2012 at 11:36 PM, Navaneeth.K.N wrote:
> Hello,
>
> I have a table which contains a set of words. These are stored in a table
> called 'patterns_content', with 'pattern' as the unique key. There is an
> FTS4 table named 'patterns' which has 'content=patterns_content' set. The
> patterns contain only ASCII characters. Something like,
>
> ab
> abd
> abcd
> .
> .
>
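> In SQL, the setup is roughly the following (the TEXT type and the bare
> FTS4 column list here are illustrative, not the exact schema):
>
>   CREATE TABLE patterns_content(pattern TEXT UNIQUE NOT NULL);
>   CREATE VIRTUAL TABLE patterns USING fts4(content="patterns_content", pattern);
>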
> Given a string, 'abcdef', I'd like to tokenize it based on the above
> table. My tokenizer is greedy and always looks for the longest prefix
> match, which means:
>
> Eg1: tokenize('abcdef') = 'abcd' + tokenize('ef')
> Eg2: tokenize('xyzabdab') = tokenize('xyz') + 'abd' + 'ab'
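>
> Written out, the greedy loop looks like the recursive CTE below. This is
> illustrative only: it needs SQLite 3.8.3+ for WITH RECURSIVE, the prefix
> test inside it scans the table rather than using an index, and when
> nothing matches it falls back to emitting a single character.
>
>   -- tokenize('abcdef'); with the sample patterns above this emits
>   -- 'abcd', 'e', 'f'
>   WITH RECURSIVE toks(rest, piece) AS (
>     SELECT 'abcdef', NULL
>     UNION ALL
>     SELECT substr(rest, length(coalesce(
>              (SELECT p.pattern FROM patterns_content p
>                WHERE p.pattern = substr(toks.rest, 1, length(p.pattern))
>                ORDER BY length(p.pattern) DESC LIMIT 1),
>              substr(rest, 1, 1))) + 1),
>            coalesce(
>              (SELECT p.pattern FROM patterns_content p
>                WHERE p.pattern = substr(toks.rest, 1, length(p.pattern))
>                ORDER BY length(p.pattern) DESC LIMIT 1),
>              substr(rest, 1, 1))
>       FROM toks
>      WHERE rest <> ''
>   )
>   SELECT piece FROM toks WHERE piece IS NOT NULL;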
>
> I have tried the following methods to find the longest prefix match.
>
> Method 1:
>
>   SELECT * FROM patterns
>    WHERE pattern MATCH 'abcdef OR abcde OR abcd OR abc OR ab OR a'
>    ORDER BY length(pattern) DESC LIMIT 1;
>
> Method 2:
>
>   SELECT * FROM patterns_content
>    WHERE pattern = 'abcdef' OR pattern = 'abcde' OR pattern = 'abcd'
>       OR pattern = 'abc' OR pattern = 'ab' OR pattern = 'a'
>    ORDER BY length(pattern) DESC LIMIT 1;
>
> Method 3: Start reading from the left, one character at a time. On each
> run, execute:
>
>   SELECT 1 FROM patterns_content AS c
>    WHERE c.pattern >= 'a' AND c.pattern <= 'a' || 'z'
>    LIMIT 1;
>
> The next run tests 'ab', then 'abc', and so on.
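>
> In a reusable, parameterized form that probe becomes the statement below.
> The '{' upper bound is my tweak: '{' is the character right after 'z' in
> ASCII, so the half-open range also catches patterns such as 'abzz' that
> the '<= prefix || ''z''' form misses, assuming no pattern uses characters
> above 'z'.
>
>   -- Is there any pattern starting with the prefix bound to ?1?
>   SELECT 1 FROM patterns_content AS c
>    WHERE c.pattern >= ?1 AND c.pattern < ?1 || '{'
>    LIMIT 1;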
>
> All the above methods work well. Surprisingly, the last method
> outperforms the others. I am wondering what optimization makes the last
> query execute so fast? Is it reliable to assume that the query will keep
> this performance even with a huge number of rows?
>
> I am also wondering whether there are other, more efficient ways to
> solve this problem? Any help would be great. My table will have more
> than a million rows.
>
> --
> -Navaneeth
>
--
-n