Hello,

I have a table which contains a set of words. These are stored in a table called 'patterns_content', with 'pattern' as the unique key. There is an FTS4 table named 'patterns' which has content='patterns_content' set. The data is ASCII only and looks something like:

    ab
    abd
    abcd
    .....
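To make the setup concrete, it is roughly equivalent to the following (a simplified Python sketch, not my exact schema or loading code):

    import sqlite3

    # Simplified sketch of the setup described above, not the exact schema.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE patterns_content(
            pattern TEXT UNIQUE   -- the stored words: 'ab', 'abd', 'abcd', ...
        );
        CREATE VIRTUAL TABLE patterns USING fts4(content='patterns_content', pattern);
    """)
    db.executemany("INSERT INTO patterns_content(pattern) VALUES (?)",
                   [("ab",), ("abd",), ("abcd",)])
    # External-content FTS4 tables need their full-text index built explicitly.
    db.execute("INSERT INTO patterns(patterns) VALUES ('rebuild')")
    db.commit()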
Given a string, 'abcdef', I'd like to tokenize it based on the above table. My tokenizer is greedy and always looks for the longest prefix match, which means:

    Eg1: tokenize('abcdef')   = 'abcd' + tokenize('ef')
    Eg2: tokenize('xyzabdab') = tokenize('xyz') + 'abd' + 'ab'

I have tried all of these methods to find the longest prefix match:

Method 1:

    select * from patterns where pattern match
        'abcdef OR abcde OR abcd OR abc OR ab OR a'
    order by length(pattern) desc limit 1

Method 2:

    select * from patterns_content
    where pattern = 'abcdef' OR pattern = 'abcde' OR pattern = 'abcd'
       OR pattern = 'abc' OR pattern = 'ab' OR pattern = 'a'
    order by length(pattern) desc limit 1

Method 3: Start reading from the left, one character at a time. On each run, execute:

    select 1 from patterns_content as c
    where c.pattern >= 'a' and c.pattern <= 'a' || 'z' limit 1

The next run tests for 'ab', then 'abc', and so on. (There is a sketch of the full loop at the end of this mail.)

All of the above methods work correctly. Surprisingly, the last method outperforms the others. I am wondering what optimization makes the last query consistently fast. Is it safe to assume that the query will keep this performance even with a huge number of rows? I am also wondering whether there are other, more efficient ways to solve this problem. Any help would be great. My table will have more than a million rows.

--
-Navaneeth
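P.S. In case it makes method 3 clearer, here is roughly the loop I described above, written out in Python (a sketch only: the exact-match lookup and the handling of unmatched characters are simplifications, and my real code may differ):

    def tokenize(db, text):
        """Greedy longest-prefix tokenizer (sketch of method 3).

        The range probe checks whether any stored pattern starts with the
        current prefix; the exact-match lookup remembers the longest prefix
        that is itself a pattern.  Unmatched characters are emitted one at
        a time here, which is a simplification."""
        tokens = []
        i, n = 0, len(text)
        while i < n:
            best = None            # longest prefix that is a stored pattern
            j = i
            while j < n:
                prefix = text[i:j + 1]
                # Method-3 probe: does any pattern start with 'prefix'?
                # (Using 'z' as the upper bound assumes no character sorts
                # above 'z', as in the examples above.)
                if db.execute(
                        "select 1 from patterns_content "
                        "where pattern >= ? and pattern <= ? || 'z' limit 1",
                        (prefix, prefix)).fetchone() is None:
                    break          # nothing starts with this prefix; stop growing
                if db.execute("select 1 from patterns_content where pattern = ?",
                              (prefix,)).fetchone():
                    best = prefix  # the prefix itself is a stored pattern
                j += 1
            if best is None:
                tokens.append(text[i])   # no match: pass the character through
                i += 1
            else:
                tokens.append(best)
                i += len(best)
        return tokens

Using the 'db' connection from the sketch earlier in this mail, tokenize(db, 'abcdef') returns ['abcd', 'e', 'f'], which matches Eg1.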