Hello,

I have a table containing a set of words. They are stored in a table called
'patterns_content', with 'pattern' as the unique key. There is an FTS4 table
named 'patterns' which has 'content=patterns_content' set. The patterns will
contain only ASCII characters. Something like,

ab
abd
abcd
.....
.....
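
For clarity, here is a rough sketch of the schema I am describing (Python's
sqlite3 is used only for illustration; the database file name and the 'docid'
column are made up):

import sqlite3

conn = sqlite3.connect("patterns.db")          # file name is illustrative
conn.executescript("""
    create table patterns_content (
        docid   integer primary key,
        pattern text unique not null           -- the word; unique key
    );
    -- FTS4 full-text index backed by the external content table
    create virtual table patterns using fts4(content="patterns_content", pattern);
""")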

Given a string, 'abcdef', I'd like to tokenize it based on the above table.
My tokenizer is greedy and always looks for the longest prefix match, which
means:

Eg1: tokenize('abcdef') = 'abcd' + tokenize('ef')
Eg2: tokenize('xyzabdab') = tokenize('xyz') + 'abd' + 'ab'
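
In code, the idea is roughly this (Python only for illustration;
longest_prefix stands for whichever lookup method below, and characters that
match no pattern are handled in a simplified way here):

def tokenize(text, longest_prefix):
    """Greedy left-to-right tokenizer.  longest_prefix(s) should return the
    longest stored pattern that is a prefix of s, or None if nothing matches."""
    tokens = []
    while text:
        match = longest_prefix(text)
        if match:
            tokens.append(match)            # take the longest match
            text = text[len(match):]        # and continue after it
        else:
            tokens.append(text[0])          # simplified: emit one character
            text = text[1:]
    return tokens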

I have tried the following methods to find the longest prefix match:

Method 1: select * from patterns where pattern match 'abcdef OR abcde OR abcd
OR abc OR ab OR a' order by length(pattern) desc limit 1

Method 2: select * from patterns_content where pattern = 'abcdef' OR pattern =
'abcde' OR pattern = 'abcd' OR pattern = 'abc' OR pattern = 'ab' OR pattern =
'a' order by length(pattern) desc limit 1

Method 3: Start reading from the left, one character at a time. On each run,
execute: select 1 from patterns_content as c where c.pattern >= 'a' and
c.pattern <= 'a' || 'z' limit 1. The next run tests 'ab', then 'abc', and so
on.
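
As a rough sketch of that loop (Python's sqlite3 only for illustration; the
exact-match check that decides where a token ends is spelled out here):

def longest_prefix(conn, text):
    """Method 3: grow the candidate prefix one character at a time, using the
    same range query as above to check whether any stored pattern falls
    between prefix and prefix || 'z', and remember the longest prefix that is
    itself a pattern."""
    best = None
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        starts = conn.execute(
            "select 1 from patterns_content as c "
            "where c.pattern >= ? and c.pattern <= ? || 'z' limit 1",
            (prefix, prefix)).fetchone()
        if starts is None:
            break                           # nothing starts with this prefix
        if conn.execute("select 1 from patterns_content where pattern = ?",
                        (prefix,)).fetchone():
            best = prefix                   # prefix is itself a stored pattern
    return best

Usage would then be something like tokenize('abcdef', lambda s:
longest_prefix(conn, s)).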

All of the above methods work correctly. Surprisingly, the last method
outperforms the others. I am wondering what optimization makes the last query
always execute so fast. Is it safe to assume that the query will keep this
performance even with a huge number of rows?

I am also wondering if there are other, more efficient ways to solve this
problem. Any help would be great. My table will have more than a million rows.

-- 
-Navaneeth