Is there any way for a custom FTS4 tokenizer to know when it’s tokenizing a 
search string (the argument of a MATCH expression), as opposed to text to be 
indexed?

Here’s my problem: I’ve implemented a custom tokenizer that skips “stop words” 
(noise words, like “the” and “a” in English). It works well. But I’ve just 
gotten a bug report that some search strings with wildcards don’t work. For 
example, “mo* AND the*” would be expected to match text containing the words 
“Moog” and “theremin”, but instead the query fails with the SQLite error 
“malformed MATCH expression: [mo* AND the*]”.

The reason for the error is that when the query runs, FTS4 uses my tokenizer to 
break the search string into words. My tokenizer skips “the” because it’s a 
stop word, so the sequence of tokens FTS4 gets is “mo”, “*”, “AND”, “*” … which 
is invalid since there’s no prefix before the second “*”.
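
To make the failure concrete, here is a minimal sketch in plain C of what I 
believe is happening. It is a simplified model, not SQLite's actual query 
parser and not the FTS4 tokenizer interface: the stop-word list and the names 
is_stop_word and count_orphaned_stars are made up for illustration. It scans a 
string the way a stop-word-skipping tokenizer would and counts every “*” that 
is not directly preceded by an emitted token.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical stop-word list, lower case, for illustration only. */
static const char *const kStopWords[] = { "the", "a", "an", "of" };

/* Returns 1 if word[0..len) matches a stop word, ignoring ASCII case. */
static int is_stop_word(const char *word, size_t len) {
    for (size_t i = 0; i < sizeof kStopWords / sizeof kStopWords[0]; i++) {
        if (strlen(kStopWords[i]) != len) continue;
        size_t j = 0;
        while (j < len && tolower((unsigned char)word[j]) == kStopWords[i][j]) j++;
        if (j == len) return 1;
    }
    return 0;
}

/* Simplified model: split text into alphanumeric runs, optionally dropping
 * stop words, and count each '*' that does not immediately follow an
 * emitted token (that is, a wildcard left without a prefix). */
static int count_orphaned_stars(const char *text, int skip_stop_words) {
    assert(text != NULL);
    size_t n = strlen(text);
    size_t i = 0;
    size_t last_token_end = (size_t)-1;  /* offset just past the last emitted token */
    int orphans = 0;

    while (i < n) {
        if (isalnum((unsigned char)text[i])) {
            size_t start = i;
            while (i < n && isalnum((unsigned char)text[i])) i++;
            if (!(skip_stop_words && is_stop_word(text + start, i - start))) {
                last_token_end = i;  /* this token is emitted */
            }
        } else {
            if (text[i] == '*' && last_token_end != i) orphans++;
            i++;
        }
    }
    return orphans;
}
```

With stop-word skipping enabled, “mo* AND the*” leaves one “*” without a 
prefix; with skipping disabled, every “*” still has its prefix.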

I can fix this by preserving stop words when the tokenizer is being used to 
scan the search string. But I can’t find any way for the tokenizer to tell the 
difference! It’s the same tokenizer instance used for indexing, and the SQLite 
function getNextToken opens it in the normal way and calls its xNext function.

The best workaround I can think of is to make the tokenizer preserve a 
stop word when it’s immediately followed by a “*” … but that isn’t reliable, 
because the same pattern can occur in regular text being indexed, when the “*” 
is a footnote marker or the end of a Markdown emphasis sequence.
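
A rough sketch of that heuristic in plain C, again with made-up names 
(is_noise_word, keep_token) and a made-up stop-word list rather than anything 
from the SQLite API. It assumes the tokenizer sees a NUL-terminated buffer, so 
it can peek at the byte just past the end of a token.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical stop-word list, lower case, for illustration only. */
static const char *const kNoiseWords[] = { "the", "a", "an", "of" };

/* Returns 1 if word[0..len) matches a stop word, ignoring ASCII case. */
static int is_noise_word(const char *word, size_t len) {
    for (size_t i = 0; i < sizeof kNoiseWords / sizeof kNoiseWords[0]; i++) {
        if (strlen(kNoiseWords[i]) != len) continue;
        size_t j = 0;
        while (j < len && tolower((unsigned char)word[j]) == kNoiseWords[i][j]) j++;
        if (j == len) return 1;
    }
    return 0;
}

/* Heuristic: emit the token text[start..end) unless it is a stop word,
 * but keep a stop word when the next byte is '*'.
 * Assumes text is NUL-terminated, so reading text[end] is safe. */
static int keep_token(const char *text, size_t start, size_t end) {
    assert(text != NULL && start <= end && end <= strlen(text));
    if (!is_noise_word(text + start, end - start)) return 1;
    return text[end] == '*';
}
```

For the query “mo* AND the*”, keep_token keeps “the” (offsets 8 to 11), which 
is what I want. For ordinary text such as “over the hill” it drops “the”. But 
for indexed Markdown like “*down the* road” it also keeps “the” (offsets 6 to 
9), which is exactly the false positive I’m worried about.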

—Jens