Is there any way for a custom FTS4 tokenizer to know when it’s tokenizing a search string (the argument of a MATCH expression), as opposed to text to be indexed?
Here’s my problem: I’ve implemented a custom tokenizer that skips “stop words” (noise words, like “the” and “a” in English). It works well. But I’ve just gotten a bug report that some search strings with wildcards don’t work. For example, “mo* AND the*” would be expected to match text containing the words “Moog” and “theremin”, but instead the query fails with the SQLite error “malformed MATCH expression: [mo* AND the*]”.

The reason for the error is that when the query runs, FTS4 uses my tokenizer to break the search string into words. My tokenizer skips “the” because it’s a stop word, so the sequence of tokens FTS4 gets is “mo”, “*”, “AND”, “*”, which is invalid since there’s no prefix before the second “*”.

I can fix this by preserving stop words when the tokenizer is being used to scan the search string. But I can’t find any way for the tokenizer to tell the difference! It’s the same tokenizer instance used for indexing, and the SQLite function getNextToken opens it in the normal way and calls its xNext function.

The best workaround I can think of is to make the tokenizer preserve a stop word when it’s followed by a “*”, but there are contexts where this can happen in regular text being indexed, such as when the “*” is a footnote marker or the end of a Markdown emphasis sequence.

--Jens
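
P.S. To make the failure concrete, here is a minimal Python sketch of the behavior described above. It is only a toy model, not SQLite’s actual query parser or the FTS4 tokenizer API: the names STOP_WORDS, tokenize and query_terms and the keep_stop_words flag are all hypothetical, and the flag stands in for the “am I tokenizing a query?” signal that I can’t find.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an"}


def tokenize(text, keep_stop_words=False):
    """Return lower-cased word tokens, dropping stop words unless told to keep them."""
    words = (w.lower() for w in re.findall(r"[A-Za-z0-9]+", text))
    return [w for w in words if keep_stop_words or w not in STOP_WORDS]


def query_terms(query, keep_stop_words=False):
    """Toy model of scanning a MATCH string.

    Operators pass through unchanged; every other piece is run through the
    tokenizer, and a trailing '*' is emitted as a separate prefix marker.
    """
    out = []
    for piece in query.split():
        if piece in ("AND", "OR", "NOT"):
            out.append(piece)
            continue
        out.extend(tokenize(piece, keep_stop_words))
        if piece.endswith("*"):
            out.append("*")
    return out


# Stop words skipped: the second '*' is left with no prefix in front of it.
print(query_terms("mo* AND the*"))
# Stop words kept while scanning the query: the prefix term survives.
print(query_terms("mo* AND the*", keep_stop_words=True))
```

With the flag off, the sketch produces the same dangling “*” sequence that triggers the error; with it on, “the” is preserved and the prefix term stays intact.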