On 12/12/2018 03:37 AM, Jens Alfke wrote:
Is there any way for a custom FTS4 tokenizer to know when it’s
tokenizing a search string (the argument of a MATCH expression), as
opposed to text to be indexed?

Here’s my problem: I’ve implemented a custom tokenizer that skips
“stop words” (noise words, like “the” and “a” in English.) It works
well. But I’ve just gotten a bug report that some search strings with
wild-cards don’t work. For example, “mo* AND the*” would be expected
to match text containing the words “Moog” and “theremin”, but instead
the query fails with the SQLite error “malformed MATCH expression:
[mo* AND the*]”.

The reason for the error is that when the query runs, FTS4 uses my
tokenizer to break the search string into words. My tokenizer skips
“the” because it’s a stop word, so the sequence of tokens FTS4 gets
is “mo”, “*”, “AND”, “*” … which is invalid since there’s no prefix
before the second “*”.

I can fix this by preserving stop-words when the tokenizer is being
used to scan the search string. But I can’t find any way for the
tokenizer to tell the difference! It’s the same tokenizer instance
used for indexing, and the SQLite function getNextToken opens it in
the normal way and calls its xNext function.


I don't think there is any way to tell with FTS3/4. FTS5 passes a parameter to the tokenizer to indicate this (the mask of FTS5_TOKENIZE_* flags), but FTS3/4 does not. You wouldn't have this problem with FTS5 anyway, because it handles the AND and "*" syntax itself before passing whatever is left to the tokenizer.

  https://sqlite.org/fts5.html#custom_tokenizers
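
For illustration, here is a rough sketch of how a stop-word tokenizer could use such a flags mask. It is not the real FTS5 interface: the DEMO_* constants only mirror the values of FTS5_TOKENIZE_QUERY, FTS5_TOKENIZE_PREFIX and FTS5_TOKENIZE_DOCUMENT, and demo_tokenize, demo_collect and the stop-word list are hypothetical names with a simplified callback (the real xTokenize callback also receives token flags and byte offsets). The idea is to drop stop words as usual, except for the last token of a prefix query such as the*.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Local stand-ins mirroring the FTS5 flag values; a real tokenizer would
   take FTS5_TOKENIZE_* from sqlite3.h instead of defining its own. */
#define DEMO_TOKENIZE_QUERY    0x0001
#define DEMO_TOKENIZE_PREFIX   0x0002
#define DEMO_TOKENIZE_DOCUMENT 0x0004

/* Simplified token callback (assumption): returns non-zero to abort. */
typedef int (*demo_token_cb)(void *ctx, const char *tok, int ntok);

/* Hypothetical stop-word list, compared case-insensitively. */
static int demo_is_stop_word(const char *tok, int ntok) {
    static const char *const stops[] = {"the", "a", "an"};
    for (size_t i = 0; i < sizeof stops / sizeof stops[0]; i++) {
        if ((int)strlen(stops[i]) != ntok) continue;
        int k = 0;
        while (k < ntok && tolower((unsigned char)tok[k]) == stops[i][k]) k++;
        if (k == ntok) return 1;
    }
    return 0;
}

/* Splits text on non-alphanumeric bytes and reports each kept token.
   Stop words are dropped, unless the PREFIX flag is set and the token is
   the last one in the input, i.e. the prefix of a query like "the*". */
int demo_tokenize(int flags, const char *text, int ntext,
                  demo_token_cb cb, void *ctx) {
    int i = 0;
    while (i < ntext) {
        while (i < ntext && !isalnum((unsigned char)text[i])) i++;
        int start = i;
        while (i < ntext && isalnum((unsigned char)text[i])) i++;
        if (i == start) break;              /* only separators were left */

        int j = i;                          /* look ahead for more tokens */
        while (j < ntext && !isalnum((unsigned char)text[j])) j++;
        int is_prefix = (flags & DEMO_TOKENIZE_PREFIX) && j == ntext;

        if (!is_prefix && demo_is_stop_word(text + start, i - start)) continue;
        int rc = cb(ctx, text + start, i - start);
        if (rc != 0) return rc;
    }
    return 0;
}

/* Example sink: appends each token followed by a space. */
struct demo_sink { char buf[128]; size_t used; };

int demo_collect(void *ctx, const char *tok, int ntok) {
    struct demo_sink *s = ctx;
    if (s->used + (size_t)ntok + 2 > sizeof s->buf) return 1;
    memcpy(s->buf + s->used, tok, (size_t)ntok);
    s->used += (size_t)ntok;
    s->buf[s->used++] = ' ';
    s->buf[s->used] = '\0';
    return 0;
}
```

Called with DEMO_TOKENIZE_DOCUMENT, this drops every "the"; called with DEMO_TOKENIZE_QUERY | DEMO_TOKENIZE_PREFIX on the text "the", it keeps the token so that it can act as a prefix.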

Leaving stop words in while parsing queries won't quite work anyway. If your tokenizer returns "the" when parsing a query, FTS3/4 will search for "the" in the index. And it won't be there if the tokenizer used for parsing documents stripped it out.

I think your best options might be to switch to FTS5 or to write a tokenizer smart enough to remove the AND or other syntax tokens when required.

Dan.




The best workaround I can think of is to make the tokenizer preserve
a stop-word when it’s followed by a “*” … but there are contexts
where this can happen in regular text being indexed, when the “*” is
a footnote marker or the end of a Markdown emphasis sequence.
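
A minimal sketch of that look-ahead workaround, written as a standalone function rather than against the FTS3/4 tokenizer interface. The names sketch_tokenize and sketch_is_stop_word, the stop-word list, and the space-separated output format are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stop-word list, compared case-insensitively. */
static int sketch_is_stop_word(const char *tok, size_t n) {
    static const char *const stops[] = {"the", "a", "an"};
    for (size_t i = 0; i < sizeof stops / sizeof stops[0]; i++) {
        if (strlen(stops[i]) != n) continue;
        size_t k = 0;
        while (k < n && tolower((unsigned char)tok[k]) == stops[i][k]) k++;
        if (k == n) return 1;
    }
    return 0;
}

/* Lower-cases the kept tokens into out, separated by single spaces.
   A stop word is kept only when a '*' follows it directly.
   Returns the number of tokens kept, or -1 if out is too small. */
int sketch_tokenize(const char *text, char *out, size_t outcap) {
    size_t len = strlen(text), i = 0, used = 0;
    int count = 0;
    if (outcap == 0) return -1;
    out[0] = '\0';
    while (i < len) {
        while (i < len && !isalnum((unsigned char)text[i])) i++;
        size_t start = i;
        while (i < len && isalnum((unsigned char)text[i])) i++;
        size_t n = i - start;
        if (n == 0) break;                  /* only separators were left */

        int star_follows = (i < len && text[i] == '*');
        if (sketch_is_stop_word(text + start, n) && !star_follows) continue;

        size_t need = n + (count > 0 ? 1 : 0);
        if (used + need + 1 > outcap) return -1;
        if (count > 0) out[used++] = ' ';
        for (size_t k = 0; k < n; k++)
            out[used++] = (char)tolower((unsigned char)text[start + k]);
        out[used] = '\0';
        count++;
    }
    return count;
}
```

With this rule "mo* the*" keeps both prefixes, and "the Moog" still loses its stop word. The drawback described above shows up too: in "read the* footnote", where the "*" is a footnote marker, "the" is kept even though it is ordinary text being indexed.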

—Jens _______________________________________________ sqlite-users
mailing list sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

