Re: [sqlite] How can custom tokenizer tell it's parsing a search string?

Dan Kennedy Wed, 12 Dec 2018 07:08:52 -0800

On 12/12/2018 03:37 AM, Jens Alfke wrote:

Is there any way for a custom FTS4 tokenizer to know when it’s
tokenizing a search string (the argument of a MATCH expression), as
opposed to text to be indexed?


Here’s my problem: I’ve implemented a custom tokenizer that skips
“stop words” (noise words, like “the” and “a” in English.) It works
well. But I’ve just gotten a bug report that some search strings with
wild-cards don’t work. For example, “mo* AND the*” would be expected
to match text containing the words “Moog” and “theremin”, but instead
the query fails with the SQLite error "malformed MATCH expression:
[mo* AND the*]”.

The reason for the error is that when the query runs, FTS4 uses my
tokenizer to break the search string into words. My tokenizer skips
“the” because it’s a stop word, so the sequence of tokens FTS4 gets
is “mo”, “*”, “AND”, “*” … which is invalid since there’s no prefix
before the second “*”.
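
To make the failure concrete, here is a toy Python model of the token sequence described above. It is not FTS4's actual query parser; the stop list and the helper names `tokenize` and `dangling_stars` are invented for illustration.

```python
import re

STOP_WORDS = {"the", "a", "an"}      # hypothetical stop list
OPERATORS = {"AND", "OR", "NOT"}     # query keywords kept verbatim

def tokenize(text, skip_stop_words=True):
    """Return word tokens and literal '*' markers in input order."""
    tokens = []
    for match in re.finditer(r"[A-Za-z]+|\*", text):
        piece = match.group()
        if piece == "*" or piece in OPERATORS:
            tokens.append(piece)
        elif skip_stop_words and piece.lower() in STOP_WORDS:
            continue                 # the stop word vanishes, its '*' does not
        else:
            tokens.append(piece.lower())
    return tokens

def dangling_stars(tokens):
    """Indices of '*' markers that have no word directly before them."""
    bad = []
    for i, tok in enumerate(tokens):
        if tok == "*" and (i == 0 or tokens[i - 1] in OPERATORS | {"*"}):
            bad.append(i)
    return bad

print(tokenize("mo* AND the*"))                    # ['mo', '*', 'AND', '*']
print(dangling_stars(tokenize("mo* AND the*")))    # [3]
```

The second "*" ends up directly after "AND", which is the shape that triggers the "malformed MATCH expression" error.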

I can fix this by preserving stop-words when the tokenizer is being
used to scan the search string. But I can’t find any way for the
tokenizer to tell the difference! It’s the same tokenizer instance
used for indexing, and the SQLite function getNextToken opens it in
the normal way and calls its xNext function.

I don't think there is any way to tell with FTS3/4. FTS5 passes a parameter to the tokenizer to indicate this (the mask of FTS5_TOKENIZE_* flags), but FTS3/4 does not. But you wouldn't have this problem with FTS5 anyhow, because it handles the AND or "*" syntax before passing whatever is left to the tokenizer.


  https://sqlite.org/fts5.html#custom_tokenizers
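
As a rough illustration of how that flags mask could be used, here is a Python model of a stop-word tokenizer that behaves differently for prefix queries. A real FTS5 tokenizer is written in C and receives the mask in its xTokenize callback; the constants below mirror the values in sqlite3.h, while `x_tokenize` and the stop list are hypothetical.

```python
import re

# Values as defined in sqlite3.h for the FTS5 tokenizer flags mask.
FTS5_TOKENIZE_QUERY = 0x0001
FTS5_TOKENIZE_PREFIX = 0x0002
FTS5_TOKENIZE_DOCUMENT = 0x0004
FTS5_TOKENIZE_AUX = 0x0008

STOP_WORDS = {"the", "a", "an"}      # hypothetical stop list

def x_tokenize(text, flags):
    """Toy stand-in for xTokenize: return the tokens that would be emitted."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # QUERY|PREFIX means the text was followed by '*' in the MATCH
    # expression, so the last token will be treated as a prefix.
    is_prefix_query = bool(flags & FTS5_TOKENIZE_QUERY) and bool(
        flags & FTS5_TOKENIZE_PREFIX
    )
    out = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if word in STOP_WORDS and not (is_prefix_query and is_last):
            continue                 # drop stop words everywhere else
        out.append(word)
    return out

print(x_tokenize("The theremin", FTS5_TOKENIZE_DOCUMENT))
print(x_tokenize("the", FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX))
```

Under this assumption "the*" is kept as the prefix "the" and can still match "theremin", while a plain "the" is dropped both when indexing and when querying.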

Leaving stop words in while parsing queries won't quite work anyway. If your tokenizer returns "the" when parsing a query, FTS3/4 will search for "the" in the index. And it won't be there if the tokenizer used for parsing documents stripped it out.
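
A small Python sketch of that point, using a toy inverted index rather than the real FTS3/4 storage format. The names `build_index` and `lookup`, the sample documents, and the stop list are all made up for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an"}      # hypothetical stop list

def index_tokens(text):
    """Tokens stored at indexing time: stop words are stripped."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_index(docs):
    """Map each indexed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in index_tokens(text):
            index[token].add(doc_id)
    return index

def lookup(index, term, prefix=False):
    """Exact-term lookup, or a union over all terms starting with `term`."""
    if not prefix:
        return set(index.get(term, set()))
    hits = set()
    for token, doc_ids in index.items():
        if token.startswith(term):
            hits |= doc_ids
    return hits

index = build_index({1: "The Moog and the theremin", 2: "A quiet room"})
print(lookup(index, "the"))               # set(): "the" was never indexed
print(lookup(index, "the", prefix=True))  # {1}: matches "theremin"
```

So a query term "the" finds nothing, although the prefix "the" would still find documents containing "theremin".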

I think your best options might be to switch to FTS5 or to write a tokenizer smart enough to remove the AND or other syntax tokens when required.


Dan.

The best workaround I can think of is to make the tokenizer preserve
a stop-word when it’s followed by a “*” … but there are contexts
where this can happen in regular text being indexed, when the “*” is
a footnote marker or the end of a Markdown emphasis sequence.
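
A toy Python version of this look-ahead heuristic shows both the fix and the false positive. The function name `tokenize_keep_before_star` and the stop list are hypothetical, and this is only a model of the idea, not an FTS4 tokenizer.

```python
import re

STOP_WORDS = {"the", "a", "an"}      # hypothetical stop list

def tokenize_keep_before_star(text):
    """Drop stop words unless the very next character is '*'."""
    tokens = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        followed_by_star = text[match.end():match.end() + 1] == "*"
        if word in STOP_WORDS and not followed_by_star:
            continue
        tokens.append(word)
    return tokens

print(tokenize_keep_before_star("the*"))                 # ['the']
print(tokenize_keep_before_star("Play the theremin"))    # ['play', 'theremin']
print(tokenize_keep_before_star("Play *the* theremin"))  # ['play', 'the', 'theremin']
```

The last call is the problem case: Markdown emphasis around a stop word makes it look like a prefix query, so "the" would be written into the index.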

—Jens _______________________________________________ sqlite-users
mailing list [email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


