[sqlite] FTS tokenize=unicode61: "full" or "simple" case folding?

Matthias-Christian Ott Mon, 21 Mar 2016 22:08:30 +0100

On 2016-03-21 20:43, Tomash Brechko wrote:
> Hello,
> 
> https://www.sqlite.org/fts3.html#tokenizer page says that unicode61
> tokenizer implements _full_ case folding (it doesn't emphasize the word,
> but it's there).  ftp://unicode.org/Public/6.1.0/ucd/CaseFolding.txt has
> the following rules:
> 
> -- cut --
> ...
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
> ...
> 1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
> 1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
> ...
> -- cut --
> 
> I.e. in _full_ case folding both "ẞ" (U+1E9E) and "ß" (U+00DF) are mapped
> to "ss", whereas in _simple_ case folding first one is mapped to the
> second.  SQLite 3.11.0 works according to simple rules:
> 
> -- cut --
> CREATE VIRTUAL TABLE t USING fts3tokenize(unicode61);
> SELECT token FROM t WHERE input = "ẞ ß";
> -- cut --
> gives
> -- cut--
> ß
> ß
> -- cut--
> 
> So which one is correct, documentation or implementation?  I also wonder
> what a native German speaker would expect in full-text search case?
> (Google gives different result counts for "Schloß" and "Schloss", which
> actually surprises me a bit).


The character "ẞ" was often not present in fonts, is not included in
ISO/IEC 8859-1:1998, and has not historically been in common use in
German (the German Wikipedia and the articles' references can explain
this better than I can). It was only "recently" added to Unicode, in
version 5.1 in 2008. In all-caps titles it is common to capitalize ß as
either SS or SZ (to avoid ambiguities). I think it's uncertain whether ẞ
will be widely used.

If I understand Unicode case folding correctly, it exists to be able to
compare Unicode strings case-insensitively by converting them into a
canonical form. So simple case folding seems correct, as ẞ would be
folded to ß. However, if you keep in mind the old orthography (before
1996) and want to know what makes sense for a search engine, full case
folding makes more sense. As you noted, "Schloß" and "Schloss" should
return the same results for non-verbatim searches, as the distinction
would seemingly only be relevant to linguists or historians, not to
everyday use and business information systems.
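
To make the difference concrete, here is a small Python sketch. It
assumes that Python's str.casefold() applies Unicode full case folding,
and it uses str.lower() merely as a stand-in for the simple behaviour on
the sharp s (it is not a general implementation of simple case folding).
The function names are made up for illustration and have nothing to do
with SQLite's tokenizer code.

```python
# Illustration only: compare full case folding with a simple-like folding
# for the sharp s characters discussed in this thread.

CAPITAL_SHARP_S = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
SMALL_SHARP_S = "\u00df"    # LATIN SMALL LETTER SHARP S


def full_fold(text):
    """Full case folding (str.casefold maps both sharp s forms to 'ss')."""
    return text.casefold()


def simple_like_fold(text):
    """Stand-in for simple folding: one code point in, one code point out
    for the sharp s (U+1E9E becomes U+00DF, U+00DF stays as it is)."""
    return text.lower()


def same_under(fold, a, b):
    """Return True if a and b compare equal after applying fold."""
    return fold(a) == fold(b)


if __name__ == "__main__":
    old_spelling = "Schlo" + SMALL_SHARP_S  # pre-1996 orthography
    new_spelling = "Schloss"
    print("full:  ", same_under(full_fold, old_spelling, new_spelling))
    print("simple:", same_under(simple_like_fold, old_spelling, new_spelling))
```

With full folding the two spellings compare equal, with the simple-like
folding they do not, which is the search behaviour I was referring to
above.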

The Unicode standard is unfortunately vague about what it wants to
achieve by case folding and what thoughts went into the case folding
table. Perhaps you should ask on the Unicode mailing list.

You didn't describe your use case, but I would also generally advise
using a phonetic algorithm for German to canonicalize words for
non-verbatim searches instead of case folding. It gives better results,
and most German speakers I know appreciate the phonetic corrections of
popular Internet search engines for non-verbatim searches.

I hope this helps. Maybe it also helps to consult a linguist for
building a non-simplistic search engine for German. For example, you
have to perform compound splitting, stemming and some form of
grammatical analysis at some point.

- Matthias-Christian
