[sqlite] FTS tokenize=unicode61: "full" or "simple" case folding?

Tomash Brechko Mon, 21 Mar 2016 22:43:31 +0300

Hello,

The page https://www.sqlite.org/fts3.html#tokenizer says that the unicode61
tokenizer implements _full_ case folding (it doesn't emphasize the word,
but it's there).  ftp://unicode.org/Public/6.1.0/ucd/CaseFolding.txt has
the following rules:


-- cut --
...
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
...
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
...
-- cut --

I.e., in _full_ case folding both "ẞ" (U+1E9E) and "ß" (U+00DF) are mapped
to "ss", whereas in _simple_ case folding the first one is mapped to the
second.  SQLite 3.11.0 works according to the simple rules:

-- cut --
CREATE VIRTUAL TABLE t USING fts3tokenize(unicode61);
SELECT token FROM t WHERE input = "ẞ ß";
-- cut --
gives
-- cut --
ß
ß
-- cut --
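
For comparison, here is a minimal Python sketch of the two mappings.  The
FULL and SIMPLE tables are hand-copied from just the CaseFolding.txt rows
quoted above (not the whole file), and fold() is a hypothetical helper
written for this illustration; it is not how SQLite implements folding.
Python's built-in str.casefold() is included as a reference point, since
it performs full case folding.

```python
# Subset of CaseFolding.txt: only the rows quoted in this message.
FULL = {"\u00df": "ss", "\u1e9e": "ss"}   # status F (full folding)
SIMPLE = {"\u1e9e": "\u00df"}             # status S (simple folding)


def fold(text, table):
    """Fold each character through `table`; unmapped characters pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


tokens = ["\u1e9e", "\u00df"]  # capital sharp s, small sharp s

full_result = [fold(t, FULL) for t in tokens]
simple_result = [fold(t, SIMPLE) for t in tokens]
builtin_result = [t.casefold() for t in tokens]  # built-in full case folding

print("full:   ", full_result)
print("simple: ", simple_result)
print("builtin:", builtin_result)
```

Under the full table both tokens fold to "ss"; under the simple table both
come out as U+00DF, which matches the tokenizer output shown above.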

So which one is correct, the documentation or the implementation?  I also
wonder what a native German speaker would expect in the full-text search
case.  (Google gives different result counts for "Schloß" and "Schloss",
which actually surprises me a bit.)

-- 
  Tomash Brechko
