On 2016-03-21 20:43, Tomash Brechko wrote:
> Hello,
>
> https://www.sqlite.org/fts3.html#tokenizer page says that unicode61
> tokenizer implements _full_ case folding (it doesn't emphasize the word,
> but it's there).  ftp://unicode.org/Public/6.1.0/ucd/CaseFolding.txt has
> the following rules:
>
> -- cut --
> ...
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
> ...
> 1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
> 1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
> ...
> -- cut --
>
> I.e. in _full_ case folding both "ẞ" (U+1E9E) and "ß" (U+00DF) are mapped
> to "ss", whereas in _simple_ case folding first one is mapped to the
> second.  SQLite 3.11.0 works according to simple rules:
>
> -- cut --
> CREATE VIRTUAL TABLE t USING fts3tokenize(unicode61);
> SELECT token FROM t WHERE input = "ẞ ß";
> -- cut --
> gives
> -- cut--
> ß
> ß
> -- cut--
>
> So which one is correct, documentation or implementation?  I also wonder
> what a native German speaker would expect in full-text search case?
> (Google gives different result counts for "Schloß" and "Schloss", which
> actually surprises me a bit).

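To make the difference between the two folding modes concrete, here is a
minimal sketch in Python. It only illustrates the CaseFolding.txt rules
quoted above and does not touch SQLite or the unicode61 tokenizer. The
helper names simple_fold and full_fold are made up for this example, and
using str.lower() as a stand-in for simple case folding is an assumption
that holds for the two sharp-s characters but not for Unicode in general.

```python
# Illustration only: Python string methods, not SQLite's unicode61 tokenizer.
CAPITAL_SHARP_S = "\u1E9E"  # LATIN CAPITAL LETTER SHARP S
SMALL_SHARP_S = "\u00DF"    # LATIN SMALL LETTER SHARP S


def simple_fold(text: str) -> str:
    """Stand-in for simple case folding (hypothetical helper).

    str.lower() maps U+1E9E to U+00DF and leaves U+00DF unchanged, which
    matches the "S" rule quoted above for these two characters only.
    """
    return text.lower()


def full_fold(text: str) -> str:
    """Full case folding via str.casefold(), which applies the "F" rules."""
    return text.casefold()


if __name__ == "__main__":
    # Compare both foldings for the spellings discussed in this thread.
    for word in ("Schlo" + SMALL_SHARP_S, "Schloss", "SCHLO" + CAPITAL_SHARP_S):
        print(word, simple_fold(word), full_fold(word))
```

With full folding all three spellings end up as the same string, so they
would match each other in a search. With the simple variant the spellings
containing a sharp s stay distinct from "schloss".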
The character "ẞ" (U+1E9E) was often not present in fonts, is not included
in ISO/IEC 8859-1:1998, and has not historically been in common use in
German (the German Wikipedia and the references in its articles explain
this better than I can). It was added to Unicode only "recently", in
version 5.1 in 2008. In all-caps titles it is common to capitalize "ß" as
either SS or SZ (the latter to avoid ambiguities). I think it is uncertain
whether "ẞ" will be widely used.

If I understand Unicode case folding correctly, it exists so that Unicode
strings can be compared case-insensitively by converting them into a
canonical form. So simple case folding seems correct, as "ẞ" would be
folded to "ß".

However, if you keep the old orthography (before 1996) in mind and ask
what makes sense for a search engine, full case folding makes more sense.
As you noted, "Schloß" and "Schloss" should return the same results for
non-verbatim searches, as such a distinction would seemingly only be
relevant to linguists or historians, not to everyday use and business
information systems.

The Unicode standard is unfortunately vague about what it wants to achieve
with case folding and what thoughts went into the case folding table.
Perhaps you should ask on the Unicode mailing list.

You didn't describe your use case, but I would also generally advise using
a phonetic algorithm for German to canonicalize words for non-verbatim
searches instead of case folding. It gives better results, and most German
speakers I know appreciate the phonetic corrections of popular Internet
search engines for non-verbatim searches.

I hope this helps. It may also help to consult a linguist when building a
non-simplistic search engine for German. For example, you have to perform
compound splitting, stemming and some form of grammatical analysis at some
point.

- Matthias-Christian

