> From: Philippe Verdy <[email protected]>
> Date: Sun, 21 Feb 2016 00:19:19 +0100
> Cc: unicode Unicode Discussion <[email protected]>
>
> Unless we have case folding tailored by language, you cannot do that based
> on the Unicode database alone.
>
> However CLDR provides tailored data about collation.
>
> From my point of view, it is just a matter of selecting the collation
> strength to use for searches using collation. All collations in CLDR are
> locale-dependent (the search algorithm must be using either a language
> preselection, or detect the default language used by the document, or set
> explicitly in specific fragments of the document, or use some hints to
> guess what could be the effective language), even if CLDR also defines a
> "root" locale for use in language-neutral contexts, or when the language
> cannot be determined from the document or its metadata.

Emacs doesn't (yet) have the notion of the "current language". Emacs is a multi-lingual environment, where different languages are routinely mixed in the same editing buffer, so this is a hard problem that doesn't yet have a solution.

Emacs does know the "charset" from which the given text came, if the original was encoded in some telltale encoding, like iso-2022-jp; it can also know the script of the text (by looking at the Unicode block of the characters). In some cases, this is enough to deduce the language. But in general, and notably with languages that use the Latin script, this is not enough. Using the locale in which Emacs was started is insufficient in this age of global communications.

Therefore, the goal of what is currently implemented in what will become Emacs 25.1 in a few months was deliberately limited to begin with: support only "language-independent" folding. In a nutshell, this means ignoring all the collating weights except the primary. The implementation basically uses the decomposition data in UnicodeData.txt.

How different is that from the "root locale" data that is part of CLDR? What are the differences? Does the implementation based on decomposition data have any merit, or is it completely useless/wrong?
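
To make the question concrete, here is a minimal sketch of decomposition-based folding. It is written in Python for illustration only and is not the Emacs code; the function names fold_by_decomposition and folded_equal are hypothetical. It assumes that "using the decomposition data" amounts to applying the compatibility decomposition (NFKD) and dropping combining marks, and it adds case folding as a rough stand-in for ignoring tertiary differences.

```python
import unicodedata


def fold_by_decomposition(text: str) -> str:
    """Approximate language-independent folding from decomposition data.

    Hypothetical helper, not part of any library.
    """
    # NFKD applies the canonical and compatibility decomposition mappings
    # that originate in UnicodeData.txt.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (general categories Mn, Mc, Me), keeping only
    # the base characters.
    base = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Assumption: case differences are ignored too, as a primary-strength
    # comparison would ignore them.
    return base.casefold()


def folded_equal(a: str, b: str) -> bool:
    """Compare two strings after folding both sides."""
    return fold_by_decomposition(a) == fold_by_decomposition(b)


if __name__ == "__main__":
    print(fold_by_decomposition("R\u00e9sum\u00e9"))  # e with acute accent
    print(fold_by_decomposition("\ufb01le"))          # fi ligature
    print(fold_by_decomposition("s\u00f8k"))          # o with stroke
```

One visible limitation of this approach: a character such as U+00F8 (o with stroke) has no decomposition mapping in UnicodeData.txt, so it passes through unchanged, and any equivalence a collation table might define for it is not captured.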

