From: "Doug Ewell" <[EMAIL PROTECTED]>
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> > Unicode defines only 4 *standard* normalization forms (NFC, NFD, NFKC,
> > NFKD), but other *non-standard* normalization forms are possible:
>
> But should not be used. It can be tricky enough getting the four standard ones right as it is.

Wrong. Non-standard normalization forms are useful too, and can even be safe if they preserve one of the two standard equivalences (canonical or compatibility).


There are many cases in which a non-standard normalization form that still preserves canonical equivalence must be used: NFC and NFD are not always good enough, because of the way combining classes are defined and then immutably frozen, and because new characters have been added to Unicode that, under the stability policy, cannot be given a useful and obvious canonical equivalence.
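
To make this concrete, here is a minimal Python sketch of one such non-standard form. It is only an illustration: the form itself ("NFD, but precomposed Hangul syllables are left untouched") and the names nfd_keep_hangul and canonically_equivalent are hypothetical, chosen for this example. It relies on the fact that Hangul syllables and the jamos they decompose to all have combining class 0, so canonical reordering never crosses them and the text between them can be decomposed independently.

```python
import re
import unicodedata

# Precomposed Hangul syllables occupy U+AC00..U+D7A3.
_HANGUL_SYLLABLE = re.compile("([\uAC00-\uD7A3])")


def nfd_keep_hangul(text):
    """Hypothetical non-standard form: NFD everywhere except Hangul syllables."""
    # split() with a capturing group keeps the syllables in the result list.
    parts = _HANGUL_SYLLABLE.split(text)
    return "".join(
        part if _HANGUL_SYLLABLE.fullmatch(part)
        else unicodedata.normalize("NFD", part)
        for part in parts
    )


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent iff their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


sample = "\u00e9\ud55c"          # e-acute followed by one Hangul syllable
result = nfd_keep_hangul(sample)

# The result is neither NFC nor NFD, yet it stays canonically equivalent.
print(canonically_equivalent(sample, result))
print(result == unicodedata.normalize("NFD", sample))
print(result == unicodedata.normalize("NFC", sample))
```

The safety check is the important part: any transform whose output passes canonically_equivalent against its input, for every input, preserves canonical equivalence even though it is not one of the four standard forms.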

Some transformations can't be named "normalization" under Unicode, although they should be: for example the unification of decomposed SSANG* jamos in Hangul, or the removal of unnecessary occurrences of CGJ in combining sequences. Users consider such text transforms to be normalization, but Unicode sees them differently.
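
As a rough sketch of the first case, the following Python unifies doubled leading-consonant jamos into the corresponding SSANG* jamo. The function name unify_ssang_choseong and the table SSANG_MAP are hypothetical, and the sketch assumes only the five doubled initial consonants need handling; a real implementation would also have to cover final consonants and decide how to treat longer runs. The character names are looked up in the Unicode database rather than hard-coded.

```python
import unicodedata

# Leading consonants (choseong) that have a doubled SSANG* counterpart.
_DOUBLED = ("KIYEOK", "TIKEUT", "PIEUP", "SIOS", "CIEUC")

# Map "two identical single jamos" -> "one SSANG jamo", built from the names.
SSANG_MAP = {
    2 * unicodedata.lookup("HANGUL CHOSEONG " + name):
        unicodedata.lookup("HANGUL CHOSEONG SSANG" + name)
    for name in _DOUBLED
}


def unify_ssang_choseong(text):
    """Replace each doubled leading consonant with its SSANG* jamo."""
    for pair, ssang in SSANG_MAP.items():
        text = text.replace(pair, ssang)
    return text


decomposed = "\u1100\u1100\u1161"   # KIYEOK, KIYEOK, vowel A
unified = unify_ssang_choseong(decomposed)

# The two spellings are NOT canonically equivalent, so no standard
# normalization form will ever map one to the other.
print(unicodedata.normalize("NFD", decomposed) ==
      unicodedata.normalize("NFD", unified))
```

This is exactly why such a transform falls outside what Unicode calls normalization: it changes the string across a canonical-equivalence boundary, even though users see both spellings as the same text.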




