Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

Markus Scherer Tue, 05 Jun 2001 15:35:58 -0700

Personally, I find it interesting to see which and how many characters are affected by 
the difference in binary ordering between UTF-8 and UTF-16.
Affected are all code points in two ranges:
    U+e000..U+ffff
    U+10000..U+10ffff
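
As a minimal sketch of why exactly these two ranges are affected: UTF-16 encodes supplementary code points with surrogate code units (0xd800..0xdfff), which compare below the code units for U+e000..U+ffff, while in UTF-8 the four-byte sequences for supplementary code points compare above every three-byte sequence. The helper names and sample characters below are illustrative only, and big-endian UTF-16 is assumed so that byte order matches code unit order.

```python
# Compare binary ordering of the same strings under UTF-8 and UTF-16.
# utf8_key/utf16_key are hypothetical helper names for this sketch.

def utf8_key(s: str) -> bytes:
    # Byte-wise comparison of UTF-8 matches code point order.
    return s.encode("utf-8")

def utf16_key(s: str) -> bytes:
    # Big-endian, no BOM, so byte-wise comparison matches
    # 16-bit code unit comparison.
    return s.encode("utf-16-be")

# One-character strings, listed in code point order.
samples = ["\u0041", "\ud7ff", "\ue000", "\uffff", "\U00010000", "\U0010ffff"]

by_utf8 = sorted(samples, key=utf8_key)
by_utf16 = sorted(samples, key=utf16_key)

for label, ordering in (("UTF-8 ", by_utf8), ("UTF-16", by_utf16)):
    print(label, " ".join(f"U+{ord(s):04x}" for s in ordering))
```

Under these assumptions the UTF-8 order is plain code point order, while the UTF-16 order places both supplementary samples ahead of U+e000 and U+ffff. Code points below U+d800 keep the same relative position in both orderings.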

The second range contains assignments for characters that are "rare" in the "average 
text".

The first range is interesting: it consists mostly of the PUA range of the BMP, some 
"specials", and compatibility character assignments.
There are - aside from private use characters and the specials U+fff0..U+fffd - only 
20 code points that "survive" an NFKD transformation:

    12 CJK Unified Ideographs (U+fa__)
    1 U+fb1e HEBREW POINT JUDEO-SPANISH VARIKA
    2 ornate parentheses (U+fd3e/f)
    2 combining ligature halves (U+fe20/1)
    2 combining tilde halves (U+fe22/3)
    1 U+feff ZWNBSP
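
A count like the one above can be approximated with Python's standard `unicodedata` module. This is only a sketch: the function name `nfkd_survivors` is made up for illustration, and the result depends on the Unicode version bundled with the interpreter (`unicodedata.unidata_version`), so it may well differ from the 20 listed here, which reflect the character assignments at the time of writing.

```python
import unicodedata

def nfkd_survivors(first: int = 0xE000, last: int = 0xFFFF) -> list[int]:
    """Return code points in [first, last] that are assigned, not private
    use, not in the specials U+fff0..U+fffd, and unchanged by NFKD."""
    survivors = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category in ("Cn", "Co"):   # unassigned/noncharacter, private use
            continue
        if 0xFFF0 <= cp <= 0xFFFD:     # the "specials" excluded above
            continue
        if unicodedata.normalize("NFKD", ch) == ch:
            survivors.append(cp)
    return survivors

found = nfkd_survivors()
print("Unicode data version:", unicodedata.unidata_version)
print("survivors:", len(found))
for cp in found:
    print(f"U+{cp:04x} {unicodedata.name(chr(cp), '<unnamed>')}")
```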

So, given normalized text (NFKD), there are only 20 assigned, non-compatibility, 
non-special characters that sort either before or after those "very rare" 
supplementary characters when one binary sorts UTF-8/16 strings.

I leave it up to the list to consider this... ;-)

markus
