Personally, I find it interesting to see which and how many characters are affected by
the difference in binary ordering between UTF-8 and UTF-16.
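Affected are all code points in two ranges, because UTF-16 encodes the second range
with surrogate code units (0xd800..0xdfff), which sort below the code units of the first: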
U+e000..U+ffff
U+10000..U+10ffff
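
To make the difference concrete, here is a minimal sketch in Python that sorts three sample code points by their UTF-8 bytes and by their UTF-16 code units. The sample characters and the helper names (utf8_key, utf16_key) are my own picks for illustration, not anything from the standard; big-endian UTF-16 bytes are used only because they compare the same way as 16-bit code units.

```python
# One code point below U+E000, one in U+E000..U+FFFF, one supplementary.
# These particular characters are arbitrary choices for illustration.
samples = ["\ud7ff", "\uff5e", "\U00010000"]


def utf8_key(s: str) -> bytes:
    """Binary sort key: the UTF-8 byte sequence."""
    return s.encode("utf-8")


def utf16_key(s: str) -> bytes:
    """Binary sort key: UTF-16 code units (big-endian bytes compare like code units)."""
    return s.encode("utf-16-be")


utf8_order = sorted(samples, key=utf8_key)
utf16_order = sorted(samples, key=utf16_key)

# UTF-8 keeps code point order; UTF-16 puts the supplementary
# character (surrogate pair D800 DC00) before U+FF5E.
print([f"U+{ord(c):04X}" for c in utf8_order])   # ['U+D7FF', 'U+FF5E', 'U+10000']
print([f"U+{ord(c):04X}" for c in utf16_order])  # ['U+D7FF', 'U+10000', 'U+FF5E']
```

Only the relative order of the last two entries changes; code points below U+E000 sort the same way in both encoding forms.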
The second range contains assignments for characters that are "rare" in the "average
text".
The first range is interesting: it consists mostly of the PUA range of the BMP, some
"specials", and compatibility character assignments.
There are - aside from private use characters and the specials U+fff0..U+fffd - only
20 code points that "survive" an NFKD transformation:
12 CJK Unified Ideographs (U+fa__)
1 U+fb1e HEBREW POINT JUDEO-SPANISH VARIKA
2 ornate parentheses (U+fd3e/f)
2 combining ligature halves (U+fe20/1)
2 combining tilde halves (U+fe22/3)
1 U+feff ZWNBSP
So, given normalized text (NFKD), there are only 20 assigned, non-compatibility,
non-special characters whose position relative to those "very rare" supplementary
characters changes (before them in UTF-8, after them in UTF-16) when one binary
sorts UTF-8/16 strings.
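
A count like this can be reproduced roughly with Python's unicodedata module. The sketch below is an assumption about how one might do it, not the method used for the numbers above; the function name nfkd_survivors is hypothetical. The result depends on the Unicode version bundled with the interpreter, and later versions assign more characters in this range (for example the variation selectors), so expect more than 20 on a current Python.

```python
import unicodedata


def nfkd_survivors(start: int = 0xE000, end: int = 0xFFFF) -> list[int]:
    """Code points in [start, end] that are assigned, not private use,
    not in the specials block U+FFF0..U+FFFD, and unchanged by NFKD."""
    survivors = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        # Co = private use, Cn = unassigned (includes noncharacters).
        if unicodedata.category(ch) in ("Co", "Cn"):
            continue
        if 0xFFF0 <= cp <= 0xFFFD:
            continue
        if unicodedata.normalize("NFKD", ch) == ch:
            survivors.append(cp)
    return survivors


survivors = nfkd_survivors()
print(unicodedata.unidata_version, len(survivors))
for cp in survivors:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp), "<unnamed>"))
```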
I leave it up to the list to consider this... ;-)
markus

