On Sun, 10 May 2015 07:42:14 +0200 Philippe Verdy <[email protected]> wrote:
I am replying out of order for greater coherence of my reply.

> However I wonder what would be the effect of D80 in UTF-32: is
> <0xFFFFFFFF> a valid "32-bit string"? After all it is also
> containing a single 32-bit code unit (for at least one Unicode
> encoding form), even if it has no "scalar value" and then does not
> have to validate D89 (for UTF-32)...

The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it
cannot represent a unit of encoded text in a UTF-32 string. By D77
paragraph 1, "Code unit: The minimal bit combination that can
represent a unit of encoded text for processing or interchange", it
is therefore not a code unit. The effect of D77, D80 and D83 is that
<0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.

> - D80 defines "Unicode string" but in fact it just defines a generic
> "string" as an arbitrary stream of fixed-size code units.

No - see the argument above.

> These two rules [D80 and D82 - RW] are not productive at all, except
> for saying that all values of fixed-size code units are acceptable
> (including for example 0xFF in 8-bit strings, which is invalid in
> UTF-8)

Do you still maintain this reading of D77? D77 is not as clear as it
should be.

> <snip> D80 and D82 have no purpose, except adding the term "Unicode"
> redundantly to these expressions.

I have the cynical suspicion that these definitions were added to
preserve the interface definitions of routines processing UCS-2
strings when the transition to UTF-16 occurred. They can also have
the (intentional?) side-effect of making more work for UTF-8 and
UTF-32 processing, because arbitrary 8-bit strings and 32-bit strings
are not Unicode strings.

Richard.
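
[Editor's note: to make the distinction concrete, here is a minimal C
sketch, illustrative only and not part of the original exchange, of
the two checks implied above: whether a 32-bit code unit value is a
Unicode scalar value and hence admissible in well-formed UTF-32, and
whether a given byte can appear at all in well-formed UTF-8. The
function names are invented for the example.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A 32-bit code unit is admissible in well-formed UTF-32 only if it
   is a Unicode scalar value: at most 0x10FFFF and not a surrogate
   code point.  0xFFFFFFFF fails the range test, so a 32-bit string
   containing it is not a Unicode 32-bit string. */
static bool is_utf32_code_unit(uint32_t u)
{
    return u <= 0x10FFFF && !(u >= 0xD800 && u <= 0xDFFF);
}

/* 0xC0, 0xC1 and 0xF5..0xFF can never appear anywhere in well-formed
   UTF-8, so an arbitrary 8-bit string need not be a UTF-8 string. */
static bool byte_can_appear_in_utf8(uint8_t b)
{
    return b != 0xC0 && b != 0xC1 && b < 0xF5;
}

int main(void)
{
    printf("0xFFFFFFFF admissible in UTF-32? %d\n",
           is_utf32_code_unit(0xFFFFFFFFu));      /* prints 0 */
    printf("0xFF possible in well-formed UTF-8? %d\n",
           byte_can_appear_in_utf8(0xFF));        /* prints 0 */
    return 0;
}

[A conformant UTF-32 or UTF-8 consumer must apply checks of this
kind when handed an arbitrary 32-bit or 8-bit string, which is the
"more work" alluded to above.]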

