On Sat, 9 May 2015 07:55:17 +0200 Philippe Verdy <[email protected]> wrote:
> 2015-05-09 6:37 GMT+02:00 Markus Scherer <[email protected]>:
>
> > On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy
> > <[email protected]> wrote:
> >
> >> 2015-05-09 5:13 GMT+02:00 Richard Wordingham
> >> <[email protected]>:

WARNING: This post belongs in pedants' corner, or possibly a
pantomime.

> >>> I can't think of a practical use for the specific concepts of
> >>> Unicode 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings
> >>> are essentially the same as 16-bit strings, and Unicode 32-bit
> >>> strings are UTF-32 strings. 'Unicode 8-bit string' strikes me
> >>> as an exercise in pedantry; there are more useful categories of
> >>> 8-bit strings that are not UTF-8 strings.

> >> And here you're wrong: a 16-bit string is just a sequence of
> >> arbitrary 16-bit code units, but a Unicode string (whatever the
> >> size of its code units) adds restrictions for validity (the only
> >> restriction being in fact that surrogates (when present in 16-bit
> >> strings, i.e. UTF-16) must be paired, and in 32-bit (UTF-32) and
> >> 8-bit (UTF-8) strings, surrogates are forbidden.

You are thinking of a Unicode string as a sequence of codepoints. Now
that may be a linguistically natural interpretation of 'Unicode
string', but 'Unicode string' has a different interpretation, given in
D80. A 'Unicode string' (D80) is a sequence of code units occurring in
some Unicode encoding form. By this definition, every permutation of
the code units in a Unicode string is itself a Unicode string. UTF-16
is unique in that every code unit corresponds to a codepoint. (We
could extend the Unicode codespace (D9, D10) by adding integers for
the bytes of multibyte UTF-8 encodings, but I see no benefit.)

A Unicode 8-bit string may have no interpretation as a sequence of
codepoints. For example, the 8-bit string <C2, A0> is a Unicode 8-bit
string denoting a sequence of one Unicode scalar value, namely U+00A0.
Since the byte A0 thus occurs in well-formed UTF-8, it is a UTF-8 code
unit, and <A0, A0> is therefore also a Unicode 8-bit string; but it
has no defined or obvious interpretation as a sequence of codepoints,
and it is *not* a UTF-8 string. The string <E0, 80, 80> is also a
Unicode 8-bit string, but it is not a UTF-8 string, because the
sequence is not the shortest representation of U+0000. The 8-bit
string <C0, 80> is *not* a Unicode 8-bit string, for the byte C0 does
not occur in well-formed UTF-8; one does not even need to note that
<C0, 80> is not the shortest representation of U+0000.

> > No, Richard had it right. See for example definition D82 "Unicode
> > 16-bit string" in the standard. (Section 3.9 Unicode Encoding
> > Forms, http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)

> I was right, D82 refers to "UTF-16", which implies the restriction of
> validity, i.e. NO isolated/unpaired surrogates, (but no exclusion of
> non-characters).

No, D82 merely requires that each 16-bit value be a valid UTF-16 code
unit, and every 16-bit value is one. Unicode strings, and Unicode
16-bit strings in particular, need not be well-formed. For x = 8, 16,
32, a 'UTF-x string', equivalently a 'valid UTF-x string', is one
that is well-formed in UTF-x.

> I was right, You and Richard were wrong.

I stand by my explanation. I wrote it with TUS open at the
definitions by my side.

Richard.
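
P.S. For the concrete-minded, here is a rough Python sketch of the
distinction drawn above. The helper names are mine, not the
standard's, and I read "is a UTF-8 code unit" as "occurs somewhere in
well-formed UTF-8", i.e. any byte except C0, C1 and F5..FF (Table
3-7):

    def is_unicode_8bit_string(bs: bytes) -> bool:
        # Every byte is a UTF-8 code unit; well-formedness is not required.
        return all(b not in (0xC0, 0xC1) and b <= 0xF4 for b in bs)

    def is_utf8_string(bs: bytes) -> bool:
        # Well-formed UTF-8: a strict decoder accepts it (it rejects
        # overlong forms such as <E0, 80, 80> and encoded surrogates).
        try:
            bs.decode('utf-8', errors='strict')
            return True
        except UnicodeDecodeError:
            return False

    for bs in (b'\xC2\xA0', b'\xA0\xA0', b'\xE0\x80\x80', b'\xC0\x80'):
        print(bs.hex(' '), is_unicode_8bit_string(bs), is_utf8_string(bs))
    # c2 a0     True  True   -- UTF-8 for U+00A0
    # a0 a0     True  False  -- a Unicode 8-bit string, but not UTF-8
    # e0 80 80  True  False  -- overlong U+0000, so not UTF-8
    # c0 80     False False  -- C0 never occurs in well-formed UTF-8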
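
P.P.S. The same exercise for D82, again only a sketch: a lone
surrogate makes a perfectly good Unicode 16-bit string, because every
16-bit value is a valid UTF-16 code unit, yet the result is not a
UTF-16 string. I use a strict UTF-16BE decode as the well-formedness
test:

    units = [0x0041, 0xD800]      # 'A' followed by an unpaired high surrogate
    data = b''.join(u.to_bytes(2, 'big') for u in units)
    try:
        data.decode('utf-16-be')  # strict decoding rejects unpaired surrogates
        print('well-formed UTF-16')
    except UnicodeDecodeError:
        print('a Unicode 16-bit string, but not a UTF-16 string')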

