On Sat, 9 May 2015 16:54:30 +0200 Philippe Verdy <[email protected]> wrote:
> 2015-05-09 16:26 GMT+02:00 Richard Wordingham <[email protected]>:
>
> > In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
> > are Unicode strings, but that only two, namely <D800, DCC1, 0054>
> > and <0054, D800, DCC1>, are UTF-16 strings.
>
> Again you use "Unicode strings" for your 6 permutations, but in your
> example they have nothing that make them "Unicode strings", given you
> allow arbitrary code units in arbitrary order, including unpaired
> ones. The 6 permutations are just "16-bit strings" (adding "Unicode"
> for these 6 permutations gives absolutely no value if you keep your
> definition, but visibly it cannot fit with the term used in the RFC
> trying to normalize JSON, with similar confusions !).

> TUS does not define what is a "Unicode string" like you do here.

D80 _Unicode string:_ A code unit sequence containing code units of a
particular Unicode encoding form

RW: Note that by this definition, a permutation of a Unicode string is
a Unicode string.

D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16
code units.

D85 _Well-formed:_ A Unicode code unit sequence that purports to be in
a Unicode encoding form is called well-formed if and only if it _does_
follow the specification of that Unicode encoding form

D89 _In a Unicode encoding form:_ A Unicode string is said to be in a
particular Unicode encoding form if and only if it consists of a
well-formed Unicode code unit sequence of that Unicode encoding form.

• A Unicode string consisting of a well-formed UTF-8 code unit
  sequence is said to be _in UTF-8_. Such a Unicode string is referred
  to as a _valid UTF-8 string_, or a _UTF-8 string_ for short.

• A Unicode string consisting of a well-formed UTF-16 code unit
  sequence is said to be _in UTF-16_. Such a Unicode string is
  referred to as a _valid UTF-16 string_, or a _UTF-16 string_ for
  short.

• A Unicode string consisting of a well-formed UTF-32 code unit
  sequence is said to be _in UTF-32_. Such a Unicode string is
  referred to as a _valid UTF-32 string_, or a _UTF-32 string_ for
  short.

> TUS just defines "Unicode 16-bit strings" with a direct reference to
> UTF-16 (which implies conformance and only accepts the later two
> strings, that TUS names "Unicode 16-bit strings", not "UTF-16
> strings"...)

Look at D82 again. It refers to UTF-16 code units and does not
otherwise reference UTF-16.

If you still do not believe me, consider D89. Can you think of an
example of a Unicode string consisting of UTF-8 code units, UTF-16 code
units or UTF-32 code units that is not a UTF-8 string, not a UTF-16
string and not a UTF-32 string? If you can't, the use of "well-formed"
is curiously redundant in D89.

Richard.
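
For anyone who wants to check the permutation claim mechanically, here
is a minimal sketch in Python. The function names
is_unicode_16bit_string and is_well_formed_utf16 are my own labels for
this illustration, not anything taken from TUS, and the sketch assumes
that a "UTF-16 code unit" in D82 is simply any 16-bit value and that
well-formedness for UTF-16 comes down to every surrogate being part of
a <high, low> pair.

```python
from itertools import permutations


def is_unicode_16bit_string(units):
    """D82, as read here: every element is a 16-bit code unit."""
    return all(0x0000 <= u <= 0xFFFF for u in units)


def is_well_formed_utf16(units):
    """D85/D89, as read here: no unpaired surrogate code units."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed at once by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            return False
        i += 1
    return True


perms = list(permutations((0xD800, 0x0054, 0xDCC1)))
well_formed = [p for p in perms if is_well_formed_utf16(p)]

for p in perms:
    label = "UTF-16 string" if p in well_formed else "Unicode 16-bit string only"
    print("<" + ", ".join(f"{u:04X}" for u in p) + ">", label)
```

Under those assumptions all six permutations pass the D82 test, while
only <D800, DCC1, 0054> and <0054, D800, DCC1> pass the well-formedness
test.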

