2015-05-09 16:26 GMT+02:00 Richard Wordingham <[email protected]>:
> In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
> are Unicode strings, but that only two, namely <D800, DCC1, 0054> and
> <0054, D800, DCC1>, are UTF-16 strings.

Again you use "Unicode strings" for your 6 permutations, but in your example they have nothing that makes them "Unicode strings", given that you allow arbitrary code units in arbitrary order, including unpaired surrogates. The 6 permutations are just "16-bit strings" (adding "Unicode" to these 6 permutations gives absolutely no value if you keep your definition, and visibly it cannot fit with the term used in the RFC trying to normalize JSON, which creates similar confusion!).

TUS does not define "Unicode string" the way you do here. TUS just defines "Unicode 16-bit strings" with a direct reference to UTF-16 (which implies conformance and accepts only the latter two strings, which TUS names "Unicode 16-bit strings", not "UTF-16 strings"...).

TUS goes further by then distinguishing its encoding schemes (taking into account their serialization to 8-bit streams, and also the byte order, to define the 3 supported UTF-16 encoding schemes: with or without a BOM): only then does a "UTF-16 string" become "UTF-16 encoded text" (UTF-16, UTF-16BE, or UTF-16LE).

Note also that I used the term "stream" instead of "string" only to avoid restricting the length (but JSON does not support encoding streams of arbitrary length: all of them must have a start, an end, and a defined bounded length, while streams do not necessarily have any defined length property, independently of how we measure length: in bytes, code units, code points, combining sequences, grapheme clusters...).
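To make the distinction concrete, here is a minimal sketch of my own (not from TUS or the RFC; the helper name is_well_formed_utf16 is just an illustration) in Python. It checks which of the 6 permutations are well-formed UTF-16, i.e. every high surrogate is immediately followed by a low surrogate and no low surrogate appears unpaired, and then shows how a well-formed code unit sequence is serialized to bytes under the three encoding schemes:

    from itertools import permutations
    import struct

    def is_well_formed_utf16(units):
        """True if the 16-bit code unit sequence is well-formed UTF-16."""
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:      # high (lead) surrogate
                if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                    return False           # unpaired high surrogate
                i += 2                     # valid surrogate pair
            elif 0xDC00 <= u <= 0xDFFF:    # low (trail) surrogate
                return False               # unpaired low surrogate
            else:
                i += 1                     # BMP code unit
        return True

    for p in permutations((0xD800, 0x0054, 0xDCC1)):
        print(["%04X" % u for u in p], is_well_formed_utf16(p))

    # Serialization to 8-bit streams ("UTF-16 encoded text"): the three
    # encoding schemes differ in byte order and in whether a BOM is allowed.
    units = (0x0054, 0xD800, 0xDCC1)                      # one well-formed permutation
    be = b"".join(struct.pack(">H", u) for u in units)    # UTF-16BE scheme, no BOM
    le = b"".join(struct.pack("<H", u) for u in units)    # UTF-16LE scheme, no BOM
    bom = struct.pack(">H", 0xFEFF) + be                  # "UTF-16" scheme (here big-endian, BOM prepended)
    print(be.hex(), le.hex(), bom.hex())

Running the loop prints True for exactly the two permutations cited above, <D800, DCC1, 0054> and <0054, D800, DCC1>, and False for the other four.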

