On Sun, 11 Oct 2015 21:36:49 -0700 Ken Whistler <[email protected]> wrote:
> I think the correct answer is probably:
>
> (c) The ill-formed three code unit Unicode 16-bit string
> <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020, and
> one uninterpreted (and uninterpretable) low surrogate
> code unit 0xDC00.
>
> In other words, I don't think it is useful or helpful to map isolated,
> uninterpretable surrogate code units *to* surrogate code points.
> Surrogate code points are an artifact of the code architecture. They
> are code points in the code space which *cannot* be represented
> in UTF-16, by definition.
>
> Any discussion about properties for surrogate code points is a
> matter of designing graceful API fallback for instances which
> have to deal with ill-formed strings and do *something*. I don't
> think that should extend to treating isolated surrogate code
> units as having interpretable status, *as if* they were valid
> code points represented in the string.

Graceful fallback is exactly where the issue arises.  Throwing an
exception is not a useful answer to the question of how many code
points a 'Unicode string' (not a 'UTF-16 string') contains.  The
question can arise when one is following an instruction to advance x
codepoints; the usual presumption is that the preferred response is
to advance exactly x scalar values and not advance over anything
else.

> It might be easier to get a handle on this if folks were to ask,
> instead, how many code points are in the ill-formed Unicode 8-bit
> string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes,
> but how many code points? I'd say two code points and
> 4 uninterpretable, ill-formed UTF-8 code units, rather than
> any other possible answer.

In this case I'd say three 'somethings', and define 'something'
accordingly.  There are different ideas as to what a 'something'
should be.  Having a clear definition matters when moving backwards
and forwards through a Unicode 8-bit string.

> Basically, you get the same kind of answer if the ill-formed string
> were, instead, <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points
> and 3 uninterpretable, ill-formed UTF-8 code units. That is a
> better answer than trying to map 0xED 0xA0 0x80 to U+D800
> and then saying, oh, that is a surrogate code *point*.

A simple scenario is a filter that takes in a single byte (or EOF) at
a time and returns a scalar value, 'no character yet', 'corrupt' or
'end of text'.  It is a significant complication for such a filter to
have to emit sequences of values indicating uninterpretable bytes.
(A sketch of such a filter follows below.)

I've found it much easier to treat bad sequences of UTF-8 code units
that are bad by reason of their length and indicated scalar value as
a single entity.  This simplifies moving forwards and backwards
through strings to just detecting non-continuation bytes and limiting
traversal through runs of continuation bytes.  Otherwise, one must
also check the following continuation byte for a valid range.

For example, if one starts at position 5 in your first example, just
before the second 'a' (0x61), one faces the following logic when
moving back one codepoint:

1) Provisionally back up to position 1, just before the 0xF4.
2) Confirm that one has skipped no more than 3 continuation bytes.
3) Confirm that at least 3 continuation bytes follow the 0xF4.
4) Examine the first continuation byte, 0x90, and realise that it is
   not a legal value after 0xF4.
5) Change to moving back one byte instead, arriving at position 4,
   just before the last 0x90.

It gets even more complicated if one follows the "maximal subpart"
approach of TUS Ch. 3.
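
For concreteness, here is a rough C sketch of that backward move
under the single-entity treatment.  The function names are mine, and
this is a sketch of the idea rather than tested production code:

    #include <stddef.h>
    #include <stdint.h>

    /* Length declared by an initial byte: 1 for ASCII and for the
       lone bytes 0xF8-0xFF, 0 for a continuation byte. */
    static int declared_len(uint8_t b)
    {
        if (b < 0x80) return 1;   /* ASCII */
        if (b < 0xC0) return 0;   /* continuation byte */
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        if (b < 0xF8) return 4;
        return 1;                 /* 0xF8-0xFF: lone byte */
    }

    /* Move back one 'something' from position pos (pos > 0) in s.
       A lead byte whose declared length exactly covers the run of
       continuation bytes before pos counts as one entity, whether
       or not its value is a scalar value; any other byte is a
       singleton. */
    size_t back_one(const uint8_t *s, size_t pos)
    {
        size_t i = pos;
        int skipped = 0;

        /* Back up over at most 3 continuation bytes. */
        while (i > 0 && skipped < 3 && (s[i - 1] & 0xC0) == 0x80) {
            i--;
            skipped++;
        }
        if (i > 0 && declared_len(s[i - 1]) == skipped + 1)
            return i - 1;         /* one whole entity */
        return pos - 1;           /* fall back to one byte */
    }

From position 5 in <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61> this lands at
position 1, treating <0xF4, 0x90, 0x90, 0x90> as a single entity; no
continuation byte ever needs its range checked.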
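
As for the single-byte filter mentioned earlier, it might look
something like the sketch below.  The interface and names are
invented for illustration, and to keep it simple it reports every bad
sequence merely as 'corrupt', without distinguishing the kinds of
badness:

    #include <stdint.h>

    enum u8_status { U8_SCALAR, U8_PENDING, U8_CORRUPT, U8_END };

    struct u8_filter {
        uint32_t value;   /* value accumulated so far */
        uint32_t min;     /* smallest value legal at this length */
        int      need;    /* continuation bytes still expected */
    };

    /* Feed one byte (0..255), or -1 for end of input.  On U8_SCALAR,
       *out holds the decoded scalar value.  After U8_CORRUPT for a
       truncated sequence, the offending byte has not been consumed
       and should be fed again. */
    enum u8_status u8_feed(struct u8_filter *f, int byte, uint32_t *out)
    {
        if (byte < 0) {                     /* end of input */
            if (f->need) { f->need = 0; return U8_CORRUPT; }
            return U8_END;
        }
        uint8_t b = (uint8_t)byte;
        if (f->need) {
            if ((b & 0xC0) != 0x80) {       /* truncated sequence */
                f->need = 0;
                return U8_CORRUPT;
            }
            f->value = (f->value << 6) | (b & 0x3F);
            if (--f->need)
                return U8_PENDING;          /* no character yet */
            if (f->value < f->min ||        /* non-shortest form */
                (f->value >= 0xD800 && f->value <= 0xDFFF) ||
                f->value > 0x10FFFF)        /* not a scalar value */
                return U8_CORRUPT;
            *out = f->value;
            return U8_SCALAR;
        }
        if (b < 0x80) { *out = b; return U8_SCALAR; }
        if (b >= 0xC2 && b <= 0xDF) {
            f->value = b & 0x1F; f->min = 0x80; f->need = 1;
            return U8_PENDING;
        }
        if (b >= 0xE0 && b <= 0xEF) {
            f->value = b & 0x0F; f->min = 0x800; f->need = 2;
            return U8_PENDING;
        }
        if (b >= 0xF0 && b <= 0xF4) {
            f->value = b & 0x07; f->min = 0x10000; f->need = 3;
            return U8_PENDING;
        }
        return U8_CORRUPT;  /* lone continuation, C0, C1, F5-FF */
    }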

By contrast, one can even report the bad sequences in a 21-bit
extension of Unicode.  For example, one could use bits 20:16 to
encode the problem, e.g.:

  0-16 => Valid scalar value (excludes 0xD800 to 0xDFFF)

1) Numbers that look like scalar values:

  1.1) Value not a scalar value:
    17 => 11xxxx (start F4 9y)
    18 => 12xxxx (start F4 Ay)
    19 => 13xxxx (start F4 By)
    20 => Surrogate codepoint (start ED Ay or ED By) (2^11 seqq.)

  1.2) Non-shortest form:
    21 => 4 bytes long (start F0 8y) (image of BMP)
    22 => 3 bytes long (start E0 8y or E0 9y) (2^11 seqq.)
    23 => 2 bytes long (start C0 or C1) (image of ASCII)*

2) Uninterpretable sequences:
    24 => Declared length 4 but actually 3 long (5 * 2^12 seqq.)
    25 => Declared length 4 but actually 2 long (5 * 2^6 seqq.)
    26 => Declared length 3 but actually 2 long (2^10 seqq.)
    27 => Non-ASCII lone bytes (2^7 seqq.)*

* Not necessarily composed of UTF-8 code units.

In this scheme, <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61> would be
analysed as <U+0061, V+110410, U+0061>, and the application could
decide what to do with V+110410.  It'd probably just be replaced by
U+FFFD.
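
If it helps to see the arithmetic: V+110410 is simply the raw 21-bit
value of the sequence.  A sketch for the right-length four-byte cases
(the function name is mine):

    #include <stdint.h>

    /* Raw 21-bit value of a 4-byte sequence <lead, c1, c2, c3> whose
       length is right but whose value need not be a scalar value.
       For start bytes F4 9y, F4 Ay and F4 By, bits 20:16 of the
       result come out as 17, 18 and 19, matching classes 17-19
       above. */
    uint32_t raw_value_4(const uint8_t s[4])
    {
        return ((uint32_t)(s[0] & 0x07) << 18)
             | ((uint32_t)(s[1] & 0x3F) << 12)
             | ((uint32_t)(s[2] & 0x3F) << 6)
             |  (uint32_t)(s[3] & 0x3F);
    }

Applied to <0xF4, 0x90, 0x90, 0x90> this gives 0x110410, and
0x110410 >> 16 == 17.  Classes 20 to 23 would need a remapping step
on top, since the raw values of surrogates and non-shortest forms
collide with valid plane 0 values.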

Richard.
