Copying the Unicode mailing list.

Masatoshi Kimura <VYV03354 at nifty dot ne dot jp> wrote:

> (2012/04/19 9:33), Doug Ewell wrote:
>> Given the sequence F8 80 80 80 80, the Unicode Standard specifies
>> that a decoder should recognize F8 as an invalid UTF-8 code unit, do
>> whatever it does on an error condition, and then continue with the
>> next byte. This will generate 5 error conditions if handling of
>> errors includes trying to continue.
>
> Where TUS defines this? It seems to contradict TUS 6.1.0 p.96:
> http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#page=42
> |Although a UTF-8 conversion process is required to never consume
> |well-formed subsequences as part of its error handling for ill-formed
> |subsequences, such a process is not otherwise constrained in how it
> |deals with any ill-formed subsequence itself. An ill-formed
> |subsequence consisting of more than one code unit could be treated as
> |a single error or as multiple errors. For example, in processing the
> |UTF-8 code unit sequence <F0 80 80 41>, the only formal requirement
> |mandated by Unicode conformance for a converter is that the <41> be
> |processed and correctly interpreted as <U+0041>. The converter could
> |return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or
> |<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as
> |a separate error, or could take other approaches to signalling <F0 80
> |80> as an ill-formed code unit subsequence.

I remembered reading a statement from UTC that interpretation of an
ill-formed sequence was supposed to terminate as soon as the sequence
was determined to be ill-formed. Conformance definition C10 does say:

> For example, in UTF-8 every byte of the form 110xxxxx₂ must be
> followed with a byte of the form 10xxxxxx₂. A sequence such as
> <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When
> faced with this illegal byte sequence while transforming or
> interpreting, a UTF-8 conformant process must treat the first byte
> 110xxxxx₂ as an illegal termination error: for example, either
> signaling an error, filtering the byte out, or representing the byte
> with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two
> cases, it will continue processing at the second byte 0xxxxxxx₂.

A lead byte of 11111000₂ (F8) is ill-formed.

And in fact, the section of TUS that Masatoshi quoted goes on to say:

> Using the definition for maximal subpart, the best practice can be
> stated simply as:
>
> Whenever an unconvertible offset is reached during conversion of a
> code unit sequence:
>
> 1. The maximal subpart at that offset should be replaced by a single
>    U+FFFD.
>
> 2. The conversion should proceed at the offset immediately after the
>    maximal subpart.

However, this description does use the word "should," not "must," and
it goes on (on the same page) to offer a table with three "possible
alternative approaches" for mapping an ill-formed UTF-8 sequence into
characters. It recommends the method described above, but allows the
other two.

So the bottom line is that Masatoshi is right: the Unicode Standard
does not specify that a decoder *must* respond to an invalid lead byte
as I said, only that it *should*. I agree that this is unnecessarily
vague. Whether this calls for a complete recasting of the definition of
UTF-8 by WHATWG, or by any individual contributors therein, is of
course a different matter.

> It is exactly a purpose of Encoding Standard to avoid these kind of
> vagueness.

Again, I'm not sure whether it is within the authority or
responsibility of WHATWG or any individual to provide a "better"
definition of a Unicode encoding form than that provided by Unicode. I
do understand the desire to nail down the various legacy encodings,
such as Shift-JIS, that have been interpreted over the years in very
flexible and confusing ways.
I don't think UTF-8 falls into this category at all.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
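
P.S. To make the "maximal subpart" practice quoted above concrete, here
is a minimal Python sketch of one way a decoder could follow it. This
is only an illustration under my own assumptions, not text from TUS:
the names decode_maximal_subpart and _trail_ranges are made up for this
example, and the byte ranges are my transcription of the well-formed
UTF-8 table in TUS chapter 3 (Table 3-7). It emits one U+FFFD per
maximal subpart and resumes at the byte immediately after it.

```python
REPLACEMENT = "\ufffd"

def _trail_ranges(lead):
    """Allowed (lo, hi) range for each trailing byte after `lead`,
    or None if `lead` can never start a well-formed sequence."""
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [(0x80, 0xBF)] * 2
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]
    if lead == 0xF0:
        return [(0x90, 0xBF)] + [(0x80, 0xBF)] * 2
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF)] * 3
    if lead == 0xF4:
        return [(0x80, 0x8F)] + [(0x80, 0xBF)] * 2
    return None  # 80..C1 and F5..FF, including F8

def decode_maximal_subpart(data):
    """Decode bytes as UTF-8, replacing each maximal subpart of an
    ill-formed subsequence with a single U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:              # ASCII
            out.append(chr(lead))
            i += 1
            continue
        ranges = _trail_ranges(lead)
        if ranges is None:           # invalid lead: one error, skip one byte
            out.append(REPLACEMENT)
            i += 1
            continue
        j = i + 1
        for lo, hi in ranges:        # accept trailing bytes while they fit
            if j < n and lo <= data[j] <= hi:
                j += 1
            else:
                break
        if j - i == len(ranges) + 1: # complete, well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                        # truncated: bytes i..j-1 are one error
            out.append(REPLACEMENT)
        i = j                        # never consumes the offending byte
    return "".join(out)
```

With this sketch, F8 80 80 80 80 yields five U+FFFD, and <F0 80 80 41>
yields <U+FFFD, U+FFFD, U+FFFD, U+0041>, because 80 is not a permitted
second byte after F0 and so F0 alone is the maximal subpart. That is
one of the outcomes TUS allows, not the only conformant one.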

