Unfortunately, FSS-UTF in Unicode 1.1 IS NOT UTF-8. Most people refer to UTF-8 by looking at RFC 2279 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html
and RFC 2044 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2044.html
but where those two RFCs state the decoding process, they do not mention checking for the non-shortest form.

2279 does mention it. Near the end of section 2 you have:

        NOTE -- actual implementations of the decoding algorithm above
        should protect against decoding invalid sequences. For
        instance, a naive implementation may (wrongly) decode the
        invalid UTF-8 sequence C0 80 into the character U+0000, which
        may have security consequences and/or cause other problems. See
        the Security Considerations section below.

And the Security Considerations section explains why one should check that. It's only a NOTE in 2279, hence not a normative prescription, reflecting the state of Unicode back in 1998. It's being made a normative MUST in 2279bis.
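
As a rough illustration of the check that NOTE asks for, here is a minimal Python sketch of a decoder that rejects non-shortest forms such as C0 80. The function name `decode_shortest_form` and the table `MIN_FOR_LENGTH` are made up for this example. It handles only sequences of up to four bytes (RFC 2279 itself defines forms of up to six) and does not check surrogates or any upper limit.

```python
# Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
MIN_FOR_LENGTH = {1: 0x0000, 2: 0x0080, 3: 0x0800, 4: 0x10000}


def decode_shortest_form(data: bytes) -> list:
    """Decode UTF-8 bytes to a list of code points, rejecting non-shortest forms."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Work out the sequence length and the payload bits of the lead byte.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead & 0xE0 == 0xC0:
            length, cp = 2, lead & 0x1F
        elif lead & 0xF0 == 0xE0:
            length, cp = 3, lead & 0x0F
        elif lead & 0xF8 == 0xF0:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")

        trail = data[i + 1:i + length]
        if len(trail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in trail:
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte {b:#04x}")
            cp = (cp << 6) | (b & 0x3F)

        # The shortest-form check: the value must actually need this many bytes.
        if cp < MIN_FOR_LENGTH[length]:
            raise ValueError(f"non-shortest form at offset {i}")

        out.append(cp)
        i += length
    return out
```

With this sketch, `decode_shortest_form(b"\xc0\x80")` raises `ValueError` instead of quietly returning U+0000, while ordinary input such as `b"A\xc3\xa9"` decodes normally.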

Likewise, ever since the surrogate code point range was designated in
Unicode 2.0, it has been invalid (or at least nonsensical) to encode
values from U+D800 through U+DFFF directly in UTF-8.  
Again, RFC 2279 is the one people look at when they refer to UTF-8, and the decoding process stated there does not mention checking for the range that maps directly to D800-DFFF.

Unfortunately true, but that's being fixed in 2279bis.
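
For what it's worth, the surrogate check is cheap at the byte level. In otherwise well-formed UTF-8, the values U+D800 through U+DFFF can only appear as the lead byte ED followed by a byte in the range A0..BF. The sketch below assumes exactly that. The function name `find_encoded_surrogate` is invented for this example, and the function is meant to run on input that has already passed the other well-formedness checks.

```python
def find_encoded_surrogate(data: bytes) -> int:
    """Return the offset of the first sequence encoding U+D800..U+DFFF, or -1.

    Assumes the input is otherwise well-formed UTF-8, so that ED can only
    occur as the lead byte of a three-byte sequence.
    """
    for i in range(len(data) - 1):
        # ED 80..9F encodes U+D000..U+D7FF; ED A0..BF encodes U+D800..U+DFFF.
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return i
    return -1
```

A decoder could call this first and refuse the input whenever the result is not -1. Python's own strict `bytes.decode("utf-8")` already rejects such sequences.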

Well... that is another question. Is a UTF-8 sequence that represents U+FFFE or U+FFFF a legal UTF-8 sequence?

Markus Scherer already answered this one: it's valid UTF-8 representing non-characters that should not be exchanged across system boundaries. A UTF-8 decoder is not necessarily located at such a boundary.

(Just as you may have a valid Base64-encoded file that encodes an illegal GIF file: your Base64 is legal, fully conforms to the Base64 decoding logic, and can be decoded, but the decoded file is not a legal GIF file conforming to the GIF file specification.)

Pretty apt analogy.

Where is the boundary between legal UTF-8 and legal Unicode?

At "system boundaries", which non-characters may not cross.
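
One way to picture that separation is a decoder that accepts the non-characters and a separate check applied only where text leaves the system. The Python sketch below is illustrative only. The names `decode_internal` and `check_for_interchange` are hypothetical, and the set covers just the two code points discussed here, U+FFFE and U+FFFF.

```python
# Only the two non-characters discussed in this thread (an assumption of the sketch).
NONCHARACTERS = {0xFFFE, 0xFFFF}


def decode_internal(data: bytes) -> str:
    """Plain UTF-8 decoding: non-characters are valid UTF-8 and pass through."""
    return data.decode("utf-8")


def check_for_interchange(text: str) -> str:
    """Boundary check: refuse text containing non-characters before sending it out."""
    for index, ch in enumerate(text):
        if ord(ch) in NONCHARACTERS:
            raise ValueError(f"non-character U+{ord(ch):04X} at index {index}")
    return text
```

Here `decode_internal(b"\xef\xbf\xbe")` succeeds, since the bytes are valid UTF-8, while `check_for_interchange` on the result raises `ValueError`, since the text is not fit to cross a system boundary.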

François Yergeau
