[EMAIL PROTECTED]
wrote:
Unfortunately, FSS-UTF in Unicode 1.1 is not UTF-8. Most people refer to UTF-8 by looking at RFC 2279 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html
and RFC 2044 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2044.html
but in those two RFCs, the description of the decoding process does not mention checking for the non-shortest form.
RFC 2279 does mention it. Near the end of section 2 you have:
NOTE -- actual implementations of the decoding algorithm above
should protect against decoding invalid sequences. For
instance, a naive implementation may (wrongly) decode the
invalid UTF-8 sequence C0 80 into the character U+0000, which
may have security consequences and/or cause other problems. See
the Security Considerations section below.
And the Security Considerations section explains why one should check that.
It's only a NOTE in 2279, hence not a normative prescription, reflecting
the state of Unicode back in 1998. It's being made a normative MUST in
2279bis.
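
To make the check concrete, here is a minimal Python sketch of a decoder that rejects non-shortest forms such as C0 80. It is an illustration only, not text from either RFC: the name decode_one and the error messages are hypothetical, it handles only 1- to 4-byte forms, and it deliberately omits every other validity check.

```python
def decode_one(data: bytes, i: int = 0):
    """Decode one UTF-8 sequence at data[i]; return (code_point, next_index).

    Hypothetical helper for illustration: it enforces only the
    shortest-form rule, not the other checks a real decoder needs.
    """
    b0 = data[i]
    if b0 < 0x80:
        return b0, i + 1  # single-byte (ASCII) form
    # For each length: payload bits of the lead byte, and the smallest
    # code point that genuinely needs that many bytes.
    if 0xC0 <= b0 < 0xE0:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if i + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1:i + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid trail byte")
        cp = (cp << 6) | (b & 0x3F)
    # The step a naive decoder skips: a value that would have fit in a
    # shorter sequence was encoded in non-shortest form.
    if cp < minimum:
        raise ValueError("non-shortest form")
    return cp, i + length
```

With this sketch, decode_one(b"\xC0\x80") raises ValueError instead of returning U+0000, while decode_one(b"\x00") still returns (0, 1).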
Likewise, ever since the surrogate code point range was designated in Unicode 2.0, it has been invalid (or at least nonsensical) to encode values from U+D800 through U+DFFF directly in UTF-8.
Again, RFC 2279 is the one people look at when they refer to UTF-8, and the decoding process stated there does not mention checking for the range that maps directly to D800-DFFF.
Unfortunately true, but that's being fixed in 2279bis.
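
For illustration, the missing range check is small. The Python sketch below mechanically decodes a 3-byte sequence with no validation, the way a naive decoder would, and then tests the result against the surrogate range. Both function names are hypothetical and the sketch assumes the input is exactly three bytes long.

```python
def naive_decode3(seq: bytes) -> int:
    """Mechanically decode a 3-byte UTF-8 sequence with no validity checks."""
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range U+D800 through U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


# ED A0 80 decodes mechanically to U+D800, the first surrogate code
# point, so a checking decoder would reject it.
cp = naive_decode3(b"\xED\xA0\x80")
if is_surrogate(cp):
    print(f"U+{cp:04X} is a surrogate; the sequence should be rejected")
```

The neighbouring sequences ED 9F BF (U+D7FF) and EE 80 80 (U+E000) fall just outside the range and pass the check.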
Well... that is another question. Is a UTF-8 sequence that represents U+FFFE or U+FFFF a legal UTF-8 sequence?
Markus Scherer already answered this one: it's valid UTF-8 representing non-characters that should not be exchanged across system boundaries. A UTF-8 decoder is not necessarily located at such a boundary.
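
One way to picture this separation of concerns is the Python sketch below. It keeps the decoder and the boundary check apart: the decoder accepts EF BF BF as well-formed UTF-8, and a separate, hypothetical check_for_interchange function is what refuses to pass U+FFFE or U+FFFF across a boundary. The set covers only the two code points discussed in this thread, and the function names are assumptions made for illustration.

```python
# Only the two non-characters discussed in this thread (an assumption
# of this sketch; a real boundary check may need a larger set).
NONCHARACTERS = {0xFFFE, 0xFFFF}


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 using Python's built-in codec.

    Raises UnicodeDecodeError on ill-formed input, but accepts
    well-formed sequences that encode non-characters.
    """
    return data.decode("utf-8")


def check_for_interchange(text: str) -> str:
    """Hypothetical system-boundary check: refuse text with non-characters."""
    for ch in text:
        if ord(ch) in NONCHARACTERS:
            raise ValueError(f"non-character U+{ord(ch):04X} may not cross the boundary")
    return text


internal = decode_utf8(b"\xEF\xBF\xBF")  # well-formed UTF-8 for U+FFFF
try:
    check_for_interchange(internal)
except ValueError as err:
    print(err)
```

Decoding succeeds because the byte sequence is well-formed; only the boundary check objects.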
(Just as you may have a valid Base64-encoded file that encodes an illegal GIF file: your Base64 is legal, fully conforms to the Base64 decoding logic, and can be decoded, but the decoded file is not a legal GIF file conforming to the GIF file specification.)
Pretty apt analogy.
Where is the boundary between legal UTF-8 and legal Unicode?
At "system boundaries", which non-characters may not cross.
--
François Yergeau

