In a message dated 2001-09-17 16:24:05 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

> It doesn't reopen that specific type of security hole, because irregular
> UTF-8 sequences (as defined by Unicode 3.1) can only decode to characters
> above 0xFFFF, and those characters are unlikely to be "special" for any
> application protocol. However, I entirely agree that it's desirable that
> UTF-8 should only allow shortest form; 6-byte surrogate encodings have
> always been incorrect.

All Unicode code points of the form U+xxFFFE and U+xxFFFF are special, in 
that they are noncharacters and can be treated in a special way by 
applications (e.g. as sentinels).

I don't agree that irregular UTF-8 sequences in general can only decode to 
characters above 0xFFFF.  For example, the following irregular (overlong, 
non-shortest-form) UTF-8 sequences all decode to U+0000:

C0 80
E0 80 80
F0 80 80 80
F8 80 80 80 80
FC 80 80 80 80 80
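
To illustrate, here is a hypothetical Python sketch of the kind of lenient 
decoder that creates the problem.  The function name is made up, and the 
shortest-form check that a conforming decoder must perform is deliberately 
omitted:

def naive_utf8_decode_first(b: bytes) -> int:
    """Decode the first code point from b, accepting overlong forms.

    A conforming decoder must instead reject every sequence below
    (and any other non-shortest form) as ill-formed.
    """
    first = b[0]
    if first < 0x80:
        return first                    # single-byte (ASCII) case
    length = 0                          # count leading 1-bits = sequence length
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    cp = first & (mask - 1)             # payload bits of the lead byte
    for byte in b[1:length]:
        cp = (cp << 6) | (byte & 0x3F)  # append 6 payload bits per trail byte
    return cp

for seq in (b'\xc0\x80', b'\xe0\x80\x80', b'\xf0\x80\x80\x80',
            b'\xf8\x80\x80\x80\x80', b'\xfc\x80\x80\x80\x80\x80'):
    print(seq.hex(' '), '->', 'U+%04X' % naive_utf8_decode_first(seq))

Every line of output is U+0000, which is exactly the kind of aliasing that 
the shortest-form requirement exists to prevent.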

It is true that the *specific* irregular UTF-8 sequences introduced (and 
required) by CESU-8 decode to characters above 0xFFFF when interpreted as 
CESU-8, and to pairs of surrogate code points when (incorrectly) interpreted 
as UTF-8.  Since definition D29, arguably my least favorite part of Unicode, 
requires that all UTFs (including UTF-8) be able to represent unpaired 
surrogates, the character count for the same chunk of data could be different 
depending on whether it is interpreted as CESU-8 or UTF-8.  That's a 
potential security hole.
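
Again a sketch, in Python: the cesu8_decode helper below is hypothetical, 
and the 'surrogatepass' error handler is used only to emulate a lenient 
UTF-8 decoder that lets surrogate code points through.

# CESU-8 encoding of U+10000 (UTF-16 surrogate pair D800 DC00,
# each surrogate encoded as a 3-byte sequence)
data = b'\xed\xa0\x80\xed\xb0\x80'

# Interpreted as lenient "UTF-8": two code points, U+D800 and U+DC00
as_utf8 = data.decode('utf-8', 'surrogatepass')
print(len(as_utf8))                     # 2

def cesu8_decode(b: bytes) -> str:
    """Decode CESU-8 by recombining surrogate pairs into supplementary characters."""
    s = b.decode('utf-8', 'surrogatepass')
    out, i = [], 0
    while i < len(s):
        c = ord(s[i])
        if 0xD800 <= c <= 0xDBFF and i + 1 < len(s) and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF:
            out.append(chr(0x10000 + ((c - 0xD800) << 10) + (ord(s[i + 1]) - 0xDC00)))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

# Interpreted as CESU-8: one code point, U+10000
print(len(cesu8_decode(data)))          # 1

Same six bytes, two different character counts, depending on which label the 
receiving software believes.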

CESU-8 decoders that are really diligent could check for this, of course, but 
when I think of CESU-8 the concept of "really diligent decoders" just doesn't 
spring to mind.  If the inventors were really diligent, they would have 
implemented UTF-16 sorting correctly in the first place.

-Doug Ewell
 Fullerton, California
