Two points in response to the questions:
1. The XML spec has just been amended by an erratum to clarify
that an irregular UTF-8 sequence must generate a fatal error.
2. It has been agreed that the Unicode Standard will be modified
to ban irregular UTF-8 sequences for all characters.
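
For anyone wanting to see the effect, here is a rough sketch in Python (my choice of language, not something from the spec). It assumes "irregular" means a supplementary character encoded as two three-byte sequences, one per UTF-16 surrogate. The function names are hypothetical and stand in for the "process A" and "process B" of the quoted scenario.

```python
# Irregular UTF-8 for U+10400 (DESERET CAPITAL LETTER LONG I): each UTF-16
# surrogate (D801, DC00) is encoded as its own three-byte sequence.
IRREGULAR = b"\xed\xa0\x81\xed\xb0\x80"
REGULAR = "\U00010400".encode("utf-8")  # the four-byte form, F0 90 90 80


def naive_has_non_bmp(data: bytes) -> bool:
    """Hypothetical process A check: looks only for four-byte lead bytes."""
    return any(0xF0 <= b <= 0xF4 for b in data)


def lenient_decode(data: bytes) -> str:
    """Hypothetical process B: accepts encoded surrogates and pairs them up."""
    halves = data.decode("utf-8", errors="surrogatepass")
    # Round-tripping through UTF-16 joins the two halves into one character.
    return halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


def strict_decode(data: bytes) -> str:
    """Rejects irregular sequences; Python's default UTF-8 codec does this."""
    return data.decode("utf-8")  # raises UnicodeDecodeError on encoded surrogates
```

With these definitions, naive_has_non_bmp(IRREGULAR) is False while lenient_decode(IRREGULAR) still yields U+10400, which is the gap described in the quoted message. strict_decode(IRREGULAR) raises an error instead, which is the behaviour both points above call for.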
Misha
On 28/10/2001 12:52:37 Bernard Miller wrote:
> The question raised earlier by David Hollingsworth did
> not seem to get any responses from this list. I've
> pasted the text of the email below. I would also like
> clarification on why the UTF-8 in Unicode 3.1 only
> forbids conformant implementations from interpreting
> non-shortest forms for BMP characters, and does not
> forbid interpretation of all irregular sequences for
> all characters.
>
> ___
> Date: 5 Oct 2001 18:23:58 -0000
> From: "David E. Hollingsworth" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: Handling irregular sequences
>
> The definition of UTF-32 (and the modifications to UTF-8 for Unicode
> 3.1) make it clear that conformant processes shall not generate
> irregular sequences. However, they do not (and perhaps they
> shouldn't) indicate what a process should do when encountering an
> irregular sequence, and I'm curious what people are doing in practice.
>
> One could apply the traditional Internet aphorism of being liberal in
> what one accepts, but that didn't pan out so well for
> non-shortest-form UTF-8, so in addition to wondering what people are
> doing in practice, I'm also curious about the following theoretical
> issue:
>
> It doesn't seem very likely to me that someone would write a security
> check that depends on, say, passing Deseret code points but blocking
> musical notation code points; however, I wouldn't say it's impossible;
> moreover, a security check that wants to disallow all non-BMP
> characters doesn't seem quite so outlandish. If someone did write
> such a check, it seems to me that the attack described in UAX #27
> would apply, by substituting "irregular sequence" for "non-shortest
> form":
>
> Process A performs security checks, but does not check for irregular
> sequences.
>
> Process B accepts the byte sequence from process A, and transforms
> it into UTF-16 while interpreting irregular sequences.
>
> The UTF-16 text may then contain characters that should have been
> filtered out by process A.
>
>
> Even if I'm mistaken about this, is there a specific argument *for*
> accepting irregular sequences?
>
> --deh!
>
> ___
>
> Bernard