RE: Handling irregular sequences

Misha . Wolf Sun, 28 Oct 2001 10:42:14 -0800


Two points in response to the questions:


1.  The XML spec has just been amended by an erratum to clarify
    that an irregular UTF-8 sequence must generate a fatal error.

2.  It has been agreed that the Unicode Standard will be modified
    to ban irregular UTF-8 sequences for all characters.
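
To make the concern concrete, here is a minimal Python sketch. It is not taken from the XML erratum or the Unicode Standard. The function names are hypothetical, and it assumes "irregular sequence" means a supplementary character encoded as two three-byte sequences, one per surrogate code unit. It shows a byte-level non-BMP filter being bypassed by a lenient decoder, and a check that would reject such input.

```python
def blocks_non_bmp(data: bytes) -> bool:
    """Hypothetical process A: flag input containing a 4-byte UTF-8 lead byte."""
    return any(0xF0 <= b <= 0xF7 for b in data)


def lenient_decode(data: bytes) -> str:
    """Hypothetical process B: interprets irregular sequences.

    'surrogatepass' lets encoded surrogates through; the round trip via
    UTF-16 then joins each high/low pair into one supplementary character.
    """
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def has_irregular_sequence(data: bytes) -> bool:
    """Detect a 3-byte sequence that encodes a surrogate (U+D800..U+DFFF).

    0xED is only ever a lead byte, and in well-formed UTF-8 it must be
    followed by 0x80..0x9F; a second byte of 0xA0..0xBF means a surrogate.
    """
    return any(b == 0xED and 0xA0 <= nxt <= 0xBF
               for b, nxt in zip(data, data[1:]))


# U+10400 (a Deseret letter) written as the surrogate pair D801 DC00,
# each half encoded as its own three-byte sequence.
irregular = b"\xED\xA0\x81\xED\xB0\x80"

if __name__ == "__main__":
    print(blocks_non_bmp(irregular))           # process A sees nothing to block
    print(hex(ord(lenient_decode(irregular)))) # process B yields a non-BMP char
    print(has_irregular_sequence(irregular))   # the stricter check catches it
```

Python's default strict UTF-8 codec already refuses such input, so `irregular.decode("utf-8")` raises `UnicodeDecodeError`; the lenient path above has to be requested explicitly.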

Misha


On 28/10/2001 12:52:37 Bernard Miller wrote:
> The question raised earlier by David Hollingsworth did
> not seem to get any responses from this list. I've
> pasted the text of the email below. I would also like
> clarification on why the UTF-8 in Unicode 3.1 only
> forbids conformant implementations from interpreting
> non-shortest forms for BMP characters, and does not
> forbid interpretation of all irregular sequences for
> all characters.
>
> ___
> Date: 5 Oct 2001 18:23:58 -0000
> From: "David E. Hollingsworth" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: Handling irregular sequences
>
> The definition of UTF-32 (and the modifications to UTF-8 for Unicode
> 3.1) make it clear that conformant processes shall not generate
> irregular sequences.  However, they do not (and perhaps they
> shouldn't) indicate what a process should do when encountering an
> irregular sequence, and I'm curious what people are doing in practice.
>
> One could apply the traditional Internet aphorism of being liberal in
> what one accepts, but that didn't pan out so well for
> non-shortest-form UTF-8, so in addition to wondering what people are
> doing in practice, I'm also curious about the following theoretical
> issue:
>
> It doesn't seem very likely to me that someone would write a security
> check that depends on, say, passing Deseret code points but blocking
> musical notation code points; however, I wouldn't say it's impossible;
> moreover, a security check that wants to disallow all non-BMP
> characters doesn't seem quite so outlandish.  If someone did write
> such a check, it seems to me that the attack described in UAX #27
> would apply, by substituting "irregular sequence" for "non-shortest
> form":
>
>   Process A performs security checks, but does not check for irregular
>   sequences.
>
>   Process B accepts the byte sequence from process A, and transforms
>   it into UTF-16 while interpreting irregular sequences.
>
>   The UTF-16 text may then contain characters that should have been
>   filtered out by process A.
>
>
> Even if I'm mistaken about this, is there a specific argument *for*
> accepting irregular sequences?
>
>   --deh!
>
> ___
>
> Bernard



