On 6/1/2017 11:53 AM, Shawn Steele wrote:
But those are IETF definitions. They don’t have to mean the same
thing in Unicode - except that people working in this field probably
expect them to.
That's the thing. And even if Unicode had its own version of RFC 2119,
one would consider it recommended for Unicode to follow widespread
industry practice (there's that "r" word again!).
A./
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag via Unicode
Sent: Thursday, June 1, 2017 11:44 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:
I think that the (or a) key problem is that the current "best practice" is treated as
"SHOULD" in RFC parlance. When what this really needs is a "MAY".
People reading standards tend to treat "SHOULD" and "MUST" as the same
thing.
It's not that they "tend to", it's in RFC 2119:
SHOULD
This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.
The clear inference is that while the non-recommended practice is not
prohibited, you better have some valid reason why you are deviating
from it (and, reading between the lines, it would not hurt if you
documented those reasons).
So, when an implementation deviates, then you get bugs (as we see here). Given
that there are very valid engineering reasons why someone might want to choose a
different behavior for their needs - without harming the intent of the standard at all in
most cases - I think the current/proposed language is too "strong".
Yes and no. ICU would be perfectly fine deviating from the existing
recommendation and stating their engineering reasons for doing so.
That would allow them to close their bug ("by documentation").
What's not OK is to take an existing recommendation and change it to
something else, just to make bug reports go away for one
implementation. That's like two sleepers fighting over a blanket
that's too short. Whenever one is covered, the other is exposed.
If it is discovered that the existing recommendation is not based on
anything like truly better behavior, there may be a case to change it
to something that's equivalent to a MAY. Perhaps a list of nearly
equally capable options.
(If that language is not in the standard already, a strong "an
implementation MUST NOT depend on the use of a particular strategy for
replacement of invalid code sequences" clearly ought to be added.)
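To make concrete what such a MUST NOT would rule out, here is a rough
sketch (in Python; the byte values are invented for the example, and the
standard "replace" error handler merely stands in for whatever decoder one
happens to use): code that relies on the exact number of U+FFFDs is
depending on a particular replacement strategy, while code that only
checks that a replacement happened is not.

    # Illustration only: the byte values are made up for this example.
    ill_formed = b"ab" + bytes([0xF0, 0x80, 0x80]) + b"cd"

    decoded = ill_formed.decode("utf-8", errors="replace")

    # Fragile: depends on one particular replacement strategy.  A decoder
    # that emits one U+FFFD per bad byte produces three here, a decoder
    # that trusts the lead byte's declared length produces only one, and
    # both are conformant.
    # assert decoded.count("\uFFFD") == 3   # do not rely on this

    # Robust: assumes only that ill-formed input was replaced rather than
    # silently dropped, and that the well-formed parts survive.
    assert "\uFFFD" in decoded
    assert decoded.startswith("ab") and decoded.endswith("cd")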
A./
-Shawn
-----Original Message-----
From: Alastair Houghton [mailto:alast...@alastairs-place.net]
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen <hsivo...@hsivonen.fi>
Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele <shawn.ste...@microsoft.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when
decoding ill-formed UTF-8
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode <unicode@unicode.org> wrote:
* As far as I can tell, there are two (maybe three) sane approaches
to this problem:
* Either a "maximal" emission of one U+FFFD for every byte
that exists outside of a good sequence
* Or a "minimal" version that presumes the lead byte was
counting trail bytes correctly even if the resulting sequence was invalid. In that case
just use one U+FFFD.
* And (maybe, I haven't heard folks arguing for this one)
emit one U+FFFD at the first garbage byte and then ignore the input until valid
data starts showing up again. (So you could have 1 U+FFFD for a string of a
hundred garbage bytes as long as there weren't any valid sequences within that
group).
I think it's not useful to come up with new rules in the abstract.
The first two aren’t “new” rules; they’re, respectively, the current “Best
Practice” and the proposed “Best Practice”. The third is another potentially
reasonable approach that might make sense, e.g. if the problem you’re worrying
about is serial data slip or corruption of a compressed or encrypted file
(where corruption will persist until re-synchronisation happens, and as a
result you wouldn’t expect to have any knowledge whatever of the number of
characters represented in the data in question).
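For concreteness, here is a rough sketch of how the three approaches can
diverge on the same ill-formed input. The toy decoder, the strategy names and
the byte values below are purely illustrative and deliberately simplified;
they aren’t taken from the standard or from any particular library:

    # Sketch only: a deliberately small decoder that applies the three
    # replacement strategies described above.

    def well_formed_length(data: bytes, i: int) -> int:
        """Length of the well-formed UTF-8 sequence starting at i, or 0."""
        def trail(j, lo=0x80, hi=0xBF):
            return j < len(data) and lo <= data[j] <= hi
        b = data[i]
        if b <= 0x7F:
            return 1
        if 0xC2 <= b <= 0xDF:
            return 2 if trail(i + 1) else 0
        if b == 0xE0:
            return 3 if trail(i + 1, 0xA0, 0xBF) and trail(i + 2) else 0
        if 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            return 3 if trail(i + 1) and trail(i + 2) else 0
        if b == 0xED:
            return 3 if trail(i + 1, 0x80, 0x9F) and trail(i + 2) else 0
        if b == 0xF0:
            return 4 if trail(i + 1, 0x90, 0xBF) and trail(i + 2) and trail(i + 3) else 0
        if 0xF1 <= b <= 0xF3:
            return 4 if trail(i + 1) and trail(i + 2) and trail(i + 3) else 0
        if b == 0xF4:
            return 4 if trail(i + 1, 0x80, 0x8F) and trail(i + 2) and trail(i + 3) else 0
        return 0

    def decode(data: bytes, strategy: str) -> str:
        out, i = [], 0
        while i < len(data):
            n = well_formed_length(data, i)
            if n:
                out.append(data[i:i + n].decode("utf-8"))
                i += n
            elif strategy == "per_byte":
                # Approach 1: one U+FFFD for every byte outside a good sequence.
                out.append("\uFFFD")
                i += 1
            elif strategy == "per_lead":
                # Approach 2: trust the lead byte's declared trail count,
                # consume that many trail bytes (if present), emit one U+FFFD.
                b = data[i]
                declared = (1 if 0xC0 <= b <= 0xDF else
                            2 if 0xE0 <= b <= 0xEF else
                            3 if 0xF0 <= b <= 0xF7 else 0)
                i += 1
                while declared and i < len(data) and 0x80 <= data[i] <= 0xBF:
                    i += 1
                    declared -= 1
                out.append("\uFFFD")
            else:
                # Approach 3: one U+FFFD per run of garbage, then resume at
                # the next well-formed sequence.
                out.append("\uFFFD")
                i += 1
                while i < len(data) and not well_formed_length(data, i):
                    i += 1
        return "".join(out)

    # 'a', an overlong F0 80 80 80, a stray FF, then 'b'.
    sample = b"a" + bytes([0xF0, 0x80, 0x80, 0x80, 0xFF]) + b"b"
    for strategy in ("per_byte", "per_lead", "per_run"):
        print(strategy, repr(decode(sample, strategy)))

On that input the three strategies yield five, two and one U+FFFD
respectively, and all three results are permissible today.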
All of these approaches are explicitly allowed by the standard at present.
All three are reasonable, and each has its own pros and cons in a technical
sense (leaving aside how prevalent the approach in question might be). In a
general-purpose library I’d probably go for the second one; if I knew I was
dealing with a potentially corrupt compressed or encrypted stream, I might well
plump for the third. I can even *imagine* there being circumstances under
which I might choose the first for some reason, in spite of my preference for
the second approach.
I don’t think it makes sense to standardise on *one* of these approaches,
so if what you’re saying is that the “Best Practice” has been treated as if it
were part of the specification (and I think that *is* essentially your claim),
then I’m in favour of either removing it completely, or (better) replacing it
with Shawn’s suggestion - i.e. listing three reasonable approaches and telling
developers to document which they take and why.
Kind regards,
Alastair.
--
http://alastairs-place.net