But those are IETF definitions.  They don’t have to mean the same thing in 
Unicode - except that people working in this field probably expect them to.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
via Unicode
Sent: Thursday, June 1, 2017 11:44 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:

I think that the (or a) key problem is that the current "best practice" is 
treated as "SHOULD" in RFC parlance, when what it really needs is a "MAY".



People reading standards tend to treat "SHOULD" and "MUST" as the same thing.

It's not that they "tend to", it's in RFC 2119:

   SHOULD   This word, or the adjective "RECOMMENDED", mean that there
      may exist valid reasons in particular circumstances to ignore a
      particular item, but the full implications must be understood and
      carefully weighed before choosing a different course.


The clear inference is that while the non-recommended practice is not 
prohibited, you better have some valid reason why you are deviating from it 
(and, reading between the lines, it would not hurt if you documented those 
reasons).



So, when an implementation deviates, you get bugs (as we see here).  Given 
that there are very valid engineering reasons why someone might want to choose 
a different behavior for their needs - without harming the intent of the 
standard at all in most cases - I think the current/proposed language is too 
"strong".

Yes and no. ICU would be perfectly fine deviating from the existing 
recommendation and stating their engineering reasons for doing so. That would 
allow them to close their bug ("by documentation").

What's not OK is to take an existing recommendation and change it to something 
else, just to make bug reports go away for one implementation. That's like two 
sleepers fighting over a blanket that's too short. Whenever one is covered, the 
other is exposed.

If it is discovered that the existing recommendation is not based on anything 
like truly better behavior, there may be a case to change it to something 
that's equivalent to a MAY. Perhaps a list of nearly equally capable options.

(If that language is not in the standard already, a strong "an implementation 
MUST NOT depend on the use of a particular strategy for replacement of invalid 
code sequences" clearly ought to be added.)

A./

-Shawn



-----Original Message-----
From: Alastair Houghton [mailto:alast...@alastairs-place.net]
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen <hsivo...@hsivonen.fi>
Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele <shawn.ste...@microsoft.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote:

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode <unicode@unicode.org> wrote:

* As far as I can tell, there are two (maybe three) sane approaches to this 
problem (see the sketch after this list):

  * Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence

  * Or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid.  In that case just 
use one U+FFFD.

  * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at 
the first garbage byte and then ignore the input until valid data starts 
showing up again.  (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that group.)
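
[Purely as an illustration of how those three policies differ - this is a toy 
sketch, not ICU's code and not anything defined by the standard; the names 
valid_prefix_len and decode_with_policy are invented here, and the 
well-formedness check is deliberately simplified.]

# Toy decoder showing the three replacement policies described above.
# valid_prefix_len() is a simplified well-formedness check; it is not a
# substitute for a real UTF-8 decoder.

def valid_prefix_len(data, i):
    """Length of a well-formed UTF-8 sequence starting at data[i], or 0."""
    b0 = data[i]
    if b0 < 0x80:
        return 1
    # (lead byte range, trail count, allowed range for the first trail byte)
    table = [
        (0xC2, 0xDF, 1, 0x80, 0xBF),
        (0xE0, 0xE0, 2, 0xA0, 0xBF),
        (0xE1, 0xEC, 2, 0x80, 0xBF),
        (0xED, 0xED, 2, 0x80, 0x9F),
        (0xEE, 0xEF, 2, 0x80, 0xBF),
        (0xF0, 0xF0, 3, 0x90, 0xBF),
        (0xF1, 0xF3, 3, 0x80, 0xBF),
        (0xF4, 0xF4, 3, 0x80, 0x8F),
    ]
    for lo, hi, trail, t1lo, t1hi in table:
        if lo <= b0 <= hi:
            if i + trail >= len(data):
                return 0                      # truncated sequence
            if not t1lo <= data[i + 1] <= t1hi:
                return 0
            if any(not 0x80 <= data[i + k] <= 0xBF for k in range(2, trail + 1)):
                return 0
            return trail + 1
    return 0                                  # invalid lead byte

def decode_with_policy(data, policy):
    """policy is 'maximal', 'minimal', or 'resync' as described above."""
    out, i, in_garbage = [], 0, False
    while i < len(data):
        n = valid_prefix_len(data, i)
        if n:
            out.append(bytes(data[i:i + n]).decode('utf-8'))
            i += n
            in_garbage = False
        elif policy == 'maximal':             # one U+FFFD per bad byte
            out.append('\uFFFD')
            i += 1
        elif policy == 'minimal':             # one U+FFFD per claimed sequence
            out.append('\uFFFD')
            b0 = data[i]
            claimed = 1 if b0 < 0xC0 else 2 if b0 < 0xE0 else 3 if b0 < 0xF0 else 4
            i += 1
            # skip at most the trail bytes the lead byte claimed, but stop
            # at the first byte that is not a continuation byte
            while claimed > 1 and i < len(data) and 0x80 <= data[i] <= 0xBF:
                i += 1
                claimed -= 1
        else:                                 # 'resync': one U+FFFD per garbage run
            if not in_garbage:
                out.append('\uFFFD')
                in_garbage = True
            i += 1
    return ''.join(out)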



I think it's not useful to come up with new rules in the abstract.



The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice” and the proposed “Best Practice”. The third is another potentially 
reasonable approach that might make sense e.g. if the problem you’re worrying 
about is serial data slip or corruption of a compressed or encrypted file 
(where corruption will occur until re-synchronisation happens, and as a result 
you wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).



All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream, I might well plump 
for the third.  I can even *imagine* there being circumstances under which I 
might choose the first for some reason, in spite of my preference for the 
second approach.
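
[For concreteness, here is how the toy decode_with_policy sketch above - again 
an illustration only, not any shipping decoder - treats one arbitrarily chosen 
ill-formed input under each policy; the byte values are made up for this 
example.]

# Two ill-formed starts (F0 80 twice; 0x80 is not a legal first trail byte
# after F0), followed by ASCII 'A'.
sample = bytes([0xF0, 0x80, 0xF0, 0x80, 0x41])

print(decode_with_policy(sample, 'maximal'))  # four U+FFFDs, then 'A'
print(decode_with_policy(sample, 'minimal'))  # two U+FFFDs, then 'A'
print(decode_with_policy(sample, 'resync'))   # one U+FFFD, then 'A'

[Which result is "right" is exactly the judgement call being discussed here; 
the point is only that all three policies are easy to implement and produce 
different output for the same input.]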



I don’t think it makes sense to standardise on *one* of these approaches, so if 
what you’re saying is that the “Best Practice” has been treated as if it was 
part of the specification (and I think that *is* essentially your claim), then 
I’m in favour of either removing it completely, or (better) replacing it with 
Shawn’s suggestion - i.e. listing three reasonable approaches and telling 
developers to document which they take and why.



Kind regards,



Alastair.



--

http://alastairs-place.net