On 6/1/2017 11:53 AM, Shawn Steele wrote:
But those are IETF definitions. They don’t have to mean the same
thing in Unicode - except that people working in this field probably
expect them to.
That's the thing. And even if Unicode had its own version of RFC 2119,
one would consider it recommended for Unicode to follow widespread
industry practice (there's that "r" word again!).
A./
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag via Unicode
Sent: Thursday, June 1, 2017 11:44 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:
I think that the (or a) key problem is that the current "best practice" is treated as
"SHOULD" in RFC parlance. When what this really needs is a "MAY".
People reading standards tend to treat "SHOULD" and "MUST" as the same
thing.
It's not that they "tend to", it's in RFC 2119:
SHOULD
This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.
The clear inference is that while the non-recommended practice is not
prohibited, you better have some valid reason why you are deviating
from it (and, reading between the lines, it would not hurt if you
documented those reasons).
So, when an implementation deviates, then you get bugs (as we see here). Given
that there are very valid engineering reasons why someone might want to choose a
different behavior for their needs - without harming the intent of the standard at all in
most cases - I think the current/proposed language is too "strong".
Yes and no. ICU would be perfectly fine deviating from the existing
recommendation and stating their engineering reasons for doing so.
That would allow them to close their bug ("by documentation").
What's not OK is to take an existing recommendation and change it to
something else, just to make bug reports go away for one
implementation. That's like two sleepers fighting over a blanket
that's too short. Whenever one is covered, the other is exposed.
If it is discovered that the existing recommendation is not based on
anything like truly better behavior, there may be a case to change it
to something that's equivalent to a MAY. Perhaps a list of nearly
equally capable options.
(If that language is not in the standard already, a strong "an
implementation MUST NOT depend on the use of a particular strategy for
replacement of invalid code sequences" clearly ought to be added.)
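To make concrete what such a MUST NOT would rule out, here is a rough
sketch (in Python; the byte values are invented for the example, and the
standard "replace" error handler merely stands in for whatever decoder one
happens to use): code that relies on the exact number of U+FFFDs is
depending on a particular replacement strategy, while code that only
checks that a replacement happened is not.

    # Illustration only: the byte values are made up for this example.
    ill_formed = b"ab" + bytes([0xF0, 0x80, 0x80]) + b"cd"

    decoded = ill_formed.decode("utf-8", errors="replace")

    # Fragile: depends on one particular replacement strategy.  A decoder
    # that emits one U+FFFD per bad byte produces three here, a decoder
    # that trusts the lead byte's declared length produces only one, and
    # both are conformant.
    # assert decoded.count("\uFFFD") == 3   # do not rely on this

    # Robust: assumes only that ill-formed input was replaced rather than
    # silently dropped, and that the well-formed parts survive.
    assert "\uFFFD" in decoded
    assert decoded.startswith("ab") and decoded.endswith("cd")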
A./
-Shawn
-----Original Message-----
From: Alastair Houghton [mailto:alast...@alastairs-place.net]
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen <hsivo...@hsivonen.fi>
Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele <shawn.ste...@microsoft.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when
decoding ill-formed UTF-8
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode <unicode@unicode.org> wrote:
* As far as I can tell, there are two (maybe three) sane approaches
to this problem:
* Either a "maximal" emission of one U+FFFD for every byte
that exists outside of a good sequence
* Or a "minimal" version that presumes the lead byte was
counting trail bytes correctly even if the resulting sequence was invalid. In that case
just use one U+FFFD.
* And (maybe, I haven't heard folks arguing for this one)
emit one U+FFFD at the first garbage byte and then ignore the input until valid
data starts showing up again. (So you could have 1 U+FFFD for a string of a
hundred garbage bytes as long as there weren't any valid sequences within that
group).
I think it's not useful to come up with new rules in the abstract.
The first two aren’t “new” rules; they’re, respectively, the current “Best
Practice” and the proposed “Best Practice”. The third is another potentially
reasonable approach that might make sense, e.g. if the problem you’re worrying
about is serial data slip or corruption of a compressed or encrypted file
(where corruption will persist until re-synchronisation happens, and as a
result you wouldn’t expect to have any knowledge whatever of the number of
characters represented in the data in question).
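For concreteness, here is a rough sketch of how the three approaches can
diverge on the same ill-formed input. The toy decoder, the strategy names and
the byte values below are purely illustrative and deliberately simplified;
they aren’t taken from the standard or from any particular library:

    # Sketch only: a deliberately small decoder that applies the three
    # replacement strategies described above.

    def well_formed_length(data: bytes, i: int) -> int:
        """Length of the well-formed UTF-8 sequence starting at i, or 0."""
        def trail(j, lo=0x80, hi=0xBF):
            return j < len(data) and lo <= data[j] <= hi
        b = data[i]
        if b <= 0x7F:
            return 1
        if 0xC2 <= b <= 0xDF:
            return 2 if trail(i + 1) else 0
        if b == 0xE0:
            return 3 if trail(i + 1, 0xA0, 0xBF) and trail(i + 2) else 0
        if 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            return 3 if trail(i + 1) and trail(i + 2) else 0
        if b == 0xED:
            return 3 if trail(i + 1, 0x80, 0x9F) and trail(i + 2) else 0
        if b == 0xF0:
            return 4 if trail(i + 1, 0x90, 0xBF) and trail(i + 2) and trail(i + 3) else 0
        if 0xF1 <= b <= 0xF3:
            return 4 if trail(i + 1) and trail(i + 2) and trail(i + 3) else 0
        if b == 0xF4:
            return 4 if trail(i + 1, 0x80, 0x8F) and trail(i + 2) and trail(i + 3) else 0
        return 0

    def decode(data: bytes, strategy: str) -> str:
        out, i = [], 0
        while i < len(data):
            n = well_formed_length(data, i)
            if n:
                out.append(data[i:i + n].decode("utf-8"))
                i += n
            elif strategy == "per_byte":
                # Approach 1: one U+FFFD for every byte outside a good sequence.
                out.append("\uFFFD")
                i += 1
            elif strategy == "per_lead":
                # Approach 2: trust the lead byte's declared trail count,
                # consume that many trail bytes (if present), emit one U+FFFD.
                b = data[i]
                declared = (1 if 0xC0 <= b <= 0xDF else
                            2 if 0xE0 <= b <= 0xEF else
                            3 if 0xF0 <= b <= 0xF7 else 0)
                i += 1
                while declared and i < len(data) and 0x80 <= data[i] <= 0xBF:
                    i += 1
                    declared -= 1
                out.append("\uFFFD")
            else:
                # Approach 3: one U+FFFD per run of garbage, then resume at
                # the next well-formed sequence.
                out.append("\uFFFD")
                i += 1
                while i < len(data) and not well_formed_length(data, i):
                    i += 1
        return "".join(out)

    # 'a', an overlong F0 80 80 80, a stray FF, then 'b'.
    sample = b"a" + bytes([0xF0, 0x80, 0x80, 0x80, 0xFF]) + b"b"
    for strategy in ("per_byte", "per_lead", "per_run"):
        print(strategy, repr(decode(sample, strategy)))

On that input the three strategies yield five, two and one U+FFFD
respectively, and all three results are permissible today.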
All of these approaches are explicitly allowed by the standard at present.
All three are reasonable, and each has its own pros and cons in a technical
sense (leaving aside how prevalent the approach in question might be). In a
general-purpose library I’d probably go for the second one; if I knew I was
dealing with a potentially corrupt compressed or encrypted stream, I might well
plump for the third. I can even *imagine* there being circumstances under
which I might choose the first for some reason, in spite of my preference for
the second approach.
I don’t think it makes sense to standardise on *one* of these approaches,
so if what you’re saying is that the “Best Practice” has been treated as if it
were part of the specification (and I think that *is* essentially your claim),
then I’m in favour of either removing it completely, or (better) replacing it
with Shawn’s suggestion - i.e. listing three reasonable approaches and telling
developers to document which they take and why.
Kind regards,
Alastair.
--
http://alastairs-place.net