Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote:

TUS Section 3 is like the Augean Stables.  It is a complete mess as a
standards document,


That is a matter of editorial taste, I suppose.


imputing mental states to computing processes.


That, however, is false. The rhetorical turn in the Unicode Standard's 
conformance clauses, "A process shall interpret..." and "A process shall 
not interpret..." has been in the standard for 21 years, and seems to 
have done its general job in guiding interoperable, conformant 
implementations fairly well. And everyone -- well, perhaps almost 
everyone -- has been able to figure out that such wording is a shorthand 
for something along the lines of "Any person implementing software 
conforming to the Unicode Standard in which a process does X shall 
implement it in such a way that that process when doing X shall follow 
the specification part Y, relevant to doing X, exactly according to that 
specification of Y...", rather than a misguided assumption that software 
processes are cognitive agents equipped with mental states that the 
standard can "tell what to think".


And I contend that the shorthand works just fine.



Table 3-7, for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
forms'.


Well, Definition D92 does already explicitly limit UTF-8 to Unicode 
scalar values, and explicitly limits the form to sequences of one to 
four bytes. The reason it doesn't explicitly include the exclusion 
of "non-shortest form" in the definition, but instead refers to Table 
3-7 for the well-formed sequences (which, btw, explicitly rule out all 
the non-shortest forms), is that this would create another 
terminological conundrum -- trying to specify an air-tight definition of 
"non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is 
terminologically cleaner to let people *derive* non-shortest form from 
the explicit exclusions of Table 3-7.



Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.


Actually 0xFC fails quite simply and unambiguously, because it is not in 
Table 3-7. End of story.


Same for 0xFF. There is nothing architecturally special about 
0xF5..0xFF. All are simply and unambiguously excluded from any 
well-formed UTF-8 byte sequence.




The differences are a matter of presentation; the outcome as to what is
permitted is the same.  The difference lies rather in whether the rules
are comprehensible.  A comprehensible definition is more likely to be
implemented correctly.  Where the presentation makes a difference is in
how malformed sequences are naturally handled.


Well, I don't think implementers have all that much trouble figuring out 
what *well-formed* UTF-8 is these days.


As for "how malformed sequences are naturally handled", I can't really 
say. Nor do I think the standard actually requires any particular 
handling to be conformant. It says thou shalt not emit them, and if you 
encounter them, thou shalt not interpret them as Unicode characters. 
Beyond that, it would be nice, of course, if people converged their 
error handling for malformed sequences in cooperative ways, but there is 
no conformance statement to that effect in the standard.


I have no trouble with the contention that the wording about "best 
practice" and "recommendations" regarding the handling of U+FFFD has 
caused some confusion and differences of interpretation among 
implementers. I'm sure the language in that area could use cleanup, 
precisely because it has led to contending, incompatible interpretations 
of the text. As to what actually *is* best practice in use of U+FFFD 
when attempting to convert ill-formed sequences handed off to UTF-8 
conversion processes, or whether the Unicode Standard should attempt to 
narrow down or change practice in that area, I am completely agnostic. 
Back to the U+FFFD thread for that discussion.


--Ken



Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 19:19:51 -0700
Ken Whistler via Unicode  wrote:

> >   and therefore should start a
> > sequence of 6 characters.  
> 
> That is completely false, and has nothing to do with the current 
> definition of UTF-8.
> 
> The current, normative definition of UTF-8, in the Unicode Standard,
> and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly
> "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot
> start a sequence of anything identifiable as UTF-8.

TUS Section 3 is like the Augean Stables.  It is a complete mess as a
standards document, imputing mental states to computing processes.

Table 3-7, for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
forms'. Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.

The differences are a matter of presentation; the outcome as to what is
permitted is the same.  The difference lies rather in whether the rules
are comprehensible.  A comprehensible definition is more likely to be
implemented correctly.  Where the presentation makes a difference is in
how malformed sequences are naturally handled.

Richard.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:

By definition D39b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a
sequence of 6 maximal subparts of an ill-formed subsequence.

("D39b" is a typo for "D93b".)


Sorry about that. :)



Conformant with what?  There is no mandatory *requirement* for a UTF-8
conversion process conformant with Unicode to have any concept of
'maximal subpart'.


Conformant with the definition of UTF-8. I agree that nothing forces a 
conversion *process* to care anything about maximal subparts, but if 
*any* process using a conformant definition of UTF-8 then goes on to 
have any concept of "maximal subpart of an ill-formed subsequence" that 
departs from definition D93b in the Unicode Standard, then it is just 
making s**t up.





I don't see a good reason to build in special logic to treat FC 80 80
80 80 80 as somehow privileged as a unit for conversion fallback,
simply because *if* UTF-8 were defined as the Unix gods intended
(which it ain't no longer) then that sequence *could* be interpreted
as an out-of-bounds scalar value (which it ain't) on spec that the
codespace *might* be extended past 10FFFF at some indefinite time in
the future (which it won't).

Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
invalid sequence.


That would be equally true of FF FF FF FF FF FF. Which was my point, 
actually.



   FC is not ASCII,


True, of course. But irrelevant, because we are talking about UTF-8 
here. And the fact that some non-UTF-8 character encoding happened to 
include 0xFC as a valid (or invalid) value need not require any 
special-case processing. A simple 8-bit to 8-bit conversion table could 
be completely regular in its processing of 0xFC for a conversion.



  and has more than one leading bit
set.  It has the six leading bits set,


True, of course.


  and therefore should start a
sequence of 6 characters.


That is completely false, and has nothing to do with the current 
definition of UTF-8.


The current, normative definition of UTF-8, in the Unicode Standard, and 
in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and 
replaces RFC 2279") states clearly that 0xFC cannot start a sequence of 
anything identifiable as UTF-8.


--Ken



Richard.





Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 17:10:54 -0700
Ken Whistler via Unicode  wrote:

> Well, working from the *current* specification:
> 
> FC 80 80 80 80 80
> and
> FF FF FF FF FF FF
> 
> are equal trash, uninterpretable as *anything* in UTF-8.
> 
> By definition D39b, either sequence of bytes, if encountered by a 
> conformant UTF-8 conversion process, would be interpreted as a
> sequence of 6 maximal subparts of an ill-formed subsequence.

There is a very good argument that 0xFC and 0xFF are not code units
(D77) - they are not used in the representation of any Unicode scalar
values.  By that argument, you have 5 maximal subparts and seven
garbage bytes.

Richard.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 17:10:54 -0700
Ken Whistler via Unicode  wrote:

> On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:
> > You were implicitly invited to argue that there was no need to
> > handle 5 and 6 byte invalid sequences.
> >  
> 
> Well, working from the *current* specification:
> 
> FC 80 80 80 80 80
> and
> FF FF FF FF FF FF
> 
> are equal trash, uninterpretable as *anything* in UTF-8.
> 
> By definition D39b, either sequence of bytes, if encountered by a 
> conformant UTF-8 conversion process, would be interpreted as a
> sequence of 6 maximal subparts of an ill-formed subsequence.

("D39b" is a typo for "D93b".)

Conformant with what?  There is no mandatory *requirement* for a UTF-8
conversion process conformant with Unicode to have any concept of
'maximal subpart'.

> I don't see a good reason to build in special logic to treat FC 80 80
> 80 80 80 as somehow privileged as a unit for conversion fallback,
> simply because *if* UTF-8 were defined as the Unix gods intended
> (which it ain't no longer) then that sequence *could* be interpreted
> as an out-of-bounds scalar value (which it ain't) on spec that the
> codespace *might* be extended past 10FFFF at some indefinite time in
> the future (which it won't).

Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
invalid sequence.  FC is not ASCII, and has more than one leading bit
set.  It has the six leading bits set, and therefore should start a
sequence of 6 characters.

Richard.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:

You were implicitly invited to argue that there was no need to handle
5 and 6 byte invalid sequences.



Well, working from the *current* specification:

FC 80 80 80 80 80
and
FF FF FF FF FF FF

are equal trash, uninterpretable as *anything* in UTF-8.

By definition D39b, either sequence of bytes, if encountered by a 
conformant UTF-8 conversion process, would be interpreted as a sequence 
of 6 maximal subparts of an ill-formed subsequence. Whatever your 
particular strategy for conversion fallbacks for uninterpretable 
sequences, it ought to treat either one of those trash sequences the 
same, in my book.
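
As a toy illustration of that (a hand-written sketch, not taken from any
shipping converter; the table and function names are made up here), the D93b
fallback can be phrased as: at each error offset, take the longest proper
prefix of some row of Table 3-7, or failing that a single byte, and emit one
U+FFFD per such maximal subpart:

    # Rows of Table 3-7, as (lo, hi) byte ranges, one per byte position.
    TABLE_3_7 = [
        [(0x00, 0x7F)],
        [(0xC2, 0xDF), (0x80, 0xBF)],
        [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
        [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
        [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
        [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
        [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
        [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
        [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
    ]

    def maximal_subpart_len(data: bytes, i: int) -> int:
        """Length of the maximal subpart of an ill-formed subsequence at offset i."""
        best = 1                            # D93b: at minimum, a subsequence of length one
        for row in TABLE_3_7:
            n = 0
            while (n < len(row) and i + n < len(data)
                   and row[n][0] <= data[i + n] <= row[n][1]):
                n += 1
            if n < len(row):                # only a proper prefix counts as ill-formed
                best = max(best, n)
        return best

    def count_replacements(trash: bytes) -> int:
        """U+FFFDs the fallback would emit for an input that is ill-formed throughout."""
        i = count = 0
        while i < len(trash):
            i += maximal_subpart_len(trash, i)
            count += 1
        return count

    print(count_replacements(b"\xfc\x80\x80\x80\x80\x80"))  # 6 -- FC prefixes nothing
    print(count_replacements(b"\xff\xff\xff\xff\xff\xff"))  # 6 -- same for FF

Both trash sequences come out as six maximal subparts, hence six U+FFFDs each.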


I don't see a good reason to build in special logic to treat FC 80 80 80 
80 80 as somehow privileged as a unit for conversion fallback, simply 
because *if* UTF-8 were defined as the Unix gods intended (which it 
ain't no longer) then that sequence *could* be interpreted as an 
out-of-bounds scalar value (which it ain't) on spec that the codespace 
*might* be extended past 10FFFF at some indefinite time in the future 
(which it won't).


--Ken


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Philippe Verdy via Unicode
This is still very unlikely to occur. There is a lot of discussion about
emoji, but they still don't account for much of the total.
The major additions were expected for CJK sinograms, but even that rate of
additions has slowed down, and while we will eventually have another
sinographic plane, it will not come soon and will be very slow to fill in.
This still leaves enough planes for several decades or more.

Maybe in the next century a new encoding will be designed, but we have
ample time to prepare for it and to reflect the best practices and
experience acquired by then. It will probably not happen because we run
out of code points, but only because experimentation will have proven
that another encoding performs better and is less complex to manage (just
like the ongoing transition from XML to JSON for the UCD), and because the
current supporters of Unicode will prefer the new format and will have
implemented it (starting with an automatic conversion from the existing
encoding in Unicode and ISO 10646, which will eventually no longer be
needed in deployed client applications).

I bet it will still be an 8-bit based encoding using 7-bit ASCII (at least
the graphic part plus a few controls, though some other controls will be
remapped), but it could also simply be a new 32-bit or 64-bit encoding.

Before such a change ever occurs, there will be the need to demonstrate
that it performs better, that it allows a smooth transition and excellent
compatibility (possibly with efficient transcoders), and that many
implementation "quirks" (including security risks) have been resolved.

2017-06-01 21:54 GMT+02:00 Doug Ewell via Unicode :

> Richard Wordingham wrote:
>
> > even supporting 6-byte patterns just in case 20.1 bits eventually turn
> > out not to be enough,
>
> Oh, gosh, here we go with this.
>
> What will we do if 31 bits turn out not to be enough?
>
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 01 Jun 2017 12:54:45 -0700
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
> 
> > even supporting 6-byte patterns just in case 20.1 bits eventually
> > turn out not to be enough,  
> 
> Oh, gosh, here we go with this.

You were implicitly invited to argue that there was no need to handle
5 and 6 byte invalid sequences. 

> What will we do if 31 bits turn out not to be enough?

A compatible extension of UTF-16 to unbounded length has already been
designed.  Prefix bytes 0xFF can be used to extend the length for UTF-8
by 8 bytes at a time.  Extending UTF-32 is not beyond the wit of man,
and we know that UTF-16 could have been done better if the need had
been foreseen.

While it seems natural to hold a Unicode scalar value in a single
machine word of some length, this is not necessary, just highly
convenient.

In short, it won't be a big problem intrinsically.  The UCD may get a
bit unwieldy, which may be a problem for small systems without Internet
access.

Richard.


Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Doug Ewell via Unicode
Richard Wordingham wrote:

> even supporting 6-byte patterns just in case 20.1 bits eventually turn
> out not to be enough,

Oh, gosh, here we go with this.

What will we do if 31 bits turn out not to be enough?
 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag (c) via Unicode

On 6/1/2017 11:53 AM, Shawn Steele wrote:


But those are IETF definitions.  They don’t have to mean the same 
thing in Unicode - except that people working in this field probably 
expect them to.




That's the thing. And even if Unicode had its own version of RFC 2119, 
one would consider it recommended for Unicode to follow widespread 
industry practice (there's that "r" word again!).


A./


*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Thursday, June 1, 2017 11:44 AM
*To:* unicode@unicode.org
*Subject:* Re: Feedback on the proposal to change U+FFFD generation 
when decoding ill-formed UTF-8


On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:

I think that the (or a) key problem is that the current "best practice" is treated as 
"SHOULD" in RFC parlance.  When what this really needs is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same 
thing.


It's not that they "tend to", it's in RFC 2119:


SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

The clear inference is that while the non-recommended practice is not 
prohibited, you better have some valid reason why you are deviating 
from it (and, reading between the lines, it would not hurt if you 
documented those reasons).



  So, when an implementation deviates, then you get bugs (as we see here).  Given 
that there are very valid engineering reasons why someone might want to choose a 
different behavior for their needs - without harming the intent of the standard at all in 
most cases - I think the current/proposed language is too "strong".


Yes and no. ICU would be perfectly fine deviating from the existing 
recommendation and stating their engineering reasons for doing so. 
That would allow them to close their bug ("by documentation").


What's not OK is to take an existing recommendation and change it to 
something else, just to make bug reports go away for one 
implementation. That's like two sleepers fighting over a blanket 
that's too short. Whenever one is covered, the other is exposed.


If it is discovered that the existing recommendation is not based on 
anything like truly better behavior, there may be a case to change it 
to something that's equivalent to a MAY. Perhaps a list of nearly 
equally capable options.


(If that language is not in the standard already, a strong "an 
implementation MUST not depend on the use of a particular strategy for 
replacement of invalid code sequences" clearly ought to be added).


A./


-Shawn

-Original Message-

From: Alastair Houghton [mailto:alast...@alastairs-place.net]

Sent: Thursday, June 1, 2017 4:05 AM

To: Henri Sivonen 

Cc: unicode Unicode Discussion ; Shawn 
Steele 

Subject: Re: Feedback on the proposal to change U+FFFD generation when 
decoding ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode 
  wrote:

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode

   wrote:

* As far as I can tell, there are two (maybe three) sane approaches 
to this problem:

* Either a "maximal" emission of one U+FFFD for every byte 
that exists outside of a good sequence

* Or a "minimal" version that presumes the lead byte was 
counting trail bytes correctly even if the resulting sequence was invalid.  In that case 
just use one U+FFFD.

* And (maybe, I haven't heard folks arguing for this one) 
emit one U+FFFD at the first garbage byte and then ignore the input until valid 
data starts showing up again.  (So you could have 1 U+FFFD for a string of a 
hundred garbage bytes as long as there weren't any valid sequences within that 
group).

I think it's not useful to come up with new rules in the abstract.

The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).

All of these approaches are explicitly allowed by the standard at present.  
All three are reasonable, and each has its own pros and cons in a 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 12:32:08 +0300
Henri Sivonen via Unicode  wrote:

> On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode
>  wrote:
> > On Wed, 31 May 2017 15:12:12 +0300
> > Henri Sivonen via Unicode  wrote:  
> >> I am not claiming it's too difficult to implement. I think it
> >> inappropriate to ask implementations, even from-scratch ones, to
> >> take on added complexity in error handling on mere aesthetic
> >> grounds. Also, I think it's inappropriate to induce
> >> implementations already written according to the previous guidance
> >> to change (and risk bugs) or to make the developers who followed
> >> the previous guidance with precision be the ones who need to
> >> explain why they aren't following the new guidance.  
> >
> > How straightforward is the FSM for back-stepping?  
> 
> This seems beside the point, since the new guidance wasn't advertised
> as improving backward stepping compared to the old guidance.
> 
> (On the first look, I don't see the new guidance improving back
> stepping. In fact, if the UTC meant to adopt ICU's behavior for
> obsolete five and six-byte bit patterns, AFAICT, backstepping with the
> ICU behavior requires examining more bytes backward than the old
> guidance required.)

The greater simplicity comes from the alternative behaviour being
more 'natural'.  It's a little difficult to count states without
constraints on the machines, but for forward stepping, even supporting
6-byte patterns just in case 20.1 bits eventually turn out not to be
enough, there are five intermediate states - '1 byte to go', '2
bytes to go', ... '5 bytes to go'.  For backward stepping, there are
similarly five intermediate states - '1 trailing byte seen', and so
on. 

For the recommended handling, forward stepping has seven
intermediate states, each directly reachable from the starting state -
start byte C2..DF; start byte E0; start byte E1..EC, EE or EF; start
byte ED; start byte F0; start byte F1..F3; and start byte F4.  No
further intermediate states are required.

For backward stepping under the recommended handling, I see a need for
8 intermediate states, depending on how many trail bytes have been
considered and whether the last one was in the range 80..8F (precludes
E0 and F0 immediately
preceding), 90..9F (precludes E0 and F4 immediately preceding) or A0..BF
(precludes ED and F4 immediately preceding). The logic feels quite
complicated. If I implement it, I'm not likely to code it up as an FSM.
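
For concreteness, here is what the seven-state forward automaton could look
like -- a hand-rolled Python sketch, illustrative only and not anyone's
shipping code, with the intermediate states named after the lead bytes that
enter them:

    START, TAIL1, TAIL2, TAIL3, E0, ED, F0, F4 = range(8)
    # TAIL1 = "start byte C2..DF" (one generic trail byte to go),
    # TAIL2 = "start byte E1..EC, EE or EF", TAIL3 = "start byte F1..F3";
    # E0, ED, F0, F4 are the lead bytes with a constrained first trail byte.

    def next_state(state, b):
        """Next state after byte b, or None if b is ill-formed in this state."""
        if state == START:
            if b <= 0x7F: return START
            if 0xC2 <= b <= 0xDF: return TAIL1
            if b == 0xE0: return E0
            if 0xE1 <= b <= 0xEC or b in (0xEE, 0xEF): return TAIL2
            if b == 0xED: return ED
            if b == 0xF0: return F0
            if 0xF1 <= b <= 0xF3: return TAIL3
            if b == 0xF4: return F4
            return None                     # 80..BF, C0, C1, F5..FF never lead
        ranges = {TAIL1: (0x80, 0xBF, START), TAIL2: (0x80, 0xBF, TAIL1),
                  TAIL3: (0x80, 0xBF, TAIL2), E0: (0xA0, 0xBF, TAIL1),
                  ED: (0x80, 0x9F, TAIL1), F0: (0x90, 0xBF, TAIL2),
                  F4: (0x80, 0x8F, TAIL2)}
        lo, hi, nxt = ranges[state]
        return nxt if lo <= b <= hi else None

    def is_well_formed(data: bytes) -> bool:
        state = START
        for b in data:
            state = next_state(state, b)
            if state is None:
                return False
        return state == START               # must not end inside a sequence

    assert is_well_formed("€".encode("utf-8"))      # E2 82 AC
    assert not is_well_formed(b"\xe0\x80\x80")      # non-shortest form
    assert not is_well_formed(b"\xed\xa0\x80")      # surrogate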

> > You should have researched implementations as they were in 2007.  

> I don't see how the state of things in 2007 is relevant to a decision
> taken in 2017.

Because the argument is that the original decision taken in 2008 was
wrong.  I have a feeling I have overlooked some of the discussion
around then, because I can't find my contribution in the archives, and I
thought I objected at the time.

Richard.


RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
But those are IETF definitions.  They don’t have to mean the same thing in 
Unicode - except that people working in this field probably expect them to.

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag 
via Unicode
Sent: Thursday, June 1, 2017 11:44 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:

I think that the (or a) key problem is that the current "best practice" is 
treated as "SHOULD" in RFC parlance.  When what this really needs is a "MAY".



People reading standards tend to treat "SHOULD" and "MUST" as the same thing.

It's not that they "tend to", it's in RFC 2119:
SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.


The clear inference is that while the non-recommended practice is not 
prohibited, you better have some valid reason why you are deviating from it 
(and, reading between the lines, it would not hurt if you documented those 
reasons).



 So, when an implementation deviates, then you get bugs (as we see here).  
Given that there are very valid engineering reasons why someone might want to 
choose a different behavior for their needs - without harming the intent of the 
standard at all in most cases - I think the current/proposed language is too 
"strong".

Yes and no. ICU would be perfectly fine deviating from the existing 
recommendation and stating their engineering reasons for doing so. That would 
allow them to close their bug ("by documentation").

What's not OK is to take an existing recommendation and change it to something 
else, just to make bug reports go away for one implementation. That's like two 
sleepers fighting over a blanket that's too short. Whenever one is covered, the 
other is exposed.

If it is discovered that the existing recommendation is not based on anything 
like truly better behavior, there may be a case to change it to something 
that's equivalent to a MAY. Perhaps a list of nearly equally capable options.

(If that language is not in the standard already, a strong "an implementation 
MUST not depend on the use of a particular strategy for replacement of invalid 
code sequences", clearly ought to be added).

A./







-Shawn



-Original Message-

From: Alastair Houghton [mailto:alast...@alastairs-place.net]

Sent: Thursday, June 1, 2017 4:05 AM

To: Henri Sivonen 

Cc: unicode Unicode Discussion 
; Shawn Steele 


Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8



On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode 
 wrote:



On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode

 wrote:

* As far as I can tell, there are two (maybe three) sane approaches to this 
problem:

   * Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence

   * Or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid.  In that case just 
use one U+FFFD.

   * And (maybe, I haven't heard folks arguing for this one) emit one 
U+FFFD at the first garbage byte and then ignore the input until valid data 
starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that group).



I think it's not useful to come up with new rules in the abstract.



The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).



All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream, I might well plump 
for the third.  I can even *imagine* there being circumstances under which I 
might choose the first for some reason, in spite of my preference for the 
second 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode

  
  
On 6/1/2017 10:41 AM, Shawn Steele via
  Unicode wrote:


  I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance.  When what this really needs is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same thing. 


It's not that they "tend to", it's in RFC 2119:


  SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.



The clear inference is that while the non-recommended practice is
not prohibited, you better have some valid reason why you are
deviating from it (and, reading between the lines, it would not hurt
if you documented those reasons).


   So, when an implementation deviates, then you get bugs (as we see here).  Given that there are very valid engineering reasons why someone might want to choose a different behavior for their needs - without harming the intent of the standard at all in most cases - I think the current/proposed language is too "strong".


Yes and no. ICU would be perfectly fine deviating from the existing
recommendation and stating their engineering reasons for doing so.
That would allow them to close their bug ("by documentation").

What's not OK is to take an existing recommendation and change it to
something else, just to make bug reports go away for one
implementation. That's like two sleepers fighting over a blanket
that's too short. Whenever one is covered, the other is exposed.

If it is discovered that the existing recommendation is not based on
anything like truly better behavior, there may be a case to change
it to something that's equivalent to a MAY. Perhaps a list of nearly
equally capable options.

(If that language is not in the standard already, a strong "an
implementation MUST not depend on the use of a particular strategy
for replacement of invalid code sequences" clearly ought to be
added).

A./


  

-Shawn

-Original Message-
From: Alastair Houghton [mailto:alast...@alastairs-place.net] 
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen 
Cc: unicode Unicode Discussion ; Shawn Steele 
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode  wrote:

  

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode 
 wrote:


  * As far as I can tell, there are two (maybe three) sane approaches to this problem:
   * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence
   * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid.  In that case just use one U+FFFD.
   * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group).



I think it's not useful to come up with new rules in the abstract.

  
  
The first two aren’t “new” rules; they’re, respectively, the current “Best Practice”, the proposed “Best Practice” and one other potentially reasonable approach that might make sense e.g. if the problem you’re worrying about is serial data slip or corruption of a compressed or encrypted file (where corruption will occur until re-synchronisation happens, and as a result you wouldn’t expect to have any knowledge whatever of the number of characters represented in the data in question).

All of these approaches are explicitly allowed by the standard at present.  All three are reasonable, and each has its own pros and cons in a technical sense (leaving aside how prevalent the approach in question might be).  In a general purpose library I’d probably go for the second one; if I knew I was dealing with a potentially corrupt compressed or encrypted stream, I might well plump for the third.  I can even *imagine* there being circumstances under which I might choose the first for some reason, in spite of my preference for the second approach.

I don’t think it makes sense to standardise on *one* of these approaches, so if what you’re saying is that the “Best Practice” has been treated as if it was part of the specification (and I think that *is* essentially your claim), then I’m in favour of either removing it completely, or (better) replacing it with Shawn’s suggestion - i.e. 

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
I think that the (or a) key problem is that the current "best practice" is 
treated as "SHOULD" in RFC parlance.  When what this really needs is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same thing.  
So, when an implementation deviates, then you get bugs (as we see here).  Given 
that there are very valid engineering reasons why someone might want to choose 
a different behavior for their needs - without harming the intent of the 
standard at all in most cases - I think the current/proposed language is too 
"strong".

-Shawn

-Original Message-
From: Alastair Houghton [mailto:alast...@alastairs-place.net] 
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen 
Cc: unicode Unicode Discussion ; Shawn Steele 

Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode  wrote:
> 
> On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode 
>  wrote:
>> * As far as I can tell, there are two (maybe three) sane approaches to this 
>> problem:
>>* Either a "maximal" emission of one U+FFFD for every byte that 
>> exists outside of a good sequence
>>* Or a "minimal" version that presumes the lead byte was counting 
>> trail bytes correctly even if the resulting sequence was invalid.  In that 
>> case just use one U+FFFD.
>>* And (maybe, I haven't heard folks arguing for this one) emit one 
>> U+FFFD at the first garbage byte and then ignore the input until valid data 
>> starts showing up again.  (So you could have 1 U+FFFD for a string of a 
>> hundred garbage bytes as long as there weren't any valid sequences within 
>> that group).
> 
> I think it's not useful to come up with new rules in the abstract.

The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).

All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream, I might well plump 
for the third.  I can even *imagine* there being circumstances under which I 
might choose the first for some reason, in spite of my preference for the 
second approach.

I don’t think it makes sense to standardise on *one* of these approaches, so if 
what you’re saying is that the “Best Practice” has been treated as if it was 
part of the specification (and I think that *is* essentially your claim), then 
I’m in favour of either removing it completely, or (better) replacing it with 
Shawn’s suggestion - i.e. listing three reasonable approaches and telling 
developers to document which they take and why.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode

On 6/1/2017 2:32 AM, Henri Sivonen via Unicode wrote:

On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode
 wrote:

Henri Sivonen wrote:


If anything, I hope this thread results in the establishment of a
requirement for proposals to come with proper research about what
multiple prominent implementations do about the subject matter of a
proposal concerning changes to text about implementation behavior.

Considering that several folks have objected that the U+FFFD
recommendation is perceived as having the weight of a requirement, I
think adding Henri's good advice above as a "requirement" seems
heavy-handed. Who will judge how much research qualifies as "proper"?


I agree with Henri on these general points:

1) Requiring extensive research on implementation practice is crucial in 
dealing with any changes to long standing definitions, algorithms, 
properties and recommendations.
2) Not having a perfect definition of what "extensive" means is not an 
excuse to do nothing.
3) Evaluating only the proposer's implementation (or only ICU) is not 
sufficient.
4) Changing a recommendation that many implementers (or worse, an 
implementers' collective) have chosen to adopt is a breaking change.
5) Breaking changes to fundamental algorithms require extraordinarily 
strong justification including, but not limited to "proof" that the 
existing definition/recommendation is not workable or presents grave 
security risks that cannot be mitigated any other way.


I continue to see a disturbing lack of appreciation of these issues in 
some of the replies to this discussion (and some past decisions by the UTC).


A./


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode  wrote:
> 
> On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode
>  wrote:
>> * As far as I can tell, there are two (maybe three) sane approaches to this 
>> problem:
>>* Either a "maximal" emission of one U+FFFD for every byte that 
>> exists outside of a good sequence
>>* Or a "minimal" version that presumes the lead byte was counting 
>> trail bytes correctly even if the resulting sequence was invalid.  In that 
>> case just use one U+FFFD.
>>* And (maybe, I haven't heard folks arguing for this one) emit one 
>> U+FFFD at the first garbage byte and then ignore the input until valid data 
>> starts showing up again.  (So you could have 1 U+FFFD for a string of a 
>> hundred garbage bytes as long as there weren't any valid sequences within 
>> that group).
> 
> I think it's not useful to come up with new rules in the abstract.

The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice”, the proposed “Best Practice” and one other potentially reasonable 
approach that might make sense e.g. if the problem you’re worrying about is 
serial data slip or corruption of a compressed or encrypted file (where 
corruption will occur until re-synchronisation happens, and as a result you 
wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).

All of these approaches are explicitly allowed by the standard at present.  All 
three are reasonable, and each has its own pros and cons in a technical sense 
(leaving aside how prevalent the approach in question might be).  In a general 
purpose library I’d probably go for the second one; if I knew I was dealing 
with a potentially corrupt compressed or encrypted stream, I might well plump 
for the third.  I can even *imagine* there being circumstances under which I 
might choose the first for some reason, in spite of my preference for the 
second approach.

I don’t think it makes sense to standardise on *one* of these approaches, so if 
what you’re saying is that the “Best Practice” has been treated as if it was 
part of the specification (and I think that *is* essentially your claim), then 
I’m in favour of either removing it completely, or (better) replacing it with 
Shawn’s suggestion - i.e. listing three reasonable approaches and telling 
developers to document which they take and why.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Henri Sivonen via Unicode
On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode
 wrote:
> On Wed, 31 May 2017 15:12:12 +0300
> Henri Sivonen via Unicode  wrote:
>> I am not claiming it's too difficult to implement. I think it
>> inappropriate to ask implementations, even from-scratch ones, to take
>> on added complexity in error handling on mere aesthetic grounds. Also,
>> I think it's inappropriate to induce implementations already written
>> according to the previous guidance to change (and risk bugs) or to
>> make the developers who followed the previous guidance with precision
>> be the ones who need to explain why they aren't following the new
>> guidance.
>
> How straightforward is the FSM for back-stepping?

This seems beside the point, since the new guidance wasn't advertised
as improving backward stepping compared to the old guidance.

(On the first look, I don't see the new guidance improving back
stepping. In fact, if the UTC meant to adopt ICU's behavior for
obsolete five and six-byte bit patterns, AFAICT, backstepping with the
ICU behavior requires examining more bytes backward than the old
guidance required.)

>> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
>>  wrote:
>> > The UTF-8 conversion code that I wrote for ICU, and apparently the
>> > code that various other people have written, collects sequences
>> > starting from lead bytes, according to the original spec, and at
>> > the end looks at whether the assembled code point is too low for
>> > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a
>> > non-trail byte is quite natural, and reading the PRI text
>> > accordingly is quite natural too.
>>
>> I don't doubt that other people have written code with the same
>> concept as ICU, but as far as non-shortest form handling goes in the
>> implementations I tested (see URL at the start of this email) ICU is
>> the lone outlier.
>
> You should have researched implementations as they were in 2007.

I don't see how the state of things in 2007 is relevant to a decision
taken in 2017. It's relevant that by 2017, prominent implementations
had adopted the old Unicode guidance, and, that being the case, it's
inappropriate to change the guidance for aesthetic reasons or to favor
the Unicode Consortium-hosted implementation.

On Wed, May 31, 2017 at 8:43 PM, Shawn Steele via Unicode
 wrote:
> I do not understand the energy being invested in a case that shouldn't 
> happen, especially in a case that is a subset of all the other bad cases that 
> could happen.

I'm a browser developer. I've explained previously on this list and in
my blog post why the browser developer / Web standard culture favors
well-defined behavior in error cases these days.

On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode
 wrote:
> Henri Sivonen wrote:
>
>> If anything, I hope this thread results in the establishment of a
>> requirement for proposals to come with proper research about what
>> multiple prominent implementations do about the subject matter of a
>> proposal concerning changes to text about implementation behavior.
>
> Considering that several folks have objected that the U+FFFD
> recommendation is perceived as having the weight of a requirement, I
> think adding Henri's good advice above as a "requirement" seems
> heavy-handed. Who will judge how much research qualifies as "proper"?

In the Unicode scope, it's indeed harder to draw a clear line to decide
what the prominent implementations are than in the WHATWG scope. The
point is that just checking ICU is not good enough. Someone making a
proposal should check the four major browser engines and a bunch of
system frameworks and standard libraries for well-known programming
languages. Which frameworks and standard libraries and how many is not
precisely definable objectively and depends on the subject matter
(there are many UTF-8 decoders but e.g. fewer text shaping engines).
There will be diminishing returns to checking them. Chances are that
it's not necessary to check too many for a pattern to emerge to judge
whether the existing spec language is being implemented (don't change
it) or being ignored (probably should be changed then).

In any case, "we can't check everything or choose fairly what exactly
to check" shouldn't be a reason for it to be OK to just check ICU or
to make abstract arguments without checking any implementations at
all. Checking multiple popular implementations is homework better done
than just checking ICU even if it's up to the person making the
proposal to choose which implementations to check exactly. The
committee should be able to recognize if the list of implementations
tested looks like a list of broadly-deployed implementations.

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode
 wrote:
> * As far as I can tell, there are two (maybe three) sane approaches to this 
> problem:
>

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 31 May 2017, at 20:42, Shawn Steele via Unicode  wrote:
> 
>> And *that* is what the specification says.  The whole problem here is that 
>> someone elevated
>> one choice to the status of “best practice”, and it’s a choice that some of 
>> us don’t think *should*
>> be considered best practice.
> 
>> Perhaps “best practice” should simply be altered to say that you *clearly 
>> document* your behavior
>> in the case of invalid UTF-8 sequences, and that code should not rely on the 
>> number of U+FFFDs 
>> generated, rather than suggesting a behaviour?
> 
> That's what I've been suggesting.
> 
> I think we could maybe go a little further though:
> 
> * Best practice is clearly not to depend on the # of U+FFFDs generated by 
> another component/app.  Clearly that can't be relied upon, so I think 
> everyone can agree with that.
> * I think encouraging documentation of behavior is cool, though there are 
> probably low priority bugs and people don't like to read the docs in that 
> detail, so I wouldn't expect very much from that.
> * As far as I can tell, there are two (maybe three) sane approaches to this 
> problem:
>   * Either a "maximal" emission of one U+FFFD for every byte that exists 
> outside of a good sequence 
>   * Or a "minimal" version that presumes the lead byte was counting trail 
> bytes correctly even if the resulting sequence was invalid.  In that case 
> just use one U+FFFD.
>   * And (maybe, I haven't heard folks arguing for this one) emit one 
> U+FFFD at the first garbage byte and then ignore the input until valid data 
> starts showing up again.  (So you could have 1 U+FFFD for a string of a 
> hundred garbage bytes as long as there weren't any valid sequences within 
> that group).
> * I'd be happy if the best practice encouraged one of those two (or maybe 
> three) approaches.  I think an approach that called rand() to see how many 
> U+FFFDs to emit when it encountered bad data is fair to discourage.
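
As a rough sketch of how differently those options count replacements (toy
code only, not any product's converter; it assumes the run of garbage bytes
has already been isolated, and "maximal" here stands in for one U+FFFD per
maximal subpart, which for this input is one per byte):

    def claimed_trail_count(lead: int) -> int:
        """Trail bytes the lead byte 'claims' under the old 1..6-byte bit patterns."""
        if lead < 0xC0:
            return 0                       # not a lead byte at all
        n = 0
        while n < 7 and lead & (0x40 >> n):
            n += 1
        return n

    def fffd_count(garbage: bytes, policy: str) -> int:
        if policy == "maximal":            # one per offending byte
            return len(garbage)
        if policy == "skip-until-valid":   # one per run of garbage
            return 1
        if policy == "minimal":            # trust the lead byte's claimed length
            i = count = 0
            while i < len(garbage):
                i += 1 + claimed_trail_count(garbage[i])
                count += 1
            return count
        raise ValueError(policy)

    trash = b"\xfc\x80\x80\x80\x80\x80"
    for policy in ("maximal", "minimal", "skip-until-valid"):
        print(policy, fffd_count(trash, policy))   # 6, 1 and 1 respectively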

Agreed.

Kind regards,

Alastair.

--
http://alastairs-place.net




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 31 May 2017, at 20:24, Shawn Steele via Unicode  wrote:
> 
> > For implementations that emit FFFD while handling text conversion and 
> > repair (ie, converting ill-formed
> > UTF-8 to well-formed), it is best for interoperability if they get the same 
> > results, so that indices within the
> > resulting strings are consistent across implementations for all the correct 
> > characters thereafter.
>  
> That seems optimistic :) 
>  
> If interoperability is the goal, then it would seem to me that changing the 
> recommendation would be contrary to that goal.  There are systems that will 
> not or cannot change to a new recommendation.  If such systems are updated, 
> then adoption of those systems will likely take some time.

Indeed, if interoperability is the goal, the behaviour should be fully 
specified, not merely recommended.  At present, though, it appears that we have 
(broadly) two different behaviours in the wild, and nobody wants to change what 
they presently do.

Personally I agree with Shawn on this; the presence of a U+FFFD indicates that 
the input was invalid somehow.  You don’t know *how* it was invalid, and 
probably shouldn’t rely on equivalence with another invalid string.

There are obviously some exceptions - e.g. it *may* be desirable in the context 
of browsers to specify the behaviour in order to avoid behavioural differences 
being used for Javascript-based “fingerprinting”.  But I don’t see why WHATWG 
(for instance) couldn’t do that.

Kind regards,

Alastair.

--
http://alastairs-place.net