RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Shawn Steele via Unicode
So basically this came about because a bug was filed against code for not 
following the "recommendation."  To fix that, the recommendation will be 
changed.  However, that will then lead to bug reports against other existing 
code that does not follow the new recommendation.

I totally get the reasoning about keeping forward/backward scanning in sync 
without decoding for some implementations; however, I do not think that the 
practices that benefit those implementations should be extended to other 
applications that are happy with a different practice.

In either case, the bad characters are garbage, so neither approach is "better" 
- except that one or the other may be more conducive to the requirements of the 
particular API/application.

I really think the correct approach here is to allow any number of replacement 
characters without prejudice, perhaps with a discussion of the pros and cons of 
the various approaches if people feel that is really necessary.

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson 
via Unicode
Sent: Friday, May 26, 2017 2:16 PM
To: Ken Whistler 
Cc: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding 
ill-formed UTF-8

On 05/26/2017 12:22 PM, Ken Whistler wrote:
> 
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>> The link provided about the PRI doesn't lead to the comments.
>>
> 
> PRI #121 (August, 2008) pre-dated the practice of keeping all the 
> feedback comments together with the PRI itself in a numbered directory 
> with the name "feedback.html". But the comments were collected 
> together at the time and are accessible here:
> 
> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
> 
> Also there was a separately submitted comment document:
> 
> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
> 
> And the minutes of the pertinent UTC meeting (UTC #116):
> 
> http://www.unicode.org/L2/L2008/08253.htm
> 
> The minutes simply capture the consensus to adopt Option #2 from PRI 
> #121, and the relevant action items.
> 
> I now return the floor to the distinguished disputants to continue 
> litigating history. ;-)
> 
> --Ken
> 
>

The reason this discussion got started was that in December, someone came to me 
and said that the code I support does not follow Unicode best practices, and 
suggested that I needed to change it, though no ticket has (yet) been filed.  I 
was surprised, and posted a query to this list about what the advantages of the 
new approach are.  There were a number of replies, but I did not see anything 
that seemed definitive.  After a month, I created a ticket with Unicode; Markus 
was assigned to research it, and he came up with the proposal currently being 
debated.

Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print. 
That seems to be borne out by the fact that Markus, even with his stake in ICU, 
supports option #2.

Looking at the comments, I don't see any discussion of the effect of this 
change on the treatment of overlongs.  My guess is that this effect was 
unintentional.

So I have code that handled overlongs in the only correct way possible when 
they were acceptable, and in the obvious way after they became illegal, and 
now, without apparent discussion (which is very much akin to "flimsy reasons"), 
it is suddenly no longer "best practice".  And that change came "rather late in 
the game".  That this escaped notice for years indicates that the specifics of 
REPLACEMENT CHARACTER handling don't matter all that much.

To cut to the chase, I think Unicode should issue a Corrigendum to the effect 
that it was never the intent of this change to say that treating overlongs as a 
single unit isn't best practice.  I'm not sure this warrants a full-fledged 
Corrigendum, though.  But I believe the text of the best practices should 
indicate that treating overlongs as a single unit is just as acceptable as 
Martin's interpretation.

I believe this is pretty much in line with Shawn's position.  Certainly, a 
discussion of the reasons one might choose one interpretation over another 
should be included in TUS.  That would likely have answered my original query, 
which then would never have been posted.



Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode

On 05/26/2017 12:22 PM, Ken Whistler wrote:


On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected together 
at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken




The reason this discussion got started was that in December, someone 
came to me and said that the code I support does not follow Unicode best 
practices, and suggested that I needed to change it, though no ticket 
has (yet) been filed.  I was surprised, and posted a query to this list 
about what the advantages of the new approach are.  There were a number 
of replies, but I did not see anything that seemed definitive.  After a 
month, I created a ticket with Unicode; Markus was assigned to research 
it, and he came up with the proposal currently being debated.


Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print. 
That seems to be borne out by the fact that Markus, even with his stake 
in ICU, supports option #2.


Looking at the comments, I don't see any discussion of the effect of 
this change on the treatment of overlongs.  My guess is that this effect 
was unintentional.


So I have code that handled overlongs in the only correct way possible 
when they were acceptable, and in the obvious way after they became 
illegal, and now, without apparent discussion (which is very much akin 
to "flimsy reasons"), it is suddenly no longer "best practice".  And 
that change came "rather late in the game".  That this escaped notice 
for years indicates that the specifics of REPLACEMENT CHARACTER handling 
don't matter all that much.


To cut to the chase, I think Unicode should issue a Corrigendum to the 
effect that it was never the intent of this change to say that treating 
overlongs as a single unit isn't best practice.  I'm not sure this 
warrants a full-fledged Corrigendum, though.  But I believe the text of 
the best practices should indicate that treating overlongs as a single 
unit is just as acceptable as Martin's interpretation.


I believe this is pretty much in line with Shawn's position.  Certainly, 
a discussion of the reasons one might choose one interpretation over 
another should be included in TUS.  That would likely have answered my 
original query, which then would never have been posted.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Ken Whistler via Unicode


On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected together 
at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken







Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode

On 05/26/2017 04:28 AM, Martin J. Dürst wrote:
It may be worth thinking about whether the Unicode standard should 
mention implementations like yours. But there should be no doubt about 
the fact that the PRI and Unicode 5.2 (and the current version of 
Unicode) are clear about what they recommend, and that that 
recommendation is based on the definition of UTF-8 at that time (and 
still in force), and not on a historical definition of UTF-8.


The link provided about the PRI doesn't lead to the comments.

Is there any evidence that there was a realization that the language 
being adopted would lead to overlongs being split into multiple subparts?




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Markus Scherer via Unicode
On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst wrote:

> But there's plenty in the text that makes it absolutely clear that some
> things cannot be included. In particular, it says
>
> 
> The term “maximal subpart of an ill-formed subsequence” refers to the code
> units that were collected in this manner. They could be the start of a
> well-formed sequence, except that the sequence lacks the proper
> continuation. Alternatively, the converter may have found a continuation
> code unit, which cannot be the start of a well-formed sequence.
> 
>
> And the "in this manner" refers to:
> 
> A sequence of code units will be processed up to the point where the
> sequence either can be unambiguously interpreted as a particular Unicode
> code point or where the converter recognizes that the code units collected
> so far constitute an ill-formed subsequence.
> 
>
> So we have the same thing twice: Bail out as soon as something is
> ill-formed.


The UTF-8 conversion code that I wrote for ICU, and apparently the code
that various other people have written, collects sequences starting from
lead bytes, according to the original spec, and at the end looks at whether
the assembled code point is too low for the lead byte, or is a surrogate, or
is above 10FFFF. Stopping at a non-trail byte is quite natural, and reading
the PRI text accordingly is quite natural too.
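
To make that shape concrete, here is a minimal sketch of such a converter
(an illustration only, not ICU's actual code; the name decode_one is made
up, and collection is capped at the modern four-byte limit for brevity).
It consumes a lead byte plus however many trail bytes the lead byte
announces, and only validates the assembled value at the end:

    #include <stdint.h>
    #include <stddef.h>

    #define RC 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

    /* Decode one unit from s (n > 0 bytes available).  Returns a code
     * point, or RC for an ill-formed unit, and stores the number of
     * bytes consumed in *len.  The lead byte alone decides how many
     * trail bytes to collect; overlong, surrogate, and out-of-range
     * values are rejected only after the whole unit has been read. */
    static uint32_t decode_one(const uint8_t *s, size_t n, size_t *len) {
        uint8_t b = s[0];
        size_t need, i;
        uint32_t cp, min;

        if (b < 0x80) { *len = 1; return b; }     /* ASCII            */
        if (b < 0xC0) { *len = 1; return RC; }    /* stray trail byte */
        if (b < 0xE0)      { need = 1; cp = b & 0x1F; min = 0x80;    }
        else if (b < 0xF0) { need = 2; cp = b & 0x0F; min = 0x800;   }
        else if (b < 0xF8) { need = 3; cp = b & 0x07; min = 0x10000; }
        else { *len = 1; return RC; }             /* F8..FF           */

        for (i = 1; i <= need && i < n && (s[i] & 0xC0) == 0x80; i++)
            cp = (cp << 6) | (s[i] & 0x3F);
        *len = i;
        if (i != need + 1) return RC;             /* truncated        */
        if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return RC;      /* overlong, surrogate, or above 10FFFF   */
        return cp;
    }

With this shape, an overlong such as C0 80 or E0 80 80 is consumed as a
single unit and costs exactly one U+FFFD, because the range check only
runs after the whole sequence has been collected.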

Aside from UTF-8 history, there is a reason for preferring a more
"structural" definition of UTF-8 over one defined purely in terms of valid
sequences. This applies to code that *works* on UTF-8 strings rather than
just converting them. For UTF-8 *processing* you need to be able to iterate
both forward and backward, and sometimes you need to skip over n units in
either direction without collecting code points -- but your iteration needs
to be consistent in all cases. This is easier to implement (especially in
fast, short, inline code) if you only have to look at how many trail bytes
follow a lead byte, without having to check whether the first trail byte is
in a certain range for some specific lead bytes.
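
For illustration, here is a sketch of that kind of inline skipping (an
assumed example, not ICU API; the helper names are made up).  One simple
rule that stays consistent in both directions is to treat a "unit" as one
non-trail byte plus any immediately following 10xxxxxx trail bytes:

    #include <stddef.h>
    #include <stdint.h>

    /* Step forward over one unit: one non-trail byte plus the trail
     * bytes that immediately follow it. */
    static size_t skip_forward(const uint8_t *s, size_t i, size_t n) {
        if (i < n) {
            i++;
            while (i < n && (s[i] & 0xC0) == 0x80)
                i++;
        }
        return i;
    }

    /* Step backward over one unit: back up over trail bytes until the
     * byte that starts the unit. */
    static size_t skip_backward(const uint8_t *s, size_t i) {
        if (i > 0) {
            i--;
            while (i > 0 && (s[i] & 0xC0) == 0x80)
                i--;
        }
        return i;
    }

Neither helper decodes anything or cares whether the first trail byte is
in the narrower range required after an E0, ED, F0, or F4 lead byte; it
only distinguishes trail bytes from everything else.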

(And don't say that everyone can validate all strings once and then all
code can assume they are valid: that just does not work for library code;
you cannot assume anything about your input strings, and you cannot crash
when they are ill-formed.)

markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Philippe Verdy via Unicode
>
> Citing directly from the PRI:
>
> 
> The term "maximal subpart of the ill-formed subsequence" refers to the
> longest potentially valid initial subsequence or, if none, then to the next
> single code unit.
> 
>

The way I understand it is that C0 80 will have TWO maximal subparts, because
there is no valid initial subsequence: only the next single code unit (C0) will
be considered. After this, the following byte 80 does not start any valid
initial subsequence either, so here again only the next single code unit (80)
will be considered. You'll get U+FFFD replacements emitted twice. This covers
all the cases of "overlong" sequences that were allowed by the old UTF-8
definition in the original RFC.

For E3 80 20, there will be only ONE maximal subpart, because E3 80 is a valid
initial subsequence of a three-byte sequence that merely lacks its final
continuation byte, so a single U+FFFD replacement will be emitted, followed by
the valid byte 20, which correctly decodes as U+0020.
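
To make the two examples concrete, here is a minimal sketch of a decoder
written to this reading of "maximal subpart" (an illustration only; the
name decode_strict and the structure are made up, not taken from any
existing implementation).  It checks the allowed range of the first trail
byte for each lead byte, which is what splits C0 80 into two subparts
while keeping a truncated E3 80 together as one:

    #include <stdint.h>
    #include <stddef.h>

    #define RC 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

    /* Decode one scalar value or maximal subpart from s (n > 0 bytes).
     * The allowed range of the first trail byte depends on the lead
     * byte, so non-shortest forms, surrogates, and values above 10FFFF
     * are rejected at the earliest possible byte. */
    static uint32_t decode_strict(const uint8_t *s, size_t n, size_t *len) {
        uint8_t b = s[0];
        uint8_t lo = 0x80, hi = 0xBF;   /* range of the first trail byte */
        size_t need, i;
        uint32_t cp;

        if (b <= 0x7F) { *len = 1; return b; }
        if (b >= 0xC2 && b <= 0xDF)      { need = 1; cp = b & 0x1F; }
        else if (b == 0xE0)              { need = 2; cp = b & 0x0F; lo = 0xA0; }
        else if (b >= 0xE1 && b <= 0xEC) { need = 2; cp = b & 0x0F; }
        else if (b == 0xED)              { need = 2; cp = b & 0x0F; hi = 0x9F; }
        else if (b >= 0xEE && b <= 0xEF) { need = 2; cp = b & 0x0F; }
        else if (b == 0xF0)              { need = 3; cp = b & 0x07; lo = 0x90; }
        else if (b >= 0xF1 && b <= 0xF3) { need = 3; cp = b & 0x07; }
        else if (b == 0xF4)              { need = 3; cp = b & 0x07; hi = 0x8F; }
        else { *len = 1; return RC; }   /* 80..C1, F5..FF: single unit */

        for (i = 1; i <= need; i++) {
            if (i >= n || s[i] < lo || s[i] > hi) {
                *len = i;               /* the maximal subpart ends here */
                return RC;
            }
            cp = (cp << 6) | (s[i] & 0x3F);
            lo = 0x80; hi = 0xBF;       /* later trail bytes: full range */
        }
        *len = need + 1;
        return cp;
    }

    /* C0 80    -> U+FFFD, U+FFFD   (C0 is never a valid lead byte)
     * E3 80 20 -> U+FFFD, U+0020   (<E3 80> is one maximal subpart)  */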

Good! This means that this proposal makes sense and is compatible with random
access within the encoded text, without having to look backward over an
indefinite number of code units, and we never have to handle a case where a
possibly unbounded number of code units is mapped to the same single U+FFFD
replacement.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Martin J. Dürst via Unicode

On 2017/05/25 09:22, Markus Scherer wrote:

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote:


On 05/24/2017 12:46 AM, Martin J. Dürst wrote:


That's wrong. There was a public review issue with various options and
with feedback, and the recommendation has been implemented and has been in
wide use (among other places, in major programming languages and browsers)
without problems for quite some time.



Could you supply a reference to the PRI and its feedback?



http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)


It is correct that it didn't give any of the *examples* that are under 
discussion now. On the other hand, the PRI is very clear about what it 
means by "maximal subpart":


Citing directly from the PRI:


The term "maximal subpart of the ill-formed subsequence" refers to the 
longest potentially valid initial subsequence or, if none, then to the 
next single code unit.



At the time of the PRI, so-called "overlongs" were already ill-formed.

That change goes back to 2003 or earlier (RFC 3629 
(https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect 
the tightening of the UTF-8 definition in Unicode/ISO 10646).



The recommendation in TUS 5.2 is "Replace each maximal subpart of an
ill-formed subsequence by a single U+FFFD."



You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.


No, the text, both in the PRI and in Unicode 5.2, is quite clear. The "does 
not fit" (a phrase I haven't found in either text) is clearly grounded in 
"ill-formed UTF-8". And there is no question about what "ill-formed UTF-8" 
means, in particular in Unicode 5.2, where you just have to go two pages back 
to find the relevant byte sequences called out explicitly as ill-formed.


Any kind of claim, as in the L2/17-168 document, about there being an 
"option 2a" is just not substantiated. It's true that there are no explicit 
examples in the PRI that would allow one to distinguish between converting e.g.

FC BF BF BF BF 80

to a single FFFD or to six of them. But there's no need to have examples for 
every corner case if the text is clear enough. In the above six-byte sequence, 
there is not a single potentially valid (initial) subsequence, so it's all 
single code units.
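
To spell that out byte by byte (using the current definition of
well-formed UTF-8, under which FC is not a valid lead byte and BF and 80
are continuation bytes):

    FC  -> U+FFFD   (never a valid lead byte)
    BF  -> U+FFFD   (continuation byte with no valid lead before it)
    BF  -> U+FFFD
    BF  -> U+FFFD
    BF  -> U+FFFD
    80  -> U+FFFD

That is six replacement characters under the PRI #121 / Unicode 5.2
wording, whereas a converter shaped after the original six-byte
definition, which collects FC plus five trail bytes before validating,
would emit a single U+FFFD for the whole sequence -- which is exactly the
difference being argued about.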




And I agree with that.  And I view an overlong sequence as a maximal
ill-formed subsequence


Can you point to any definition that would include or allow such an 
interpretation? I just haven't found any yet, neither in the PRI nor in 
Unicode 5.2.



that should be replaced by a single FFFD. There's
nothing in the text of 5.2 that immediately follows that recommendation
that indicates to me that my view is incorrect.


I have to agree that the text in Unicode 5.2 could be clearer. It's a 
hodgepodge of attempts at justifications and definitions. And the word 
"maximal" itself may also contribute to pushing the interpretation in 
one direction.


But there's plenty in the text that makes it absolutely clear that some 
things cannot be included. In particular, it says



The term “maximal subpart of an ill-formed subsequence” refers to the 
code units that were collected in this manner. They could be the start 
of a well-formed sequence, except that the sequence lacks the proper 
continuation. Alternatively, the converter may have found a continuation 
code unit, which cannot be the start of a well-formed sequence.



And the "in this manner" refers to:

A sequence of code units will be processed up to the point where the 
sequence either can be unambiguously interpreted as a particular Unicode 
code point or where the converter recognizes that the code units 
collected so far constitute an ill-formed subsequence.



So we have the same thing twice: Bail out as soon as something is 
ill-formed.




Perhaps my view is colored by the fact that I now maintain code that was
written to parse UTF-8 back when overlongs were still considered legal
input.


Thanks for providing this information. That's a lot more useful than 
"feels right", which was given as a reason on this list before.




An overlong was a single unit.  When they became illegal, the code
still considered them a single unit.


That's fine for your code. I might do the same (or not) if I were you, 
because one indeed never knows in which situation some code is used, and 
what repercussions a change might produce.


But the PRI, and the wording in Unicode 5.2, was created when overlongs 
and 5-byte and 6-byte sequences and surrogate pairs,... were very 
clearly