Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Markus Scherer via Unicode
On Wed, May 24, 2017 at 3:56 PM, Karl Williamson 
wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among others, in major programming languages and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>

http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>

You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

> And I agree with that.  And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input.  An overlong was a single unit.  When they became illegal, the code
> still considered them a single unit.
>

Right.

> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>

Right.

> But I assert that my interpretation is just as valid as that one.  And
> perhaps more so, because of historical precedent.
>

I agree.

markus


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Karl Williamson via Unicode

On 05/24/2017 12:46 AM, Martin J. Dürst wrote:

> On 2017/05/24 05:57, Karl Williamson via Unicode wrote:
>
>> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>>
>>> Adding a "recommendation" this late in the game is just bad standards
>>> policy.
>>
>> Unless I misunderstand, you are missing the point.  There is already a
>> recommendation listed in TUS,
>
> That's indeed correct.
>
>> and that recommendation appears to have
>> been added without much thought.
>
> That's wrong. There was a public review issue with various options and
> with feedback, and the recommendation has been implemented and in use
> widely (among others, in major programming languages and browsers) without
> problems for quite some time.


Could you supply a reference to the PRI and its feedback?

The recommendation in TUS 5.2 is "Replace each maximal subpart of an 
ill-formed subsequence by a single U+FFFD."


And I agree with that.  And I view an overlong sequence as a maximal 
ill-formed subsequence that should be replaced by a single FFFD. 
There's nothing in the text of 5.2 that immediately follows that 
recommendation that indicates to me that my view is incorrect.


Perhaps my view is colored by the fact that I now maintain code that was 
written to parse UTF-8 back when overlongs were still considered legal 
input.  An overlong was a single unit.  When they became illegal, the 
code still considered them a single unit.


I can understand how someone who comes along later could say C0 can't be 
followed by any continuation character that doesn't yield an overlong, 
therefore C0 is a maximal subsequence.


But I assert that my interpretation is just as valid as that one.  And 
perhaps more so, because of historical precedent.


It appears to me that little thought was given to the fact that these
changes would cause overlongs to now be at least two units instead of
one, making long-existing code no longer be best practice.  You are
effectively saying I'm wrong about this.  I thought I had been paying
attention to PRIs since the 5.x series, and I don't remember anything
about this.  If you have evidence to the contrary, please give it.
However, I would have thought Markus would have dug any up and given it
in his proposal.
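
The two readings discussed above can be compared on concrete bytes. What follows is only a minimal sketch, not code from ICU, Perl, or any other implementation mentioned in this thread. The function `decode` and its `structural` flag are hypothetical names. With `structural=False` it stops at the first byte that cannot continue a well-formed sequence (the "maximal subpart" reading, using the second-byte ranges from the well-formed UTF-8 table in TUS chapter 3). With `structural=True` it assumes, as a simplification of the original UTF-8 design, that any lead byte C0..F7 announces a sequence, and treats the lead plus its continuation bytes as a single unit.

```python
REPLACEMENT = "\ufffd"

def _trail_count(lead, structural):
    """Number of trail bytes a lead byte announces (0 = not usable as a lead)."""
    if structural:
        # Assumption: pre-restriction reading, so C0/C1 and F5..F7 still open a sequence.
        ranges = ((0xC0, 0xDF, 1), (0xE0, 0xEF, 2), (0xF0, 0xF7, 3))
    else:
        # Lead bytes that can start a well-formed sequence today.
        ranges = ((0xC2, 0xDF, 1), (0xE0, 0xEF, 2), (0xF0, 0xF4, 3))
    for lo, hi, count in ranges:
        if lo <= lead <= hi:
            return count
    return 0

def _second_byte_range(lead):
    """Restricted second-byte ranges that rule out overlongs, surrogates, > 10FFFF."""
    return {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
            0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}.get(lead, (0x80, 0xBF))

def decode(data, structural):
    """Decode bytes, emitting one U+FFFD per ill-formed unit under the chosen reading."""
    out, i, n = [], 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        need = _trail_count(lead, structural)
        j = i + 1
        while j < n and j - i <= need:
            if not structural and j == i + 1:
                lo, hi = _second_byte_range(lead)
            else:
                lo, hi = 0x80, 0xBF
            if not lo <= data[j] <= hi:
                break
            j += 1
        chunk = data[i:j]
        if need and len(chunk) == need + 1:
            try:
                # Python's strict decoder rejects overlongs, surrogates, and > 10FFFF.
                out.append(chunk.decode("utf-8"))
            except UnicodeDecodeError:
                out.append(REPLACEMENT)
        else:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

for raw in (b"\xc0\x80", b"\xe0\x80\x80", b"\xe2\x82A"):
    print(raw, len(decode(raw, structural=False)), len(decode(raw, structural=True)))
```

Under these assumptions the overlong `C0 80` becomes two U+FFFD with the maximal-subpart reading and one with the single-unit reading, and `E0 80 80` becomes three versus one. The truncated `E2 82` followed by `A` gives the same result either way, which matches the remark that examples without non-shortest forms do not distinguish the two readings.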






>> There is no proposal to add a
>> recommendation "this late in the game".
>
> True. The proposal isn't for an addition, it's for a change. The "late
> in the game", however, still applies.
>
> Regards,   Martin.






Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Martin J. Dürst via Unicode

On 2017/05/24 05:57, Karl Williamson via Unicode wrote:

> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>
>> Adding a "recommendation" this late in the game is just bad standards
>> policy.
>
> Unless I misunderstand, you are missing the point.  There is already a
> recommendation listed in TUS,

That's indeed correct.

> and that recommendation appears to have
> been added without much thought.

That's wrong. There was a public review issue with various options and
with feedback, and the recommendation has been implemented and in use
widely (among others, in major programming languages and browsers) without
problems for quite some time.




> There is no proposal to add a
> recommendation "this late in the game".

True. The proposal isn't for an addition, it's for a change. The "late
in the game", however, still applies.


Regards,   Martin.