Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among else, in major programming language and browsers) without
>> problems for quite some time.
>
> Could you supply a reference to the PRI and its feedback?

http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."

You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

> And I agree with that. And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input. An overlong was a single unit. When they became illegal, the code
> still considered them a single unit.

Right.

> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.

Right.

> But I assert that my interpretation is just as valid as that one. And
> perhaps more so, because of historical precedent.

I agree.

markus
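[Editorial illustration, not part of the original message.] The two readings of "maximal subpart" discussed above can be compared with a minimal Python sketch. The helper `decode_structural` is hypothetical and written only for this note: it assumes the "overlong is a single unit" reading means that a lead byte, together with as many continuation bytes as its bit pattern calls for, is replaced by one U+FFFD when the unit is ill-formed. It is not the code of any decoder mentioned in the thread. For the other reading, in which C0 alone is already a maximal subpart, the sketch uses CPython's built-in decoder with `errors="replace"`, assuming CPython 3.3 or later.

```python
REPLACEMENT = "\ufffd"


def decode_structural(data: bytes) -> str:
    """Hypothetical decoder: one U+FFFD per structurally delimited unit.

    A lead byte plus the continuation bytes that follow it (up to the count
    its bit pattern calls for) is treated as one unit, so an overlong
    sequence such as C0 80 yields a single U+FFFD.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # ASCII byte: copy through
            out.append(chr(lead))
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:             # 2-byte pattern
            need = 1
        elif 0xE0 <= lead <= 0xEF:           # 3-byte pattern
            need = 2
        elif 0xF0 <= lead <= 0xF7:           # 4-byte pattern
            need = 3
        else:                                # lone continuation byte or F8..FF
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume at most `need` continuation bytes (80..BF) after the lead.
        j = i + 1
        while j < len(data) and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        unit = data[i:j]
        try:
            out.append(unit.decode("utf-8"))  # well-formed: keep the character
        except UnicodeDecodeError:
            out.append(REPLACEMENT)           # ill-formed unit: one U+FFFD
        i = j
    return "".join(out)


def decode_builtin(data: bytes) -> str:
    """CPython's decoder, used here as the 'C0 alone is maximal' reading."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for sample in (b"\xc0\x80", b"a\xe0\x80\x80b", b"a\xf1\x80\x80b"):
        print(sample, len(decode_structural(sample)), len(decode_builtin(sample)))
```

Under these assumptions, the overlong input C0 80 becomes one U+FFFD with `decode_structural` and two with the built-in decoder, while a truncated but otherwise valid prefix such as F1 80 80 becomes one U+FFFD under both. That matches the observation above that examples without non-shortest forms give the same result either way.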
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 05/24/2017 12:46 AM, Martin J. Dürst wrote:

> On 2017/05/24 05:57, Karl Williamson via Unicode wrote:
>> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>>> Adding a "recommendation" this late in the game is just bad standards
>>> policy.
>>
>> Unless I misunderstand, you are missing the point. There is already a
>> recommendation listed in TUS,
>
> That's indeed correct.
>
>> and that recommendation appears to have been added without much thought.
>
> That's wrong. There was a public review issue with various options and
> with feedback, and the recommendation has been implemented and in use
> widely (among else, in major programming language and browsers) without
> problems for quite some time.

Could you supply a reference to the PRI and its feedback?

The recommendation in TUS 5.2 is "Replace each maximal subpart of an
ill-formed subsequence by a single U+FFFD."

And I agree with that. And I view an overlong sequence as a maximal
ill-formed subsequence that should be replaced by a single FFFD. There's
nothing in the text of 5.2 that immediately follows that recommendation
that indicates to me that my view is incorrect.

Perhaps my view is colored by the fact that I now maintain code that was
written to parse UTF-8 back when overlongs were still considered legal
input. An overlong was a single unit. When they became illegal, the code
still considered them a single unit.

I can understand how someone who comes along later could say C0 can't be
followed by any continuation character that doesn't yield an overlong,
therefore C0 is a maximal subsequence.

But I assert that my interpretation is just as valid as that one. And
perhaps more so, because of historical precedent.

It appears to me that little thought was given to the fact that these
changes would cause overlongs to now be at least two units instead of one,
making long existing code no longer be best practice. You are effectively
saying I'm wrong about this. I thought I had been paying attention to PRI's
since the 5.x series, and I don't remember anything about this. If you have
evidence to the contrary, please give it. However, I would have thought
Markus would have dug any up and given it in his proposal.

>> There is no proposal to add a recommendation "this late in the game".
>
> True. The proposal isn't for an addition, it's for a change. The "late
> in the game" however, still applies.
>
> Regards, Martin.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 2017/05/24 05:57, Karl Williamson via Unicode wrote:

> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>> Adding a "recommendation" this late in the game is just bad standards
>> policy.
>
> Unless I misunderstand, you are missing the point. There is already a
> recommendation listed in TUS,

That's indeed correct.

> and that recommendation appears to have been added without much thought.

That's wrong. There was a public review issue with various options and
with feedback, and the recommendation has been implemented and in use
widely (among else, in major programming language and browsers) without
problems for quite some time.

> There is no proposal to add a recommendation "this late in the game".

True. The proposal isn't for an addition, it's for a change. The "late in
the game" however, still applies.

Regards, Martin.