RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
So basically this came about because a bug was filed against code for not following the "recommendation." To fix that, the recommendation will be changed. However, that is then going to lead to bugs for other existing code that does not follow the new recommendation.

I totally get the forward/backward-scanning-in-sync-without-decoding reasoning for some implementations; however, I do not think that the practices that benefit those should extend to other applications that are happy with a different practice. In either case, the bad characters are garbage, so neither approach is "better" - except that one or the other may be more conducive to the requirements of the particular API/application.

I really think the correct approach here is to allow any number of replacement characters without prejudice, perhaps with suggestions for the pros and cons of various approaches if people feel that is really necessary.

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Karl Williamson via Unicode
Sent: Friday, May 26, 2017 2:16 PM
To: Ken Whistler
Cc: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On 05/26/2017 12:22 PM, Ken Whistler wrote:
>
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>> The link provided about the PRI doesn't lead to the comments.
>>
>
> PRI #121 (August, 2008) pre-dated the practice of keeping all the
> feedback comments together with the PRI itself in a numbered directory
> with the name "feedback.html".
> But the comments were collected
> together at the time and are accessible here:
>
> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
>
> Also there was a separately submitted comment document:
>
> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
>
> And the minutes of the pertinent UTC meeting (UTC #116):
>
> http://www.unicode.org/L2/L2008/08253.htm
>
> The minutes simply capture the consensus to adopt Option #2 from PRI
> #121, and the relevant action items.
>
> I now return the floor to the distinguished disputants to continue
> litigating history. ;-)
>
> --Ken

The reason this discussion got started was that in December, someone came to me and said the code I support does not follow Unicode best practices and suggested I need to change it, though no ticket has (yet) been filed. I was surprised, and posted a query to this list about what the advantages of the new approach are. There were a number of replies, but I did not see anything that seemed definitive. After a month, I created a ticket in Unicode, and Markus was assigned to research it and came up with the proposal currently being debated.

Looking at the PRI, it seems to me that treating an overlong as a single maximal unit is in the spirit of the wording, if not the fine print. That seems to be borne out by Markus, even with his stake in ICU, supporting option #2. Looking at the comments, I don't see any discussion of the effect of this on the treatment of overlongs. My guess is that this effect of the change was unintentional.

So I have code that handled overlongs in the only correct way possible when they were acceptable, and in the obvious way after they became illegal, and now, without apparent discussion (which is very much akin to "flimsy reasons"), it suddenly was no longer "best practice". And that change came "rather late in the game". That this escaped notice for years indicates that the specifics of REPLACEMENT CHARACTER handling don't matter all that much.
To cut to the chase, I think Unicode should issue a Corrigendum to the effect that it was never the intent of this change to say that treating overlongs as a single unit isn't best practice. I'm not sure this warrants a full-fledged Corrigendum, though. But I believe the text of the best practices should indicate that treating overlongs as a single unit is just as acceptable as Martin's interpretation. I believe this is pretty much in line with Shawn's position. Certainly, a discussion of the reasons one might choose one interpretation over another should be included in TUS. That would likely have satisfied my original query, which then would never have been posted.
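The two readings under discussion differ only in how many U+FFFD code points an ill-formed sequence turns into. As a minimal sketch (not anyone's production code; the function names are mine), here is a decoder following the current "maximal subpart" wording next to one following the older lead-byte-structure reading described above:

```python
REPLACEMENT = "\ufffd"

def decode_maximal_subpart(data: bytes) -> str:
    """One U+FFFD per maximal subpart, per the current TUS wording:
    stop collecting as soon as the bytes seen so far can no longer
    begin a well-formed sequence (so C0 is rejected on its own)."""
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # ASCII
            out.append(chr(b)); i += 1; continue
        # Valid lead bytes and their constrained first-trail ranges,
        # per the UTF-8 well-formedness table in TUS chapter 3.
        if 0xC2 <= b <= 0xDF:   need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:         need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:         need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF: need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:         need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3: need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:         need, lo, hi = 3, 0x80, 0x8F
        else:                   # C0, C1, F5..FF, or a lone trail byte
            out.append(REPLACEMENT); i += 1; continue
        cp, j, ok = b & (0x3F >> need), i + 1, True
        for k in range(need):
            t_lo, t_hi = (lo, hi) if k == 0 else (0x80, 0xBF)
            if j < n and t_lo <= data[j] <= t_hi:
                cp = (cp << 6) | (data[j] & 0x3F); j += 1
            else:
                ok = False; break          # prefix so far -> one U+FFFD
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j
    return "".join(out)

def decode_legacy(data: bytes) -> str:
    """One U+FFFD per lead byte's whole structural sequence (the
    pre-2003 reading): collect trail bytes according to the lead
    byte's bit pattern, then range-check the decoded value,
    replacing an overlong/surrogate/out-of-range result with a
    single U+FFFD."""
    MINIMUM = (0x80, 0x800, 0x10000, 0x200000, 0x4000000)
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b)); i += 1; continue
        if b < 0xC0 or b > 0xFD:           # lone trail byte, or FE/FF
            out.append(REPLACEMENT); i += 1; continue
        # Sequence length from the lead byte's high bits (RFC 2279 table).
        need = (1 if b < 0xE0 else 2 if b < 0xF0 else
                3 if b < 0xF8 else 4 if b < 0xFC else 5)
        cp, j = b & (0x7F >> (need + 1)), i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F); j += 1
        well_formed = (j - i == need + 1 and cp >= MINIMUM[need - 1]
                       and cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF)
        out.append(chr(cp) if well_formed else REPLACEMENT)
        i = j
    return "".join(out)

# The readings agree on well-formed text and differ only on overlongs,
# surrogates, and former 5-/6-byte sequences:
assert decode_maximal_subpart(b'\xc0\x80') == '\ufffd\ufffd'
assert decode_legacy(b'\xc0\x80') == '\ufffd'
```

The two agree on everything well-formed; an overlong such as C0 80 becomes two replacements under the first reading and one under the second, which is exactly the gap the Corrigendum request above is about.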
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 05/26/2017 04:28 AM, Martin J. Dürst wrote:
> It may be worth thinking about whether the Unicode standard should
> mention implementations like yours. But there should be no doubt about
> the fact that the PRI and Unicode 5.2 (and the current version of
> Unicode) are clear about what they recommend, and that that
> recommendation is based on the definition of UTF-8 at that time (and
> still in force), and not based on a historical definition of UTF-8.

The link provided about the PRI doesn't lead to the comments.

Is there any evidence that there was a realization that the language being adopted would lead to overlongs being split into multiple subparts?
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst wrote:
> But there's plenty in the text that makes it absolutely clear that
> some things cannot be included. In particular, it says
>
> The term "maximal subpart of an ill-formed subsequence" refers to the
> code units that were collected in this manner. They could be the start
> of a well-formed sequence, except that the sequence lacks the proper
> continuation. Alternatively, the converter may have found a
> continuation code unit, which cannot be the start of a well-formed
> sequence.
>
> And the "in this manner" refers to:
>
> A sequence of code units will be processed up to the point where the
> sequence either can be unambiguously interpreted as a particular
> Unicode code point or where the converter recognizes that the code
> units collected so far constitute an ill-formed subsequence.
>
> So we have the same thing twice: Bail out as soon as something is
> ill-formed.

The UTF-8 conversion code that I wrote for ICU, and apparently the code that various other people have written, collects sequences starting from lead bytes, according to the original spec, and at the end looks at whether the assembled code point is too low for the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a non-trail byte is quite natural, and reading the PRI text accordingly is quite natural too.

Aside from UTF-8 history, there is a reason for preferring a more "structural" definition for UTF-8 over one purely along valid sequences. This applies to code that *works* on UTF-8 strings rather than just converting them. For UTF-8 *processing* you need to be able to iterate both forward and backward, and sometimes you need to skip over n units in either direction without collecting code points -- but your iteration needs to be consistent in all cases.
This is easier to implement (especially in fast, short, inline code) if you have to look only at how many trail bytes follow a lead byte, without having to check whether the first trail byte is in a certain range for some specific lead bytes.

(And don't say that everyone can validate all strings once and then all code can assume they are valid: that just does not work for library code; you cannot assume anything about your input strings, and you cannot crash when they are ill-formed.)

markus
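To illustrate that point, a simplified sketch (not ICU's actual inline code; the function names are mine): forward and backward iteration stay in sync if each step simply groups a byte with the trail bytes (0b10xxxxxx) that follow it, with no range checks at all:

```python
def next_start(s: bytes, i: int) -> int:
    """Advance from position i past one 'unit': the byte at i plus any
    immediately following trail bytes (high bits 10)."""
    i += 1
    while i < len(s) and (s[i] & 0xC0) == 0x80:
        i += 1
    return i

def prev_start(s: bytes, i: int) -> int:
    """Back up from position i to the start of the preceding unit, by
    skipping trail bytes until a non-trail byte (or position 0)."""
    i -= 1
    while i > 0 and (s[i] & 0xC0) == 0x80:
        i -= 1
    return i

# Forward and backward passes visit the same unit boundaries:
s = 'a\xe9\u6f22x'.encode('utf-8')   # 1-, 2-, 3-, and 1-byte sequences
fwd, i = [0], 0
while i < len(s):
    i = next_start(s, i)
    fwd.append(i)
bwd, i = [], len(s)
while i > 0:
    i = prev_start(s, i)
    bwd.append(i)
assert bwd == list(reversed(fwd[:-1]))
```

Note that this simplified version groups an arbitrarily long run of stray trail bytes into one step; a real implementation would additionally bound the skip by the lead byte's expected trail count.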
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> > Citing directly from the PRI:
> >
> > The term "maximal subpart of the ill-formed subsequence" refers to
> > the longest potentially valid initial subsequence or, if none, then
> > to the next single code unit.

The way I understand it is that C0 80 will have TWO maximal subparts, because there is no valid initial subsequence, so only the next single code unit (C0) will be considered. After this, the following byte 80 likewise begins no valid initial subsequence, so here again only the next single code unit (80) will be considered. You'll get U+FFFD emitted twice. This covers all the "overlong" sequences that were part of the old UTF-8 definition in the first RFC.

For E2 80 20, there will be only ONE maximal subpart, because E2 80 is a valid initial subsequence (of a three-byte sequence) that 20 cannot continue, so a single U+FFFD replacement will be emitted, followed by the valid UTF-8 sequence (20), which will correctly decode as U+0020.

Good! This means that this proposal makes sense and is compatible with random access within the encoded text, without having to look backward over an indefinite number of code units, and we never have to handle a case where a possibly unbounded number of code units maps to the same U+FFFD replacement.
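This reading can be checked against a decoder that implements the recommendation; recent CPython versions follow the maximal-subpart practice in their "replace" error handler (E2 80 20 is used here as the one-subpart example, since E2 80 is a valid initial subsequence that 0x20 cannot continue):

```python
# C0 80: neither byte can begin a valid sequence -> two U+FFFD.
assert b'\xc0\x80'.decode('utf-8', errors='replace') == '\ufffd\ufffd'

# E2 80 20: E2 80 is a valid initial subsequence that 0x20 cannot
# continue -> one U+FFFD, then the space decodes normally.
assert b'\xe2\x80\x20'.decode('utf-8', errors='replace') == '\ufffd '

# A six-byte sequence from the old definition: no byte begins a valid
# sequence, so six maximal subparts of one byte each.
assert (b'\xfc\xbf\xbf\xbf\xbf\x80'.decode('utf-8', errors='replace')
        == '\ufffd' * 6)
```

Each maximal subpart is replaced by exactly one U+FFFD, so the number of replacements is bounded by the number of ill-formed bytes.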
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 2017/05/25 09:22, Markus Scherer wrote:
> On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote:
>> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>>> That's wrong. There was a public review issue with various options
>>> and with feedback, and the recommendation has been implemented and
>>> in use widely (among others, in major programming languages and
>>> browsers) without problems for quite some time.
>>
>> Could you supply a reference to the PRI and its feedback?
>
> http://www.unicode.org/review/resolved-pri-100.html#pri121
>
> The PRI did not discuss possible different versions of "maximal
> subpart", and the examples there yield the same results either way.
> (No non-shortest forms.)

It is correct that it didn't give any of the *examples* that are under discussion now. On the other hand, the PRI is very clear about what it means by "maximal subpart". Citing directly from the PRI:

    The term "maximal subpart of the ill-formed subsequence" refers to
    the longest potentially valid initial subsequence or, if none, then
    to the next single code unit.

At the time of the PRI, so-called "overlongs" were already ill-formed. That change goes back to 2003 or earlier (RFC 3629, https://tools.ietf.org/html/rfc3629, was published in 2003 to reflect the tightening of the UTF-8 definition in Unicode/ISO 10646).

>> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
>> ill-formed subsequence by a single U+FFFD."
>
> You are right. http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
> shows a slightly expanded example compared with the PRI.
>
> The text simply talked about a "conversion process" stopping as soon
> as it encounters something that does not fit, so these edge cases
> would depend on whether the conversion process treats original-UTF-8
> sequences as single units.

No, the text, both in the PRI and in Unicode 5.2, is quite clear. The "does not fit" (which I haven't found in either text) is clearly grounded by "ill-formed UTF-8".
And there's no question about what "ill-formed UTF-8" means, in particular in Unicode 5.2, where you just have to go two pages back to find the relevant byte sequences all called out explicitly as ill-formed. Any kind of claim, as in the L2/17-168 document, about there being an option 2a is just not substantiated.

It's true that there are no explicit examples in the PRI that would allow one to distinguish between converting e.g. FC BF BF BF BF 80 to a single FFFD or to six of these. But there's no need to have examples for every corner case if the text is clear enough. In the above six-byte sequence, there's not a single potentially valid (initial) subsequence, so it's all single code units.

> And I agree with that. And I view an overlong sequence as a maximal
> ill-formed subsequence

Can you point to any definition that would include or allow such an interpretation? I just haven't found any yet, neither in the PRI nor in Unicode 5.2.

> that should be replaced by a single FFFD. There's nothing in the text
> of 5.2 that immediately follows that recommendation that indicates to
> me that my view is incorrect.

I have to agree that the text in Unicode 5.2 could be clearer. It's a hodgepodge of attempts at justifications and definitions. And the word "maximal" itself may also contribute to pushing the interpretation in one direction. But there's plenty in the text that makes it absolutely clear that some things cannot be included. In particular, it says:

    The term "maximal subpart of an ill-formed subsequence" refers to
    the code units that were collected in this manner. They could be
    the start of a well-formed sequence, except that the sequence lacks
    the proper continuation. Alternatively, the converter may have
    found a continuation code unit, which cannot be the start of a
    well-formed sequence.
And the "in this manner" refers to:

    A sequence of code units will be processed up to the point where
    the sequence either can be unambiguously interpreted as a
    particular Unicode code point or where the converter recognizes
    that the code units collected so far constitute an ill-formed
    subsequence.

So we have the same thing twice: Bail out as soon as something is ill-formed.

> Perhaps my view is colored by the fact that I now maintain code that
> was written to parse UTF-8 back when overlongs were still considered
> legal input.

Thanks for providing this information. That's a lot more useful than "feels right", which was given as a reason on this list before.

> An overlong was a single unit. When they became illegal, the code
> still considered them a single unit.

That's fine for your code. I might do the same (or not) if I were you, because one indeed never knows in which situation some code is used, and what repercussions a change might produce. But the PRI, and the wording in Unicode 5.2, was created when overlongs and 5-byte and 6-byte sequences and surrogate pairs,... were very clearly