Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Richard Wordingham via Unicode
On Wed, 15 May 2019 05:56:54 -0700 Asmus Freytag via Unicode wrote: > On 5/15/2019 4:22 AM, Costello, Roger L. via Unicode wrote: > Hello Unicode experts! > > Which is correct: > > (a) The input file contains a string. The string is encoded using > UTF-8. > >

Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Rebecca T via Unicode
I think that colloquially “the file contains a UTF-8 string” is best, but perhaps not in more formal communications. On Wed, May 15, 2019, 7:24 AM Costello, Roger L. via Unicode < unicode@unicode.org> wrote: > Hello Unicode experts! > > Which is correct: > > (a) The input f

Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Neil Shadrach via Unicode
(e) The input file contains a UTF-8 encoded string. On Wed, 15 May 2019 at 14:22, Andre Schappo via Unicode wrote: > > > > On May 15, 31 Heisei, at 12:22 pm, Costello, Roger L. via Unicode < > unicode@unicode.org> wrote: > > > > Hello Unicode

Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Andre Schappo via Unicode
> On May 15, 31 Heisei, at 12:22 pm, Costello, Roger L. via Unicode > wrote: > > Hello Unicode experts! > > Which is correct: > > (a) The input file contains a string. The string is encoded using UTF-8. > > (b) The input file contains a string. The string

Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Asmus Freytag via Unicode
On 5/15/2019 4:22 AM, Costello, Roger L. via Unicode wrote: Hello Unicode experts! Which is correct: (a) The input file contains a string. The string is encoded using UTF-8. (b) The input file contains a string. The string is encoded with UTF-8. (c) The input

Re: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Aleksey Tulinov via Unicode
the same amount of space whether it is encoded with the UTF-8 or ASCII codes. Conversely, text consisting of CJK ideographs encoded with UTF-8 will require more space than equivalent text encoded with UTF-16." Hope this helps. Wed, 15 May 2019 at 14:24, Costello, Roger L. via Unicode : >

Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Costello, Roger L. via Unicode
Hello Unicode experts! Which is correct: (a) The input file contains a string. The string is encoded using UTF-8. (b) The input file contains a string. The string is encoded with UTF-8. (c) The input file contains a string. The string is encoded in UTF-8. (d) Something else (what?) /Roger

Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Doug Ewell via Unicode
http://www.unicode.org/faq/utf_bom.html#utf8-2 -- Doug Ewell | Thornton, CO, US | ewellic.org

Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread James Tauber via Unicode
Endian-ness only affects ordering of bytes within a code unit. Because UTF-8 has single byte code units, the order is not affected by endian-ness, only the UTF-8 bit mapping itself. Note also that endian-ness only affects individual 16-bit code units in UTF-16. If you have a surrogate pair
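
To make the point above concrete, here is a minimal sketch (my own example, not from the thread): the UTF-8 bytes for é (U+00E9) are C3 A9 on every platform, while the serialization of the single UTF-16 code unit 0x00E9 depends on the machine's byte order.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        /* UTF-8 is a byte sequence; no byte-order question arises. */
        const unsigned char utf8_e_acute[] = {0xC3, 0xA9};      /* U+00E9 */

        /* UTF-16 uses 16-bit code units; their serialization depends on
           the byte order of the machine (or the declared UTF-16BE/LE form). */
        uint16_t utf16_e_acute = 0x00E9;
        unsigned char out[2];
        memcpy(out, &utf16_e_acute, sizeof out);                 /* host order */

        printf("UTF-8  bytes: %02X %02X (identical on every platform)\n",
               utf8_e_acute[0], utf8_e_acute[1]);
        printf("UTF-16 bytes: %02X %02X (depends on this machine's endianness)\n",
               out[0], out[1]);
        return 0;
    }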

Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Clive Hohberger via Unicode
Asmus, I believe it also applies to the bit order in the bytes. I believe UTF-16 and UTF-32 are transmitted as single 16- or 32-bit numbers; UTF-8 is a stream of 8-bit numbers. Clive *Clive P. Hohberger, PhD MBA* Managing Director Clive Hohberger, LLC +1 847 910 8794 cp...@case.edu *Inventor

Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Asmus Freytag via Unicode
-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE and UTF-32LE because each character uses multiple bytes. Clearly endian-ness does not apply to single-byte UTF-8 characters. But what about UTF-8 characters that use multiple bytes, such as the character é, which uses two bytes C3 and A9; does

Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Costello, Roger L. via Unicode
uses multiple bytes. Clearly endian-ness does not apply to single-byte UTF-8 characters. But what about UTF-8 characters that use multiple bytes, such as the character é, which uses two bytes C3 and A9; does endian-ness apply? For example, if a file is in Little Endian would the character é

Re: Interesting UTF-8 decoder

2017-10-09 Thread Mark Davis ☕️ via Unicode
the string ends on a > memory allocation boundary. will have to make sure strings are always > allocated with 3 extra bytes. > > 2017-10-09 1:37 GMT-07:00 Martin J. Dürst via Unicode <unicode@unicode.org > >: > >> A friend of mine sent me a pointer to >> http://nullprogr

Re: Interesting UTF-8 decoder

2017-10-09 Thread J Decker via Unicode
to > http://nullprogram.com/blog/2017/10/06/, a branchless UTF-8 decoder. > > Regards, Martin. >

Interesting UTF-8 decoder

2017-10-09 Thread Martin J. Dürst via Unicode
A friend of mine sent me a pointer to http://nullprogram.com/blog/2017/10/06/, a branchless UTF-8 decoder. Regards, Martin.
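
For readers curious about the technique, here is a small sketch of one branch-free building block such decoders tend to use (my own illustration, not the code from the linked post): the expected sequence length can be looked up from a 32-entry table indexed by the top five bits of the lead byte, with overlong-only leads (C0/C1) and out-of-range leads (F5-F7) still rejected separately.

    #include <stdio.h>

    /* Expected total sequence length, keyed by lead_byte >> 3 (top 5 bits).
       0 marks bytes that can never start a sequence (continuation bytes and
       0xF8-0xFF).  Overlong-only leads C0/C1 and out-of-range leads F5-F7
       still map to a structural length here and must be rejected separately. */
    static const unsigned char utf8_len[32] = {
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,   /* 0x00-0x7F: ASCII            */
        0,0,0,0,0,0,0,0,                   /* 0x80-0xBF: continuation     */
        2,2,2,2,                           /* 0xC0-0xDF: two-byte leads   */
        3,3,                               /* 0xE0-0xEF: three-byte leads */
        4,                                 /* 0xF0-0xF7: four-byte leads  */
        0                                  /* 0xF8-0xFF: never valid      */
    };

    int main(void) {
        const unsigned char leads[] = {0x41, 0xC3, 0xE2, 0xF0, 0x80, 0xFE};
        for (unsigned i = 0; i < sizeof leads; i++)
            printf("lead %02X -> length %u\n", leads[i], utf8_len[leads[i] >> 3]);
        return 0;
    }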

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-09-23 Thread Markus Scherer via Unicode
FYI, I changed the ICU behavior for the upcoming ICU 60 release (pending code review). Proposal & description: https://sourceforge.net/p/icu/mailman/message/35990833/ Code changes: http://bugs.icu-project.org/trac/review/13311 Best regards, markus On Thu, Aug 3, 2017 at 5:34 PM, Mark Davis ☕️

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-05 Thread Martin J. Dürst via Unicode
Hello Mark, On 2017/08/04 09:34, Mark Davis ☕️ wrote: FYI, the UTC retracted the following. Thanks for letting us know! Regards, Martin. *[151-C19 ] Consensus:* Modify the section on "Best Practices for Using FFFD" in section "3.9

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-04 Thread Henri Sivonen via Unicode
On Fri, Aug 4, 2017 at 3:34 AM, Mark Davis ☕️ via Unicode wrote: > FYI, the UTC retracted the following. > > [151-C19] Consensus: Modify the section on "Best Practices for Using FFFD" > in section "3.9 Encoding Forms" of TUS per the recommendation in L2/17-168, > for Unicode

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-03 Thread Mark Davis ☕️ via Unicode
you supply a reference to the PRI and its feedback? > > The recommendation in TUS 5.2 is "Replace each maximal subpart of an > ill-formed subsequence by a single U+FFFD." > > And I agree with that. And I view an overlong sequence as a maximal > ill-formed subsequence that

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
2017-07-25 0:35 GMT+02:00 Doug Ewell via Unicode <unicode@unicode.org>: > J Decker wrote: > > > I generally accepted any utf-8 encoding up to 31 bits though ( since > > I was going from the original spec, and not what was effective limit > > based on unicode codep

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Doug Ewell via Unicode
J Decker wrote: > I generally accepted any utf-8 encoding up to 31 bits though ( since > I was going from the original spec, and not what was effective limit > based on unicode codepoint space) Hey, everybody: Don't do that. UTF-8 has been constrained to the Unicode code space (maximum
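
A minimal sketch of the constraint Doug describes (my own code, assuming the RFC 3629 / current Unicode definition): lead bytes are limited to the ranges below, so no sequence can be longer than four bytes or encode a value above U+10FFFF.

    #include <stdbool.h>
    #include <stdio.h>

    /* Returns true if b may legally appear as the first byte of a UTF-8
       sequence under the current (RFC 3629 / Unicode) definition. */
    static bool is_valid_utf8_lead(unsigned char b) {
        if (b <= 0x7F) return true;               /* 1-byte (ASCII)           */
        if (b >= 0xC2 && b <= 0xDF) return true;  /* 2-byte lead              */
        if (b >= 0xE0 && b <= 0xEF) return true;  /* 3-byte lead              */
        if (b >= 0xF0 && b <= 0xF4) return true;  /* 4-byte lead, <= U+10FFFF */
        return false;   /* 0x80-0xBF continuation, 0xC0/0xC1 overlong-only,
                           0xF5-0xFF would exceed U+10FFFF */
    }

    int main(void) {
        const unsigned char samples[] = {0x41, 0xC1, 0xF4, 0xF5, 0xFD};
        for (unsigned i = 0; i < sizeof samples; i++)
            printf("%02X: %s\n", samples[i],
                   is_valid_utf8_lead(samples[i]) ? "possible lead byte"
                                                  : "never a lead byte");
        return 0;
    }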

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread J Decker via Unicode
: > if ((from[0]&0xC0) == 0x80) from--; > else if ((from[-1]&0xC0) == 0x80) from -=2; > else if ((from[-2]&0xC0) == 0x80) from -=3; > if ((from[0]&0xC0) == 0x80) throw (some exception); > // continue here with character encoded as UTF-8 starting at "

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
>>> The RFC doesn't say 'characters' but either a space or a tab character >> (singular) >> >> back scanning is simple enough >> >> while( ( from[0] & 0xC0 ) == 0x80 ) >> from--; >> > > Certainly not like this! Backscanning should

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
s simple enough > > while( ( from[0] & 0xC0 ) == 0x80 ) > from--; > Certainly not like this! Backscanning should only directly use a single assignment to the last known start position, no loop at all! UTF-8 security is based on the fact that its sequences are strictly limited
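
For orientation, here is a minimal sketch of the bounded back-scan variant discussed in this sub-thread (at most three steps, never an unbounded loop); the names and sample buffer are mine, and Philippe's point above is that remembering the last known start position avoids even this bounded scan.

    #include <stdio.h>

    /* Step back from p (which may point at a continuation byte) to the lead
       byte of the sequence it belongs to.  Well-formed UTF-8 never has more
       than three continuation bytes, so at most three steps are needed; if a
       fourth continuation byte is seen, the input is not well-formed UTF-8.
       Returns NULL on failure or if we would run off the start of the buffer. */
    static const unsigned char *utf8_sequence_start(const unsigned char *begin,
                                                    const unsigned char *p) {
        for (int steps = 0; steps < 3; steps++) {
            if ((*p & 0xC0) != 0x80)     /* not a continuation byte: lead/ASCII */
                return p;
            if (p == begin)              /* continuation byte at buffer start   */
                return NULL;
            p--;
        }
        return ((*p & 0xC0) != 0x80) ? p : NULL;  /* 4th byte must be a lead */
    }

    int main(void) {
        /* "é" (C3 A9) followed by "€" (E2 82 AC), hypothetical sample data. */
        const unsigned char buf[] = {0xC3, 0xA9, 0xE2, 0x82, 0xAC};
        const unsigned char *start = utf8_sequence_start(buf, buf + 4); /* at AC */
        printf("sequence starts at offset %ld\n", (long)(start - buf)); /* -> 2 */
        return 0;
    }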

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread J Decker via Unicode
On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode < unicode@unicode.org> wrote: > Hi Folks, > > 2. (Bug) The sending application performs the folding process - inserts > CRLF plus white space characters - and the receiving application does the > unfolding process but doesn't

RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
e for very simple implementations to generate improperly folded lines in the middle of a UTF-8 multi-octet sequence. For this reason, implementations need to unfold lines in such a way to properly restore the original sequence. Here is an example

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Doug Ewell via Unicode
Costello, Roger L. wrote: > Suppose an application splits a UTF-8 multi-octet sequence. The > application then sends the split sequence to a client. The client must > restore the original sequence. > > Question: is it possible to split a UTF-8 multi-octet sequence in such > a w

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
ntinuation sequence (not clear here what it means given that it refers to UTF-8: should it be "code units", i.e. bytes?) Due to this ambiguity, all implementations will need to interpret it as if they are actually 75 Unicode characters, which could all be up to 4 bytes in UTF-8,

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Philippe Verdy via Unicode
But at the same time that RFC makes a direct reference to UTF-8 as the default charset, so an implementation of the RFC cannot be agnostic to what UTF-8 is and will not break in the middle of a conforming UTF-8 sequence. When the limit is reached, that implementation knows that it cannot

Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Steffen Nurpmeso via Unicode
"Costello, Roger L. via Unicode" <unicode@unicode.org> wrote: |Suppose an application splits a UTF-8 multi-octet sequence. The application \ |then sends the split sequence to a client. The client must restore \ |the original sequence. | |Question: is it possible to split a U

Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hello Unicode Experts! Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence. Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot
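
A minimal sketch of why the usual answer is "no" for a well-behaved fold/unfold pair (assuming, as in iCalendar-style folding, that the split only inserts CRLF plus one space, bytes which can never be confused with the 80..BF continuation bytes of the split sequence); the thread goes on to discuss what happens when implementations deviate from this.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* U+20AC "€" is E2 82 AC in UTF-8; split it after the first byte and
           insert a CRLF + space "fold", as an iCalendar-style line fold would. */
        const unsigned char original[] = {0xE2, 0x82, 0xAC};
        const unsigned char folded[]   = {0xE2, '\r', '\n', ' ', 0x82, 0xAC};

        /* Unfold: drop CRLF + space; everything else is copied through.  The
           inserted bytes are all below 0x80, so they cannot be mistaken for
           the continuation bytes (80..BF) of the split sequence. */
        unsigned char unfolded[sizeof original];
        size_t n = 0;
        for (size_t i = 0; i < sizeof folded; i++) {
            if (folded[i] == '\r' && i + 2 < sizeof folded &&
                folded[i + 1] == '\n' && folded[i + 2] == ' ') {
                i += 2;          /* skip CRLF + space */
                continue;
            }
            unfolded[n++] = folded[i];
        }
        printf("restored original: %s\n",
               (n == sizeof original && memcmp(unfolded, original, n) == 0)
                   ? "yes" : "no");
        return 0;
    }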

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-29 Thread Henri Sivonen via Unicode
submission has been made: http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf > Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU > ticket via http://bugs.icu-project.org/trac/newticket Although they use ICU for most legacy encodings, they don't use ICU for UTF-8.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-03 Thread Markus Scherer via Unicode
months in which to comment. > > What should I read to learn how to formulate an appeal correctly? > I suggest you submit a write-up via http://www.unicode.org/reporting.html and make the case there that you think the UTC should retract http://www.unicode.org/L2/L2017/17103.htm#151-C19 *B

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-02 Thread Alastair Houghton via Unicode
gy for replacement of > invalid code sequences", clearly ought to be added). It already says (p.127, section 3.9): Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is n

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag (c) via Unicode
: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote: I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance. When what this really needs i

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 12:32:08 +0300 Henri Sivonen via Unicode wrote: > On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode > wrote: > > On Wed, 31 May 2017 15:12:12 +0300 > > Henri Sivonen via Unicode wrote: > >> I am

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
@unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote: I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance. W

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode
stair Houghton [mailto:alast...@alastairs-place.net] Sent: Thursday, June 1, 2017 4:05 AM To: Henri Sivonen <hsivo...@hsivonen.fi> Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele <shawn.ste...@microsoft.com> Subject: Re: Feedback on the proposal to change U+F

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
soft.com> Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote: > > On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode > <unicode@unicode.org>

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode
On 6/1/2017 2:32 AM, Henri Sivonen via Unicode wrote: On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode wrote: Henri Sivonen wrote: If anything, I hope this thread results in the establishment of a requirement for proposals to come with proper research about

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote: > > On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode > wrote: >> * As far as I can tell, there are two (maybe three) sane approaches to this >> problem: >>* Either a "maximal"

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Henri Sivonen via Unicode
backward than the old guidance required.) >> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode >> <unicode@unicode.org> wrote: >> > The UTF-8 conversion code that I wrote for ICU, and apparently the >> > code that various other people have written,

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
>> us don’t think *should* >> be considered best practice. > >> Perhaps “best practice” should simply be altered to say that you *clearly >> document* your behavior >> in the case of invalid UTF-8 sequences, and that code should not rely on the >> number of U+FFFDs

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 31 May 2017, at 20:24, Shawn Steele via Unicode <unicode@unicode.org> wrote: > > > For implementations that emit FFFD while handling text conversion and > > repair (ie, converting ill-formed > > UTF-8 to well-formed), it is best for interoperability if they ge

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 19:24:04 + Shawn Steele via Unicode wrote: > It seems to me that if a data stream of ambiguous > quality is to be used in another application with predictable results, then that > stream should be “repaired” prior to being handed over. Then both >

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
be altered to say that you *clearly > document* your behavior > in the case of invalid UTF-8 sequences, and that code should not rely on the > number of U+FFFDs > generated, rather than suggesting a behaviour? That's what I've been suggesting. I think we could maybe go a little fu

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Doug Ewell via Unicode
Henri Sivonen wrote: > If anything, I hope this thread results in the establishment of a > requirement for proposals to come with proper research about what > multiple prominent implementations do about the subject matter of a > proposal concerning changes to text about implementation behavior.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> it’s more meaningful for whoever sees the output to see a single U+FFFD > representing > the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid > lead byte and > then another for an “unexpected” trailing byte. I disagree. It may be more meaningful for some applications

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> For implementations that emit FFFD while handling text conversion and repair > (ie, converting ill-formed > UTF-8 to well-formed), it is best for interoperability if they get the same > results, so that indices within the > resulting strings are consistent across implemen

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Mark Davis ☕️ via Unicode
ing length in 'characters'. Having two different 'character' counts for the same string is inviting trouble. For implementations that emit FFFD while handling text conversion and repair (ie, converting ill-formed UTF-8 to well-formed), it is best for interoperability if they get the same results,

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode
asically, the new proposal is that we should decode bytes that structurally match UTF-8, and if the encoding is then illegal (because it’s over-long, because it’s a surrogate or because it’s over U+10FFFF) then the entire thing is replaced with U+FFFD. If, on the other hand, we get a sequence tha

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode
Everybody knows >> what it means, but everybody knows they don't exist. > > Yes, this is trying to improve the language for a scenario that CANNOT > HAPPEN. We're trying to optimize a case for data that implementations should > never encounter. It is sort of exactly like

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
ll). The other scenarios seem just as likely, (or, IMO, much more likely) than a badly designed encoder creating overlong sequences that appear to fit the UTF-8 pattern but aren't actually UTF-8. The other cases are going to cause byte patterns that are less "obvious" about how

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
ising from two UTF-8 decoders within one product > handling UTF-8 errors differently. > Does it matter if a proposal/appeal is submitted as a non-member > implementor person, as an individual person member or as a liaison > member? http://www.unicode.org/consortium/liaison-members.ht

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Henri Sivonen via Unicode
tested. I've written up my findings at https://hsivonen.fi/broken-utf-8/ The write-up mentions https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd like to draw everyone's attention to that bug, which is real-world evidence of a bug arising from two UTF-8 decoders within one product

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Fri, 26 May 2017 21:41:49 + Shawn Steele via Unicode wrote: > I totally get the forward/backward scanning in sync without decoding > reasoning for some implementations, however I do not think that the > practices that benefit those should extend to other applications

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode
n's test page http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, test 5.1.5. Firefox generates three replacement characters. Richard.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode
On Fri, 26 May 2017 11:22:37 -0700 Ken Whistler via Unicode wrote: > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: > > The link provided about the PRI doesn't lead to the comments. > > > > PRI #121 (August, 2008) pre-dated the practice of keeping all the >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
Original message From: Karl Williamson <pub...@khwilliamson.com> Date: 5/30/17 16:32 (GMT-07:00) To: Doug Ewell <d...@ewellic.org>, Unicode Mailing List <unicode@unicode.org> Subject: Re: Feedback on the proposal to change U+FFFD generation when   decoding ill-f

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence > as U+002F. Sort of, maybe. It was not legal for them to generate it, though. So you could kind of infer that it was not a legal sequence. -Shawn
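
For readers who have not met overlongs: the classic example is U+002F "/" encoded in two bytes as C0 AF instead of the required shortest form 2F. A small sketch (mine, not from the thread) of why C0 AF carries the same value when the shortest-form rule is ignored:

    #include <stdio.h>

    int main(void) {
        /* Structurally, C0 AF looks like a 2-byte sequence:
           110xxxxx 10xxxxxx  ->  payload = (C0 & 0x1F) << 6 | (AF & 0x3F) */
        unsigned lead = 0xC0, trail = 0xAF;
        unsigned value = ((lead & 0x1Fu) << 6) | (trail & 0x3Fu);
        printf("C0 AF carries the value U+%04X ('/')\n", value);   /* U+002F */

        /* The shortest form of U+002F is the single byte 2F, which is why
           current UTF-8 (Unicode 3.1 and later, RFC 3629) rejects C0 AF
           outright: C0 and C1 can only ever introduce overlong sequences. */
        return 0;
    }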

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
Under Best Practices, how many REPLACEMENT CHARACTERs should the sequence generate? 0, 1, 2, 3, 4 ? In practice, how many do parsers generate?

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote: L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence becau

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0 was originally a lead byte for two-byte sequences."

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
ANNOT HAPPEN. We're trying to optimize a case for data that implementations should never encounter. It is sort of exactly like optimizing for the case where your data input is actually a dragon and not UTF-8 text. Since it is illegal, then the "at least 1 FFFD but as many as you want to emit (or just fail)" is fine. -Shawn

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> I think nobody is debating that this is *one way* to do things, and that some > code does it. Except that they sort of are. The premise is that the "old language was wrong", and the "new language is right." The reason we know the old language was wrong was that there was a bug filed

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode
at the effect change was unintentional. I agree that it was probably not considered explicitly. But overlongs were disallowed for security reasons, and once the definition of UTF-8 was tightened, "overlongs" essentially did not exist anymore. Essentially, "overlong" is a word li

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode
collected so far constitute an ill-formed subsequence. So we have the same thing twice: Bail out as soon as something is ill-formed. The UTF-8 conversion code that I wrote for ICU, and apparently the code that various other people have written, collects sequences starting from lead bytes, according
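
A rough sketch of the "collect by lead byte" grouping described above (my own illustration, not ICU's actual code), contrasted in the comments with the current maximal-subpart best practice:

    #include <stdio.h>

    /* Structural length implied by a lead byte under the *original*
       (pre-RFC 3629) definition of UTF-8.  This is the kind of grouping the
       ICU approach described above uses; it is NOT the current
       well-formedness test. */
    static int structural_len(unsigned char b) {
        if (b <= 0x7F) return 1;
        if (b >= 0xC0 && b <= 0xDF) return 2;
        if (b >= 0xE0 && b <= 0xEF) return 3;
        if (b >= 0xF0 && b <= 0xF7) return 4;
        if (b >= 0xF8 && b <= 0xFB) return 5;
        if (b >= 0xFC && b <= 0xFD) return 6;
        return 0;   /* continuation byte or FE/FF: not a lead */
    }

    int main(void) {
        /* Hypothetical ill-formed input: C0 80 (an overlong form of U+0000). */
        const unsigned char in[] = {0xC0, 0x80};
        size_t i = 0, units = 0;
        while (i < sizeof in) {
            int len = structural_len(in[i]);
            if (len == 0) len = 1;               /* stray continuation byte  */
            size_t j = i + 1;
            while (j < sizeof in && j < i + (size_t)len && (in[j] & 0xC0) == 0x80)
                j++;                             /* absorb continuation bytes */
            units++;             /* roughly: one U+FFFD per collected run     */
            i = j;
        }
        printf("structural grouping: %zu U+FFFD\n", units);   /* -> 1 */
        /* Under the current "maximal subpart" best practice, C0 can never
           begin a well-formed sequence, so C0 and 80 would each get their
           own U+FFFD: 2 in total. */
        return 0;
    }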

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Shawn Steele via Unicode
n the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 05/26/2017 12:22 PM, Ken Whistler wrote: > > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >> The link provided about the PRI doesn't lead to the comments. >> > > PRI #121 (A

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
On 05/26/2017 12:22 PM, Ken Whistler wrote: On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Ken Whistler via Unicode
On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered directory with the name "feedback.html".

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
they recommend, and that that recommendation is based on the definition of UTF-8 at that time (and still in force), and not based on a historical definition of UTF-8. The link provided about the PRI doesn't lead to the comments. Is there any evidence that there was a realization that the language being

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Markus Scherer via Unicode
the converter recognizes that the code units collected > so far constitute an ill-formed subsequence. > >>>> > > So we have the same thing twice: Bail out as soon as something is > ill-formed. The UTF-8 conversion code that I wrote for ICU, and apparently the code that

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Philippe Verdy via Unicode
l be considered. You'll get U+FFFD replacements emitted twice. This treats all cases of "overlong" sequences that were in the old UTF-8 definition in the first RFC. For C3 80 20, there will be only ONE maximal subpart because C3 80 is a valid initial subsequence, so a single U+FFFD rep

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Martin J. Dürst via Unicode
verlongs" were already ill-formed. That change goes back to 2003 or earlier (RFC 3629 (https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect the tightening of the UTF-8 definition in Unicode/ISO 10646). The recommendation in TUS 5.2 is "Replace each maximal subpart

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Markus Scherer via Unicode
.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly expanded example compared with the PRI. The text simply talked about a "conversion process" stopping as soon as it encounters something that does not fit, so these edge cases would depend on whether the conversion proce

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Karl Williamson via Unicode
the fact that I now maintain code that was written to parse UTF-8 back when overlongs were still considered legal input. An overlong was a single unit. When they became illegal, the code still considered them a single unit. I can understand how someone who comes along later could say C0 can't

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Martin J. Dürst via Unicode
On 2017/05/24 05:57, Karl Williamson via Unicode wrote: On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: Adding a "recommendation" this late in the game is just bad standards policy. Unless I misunderstand, you are missing the point. There is already a recommendation listed in

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Karl Williamson via Unicode
"feels right" and not a "deviate at your peril" situation (or necessary escape hatch), then we are better off not making a RECOMMEDATION that goes against collective practice. I think the standard is quite clear about this: Although a UTF-8 conversion proc

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Doug Ewell via Unicode
Asmus Freytag (c) wrote: > And why add a recommendation that changes that from completely up to > the implementation (or groups of implementations) to something where > one way of doing it now has to justify itself? A recommendation already exists, at the end of Section 3.9. The current

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
> If the thread has made one thing clear is that there's no consensus in the > wider community > that one approach is obviously better. When it comes to ill-formed sequences, > all bets are off. > Simple as that. > Adding a "recommendation" this late in the game is just bad standards policy. I

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Asmus Freytag (c) via Unicode
il" situation (or necessary escape hatch), then we are better off not making a RECOMMEDATION that goes against collective practice. I think the standard is quite clear about this: Although a UTF-8 conversion process is required to never consume well-formed subsequenc

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Alastair Houghton via Unicode
" and not a >> "deviate at your peril" situation (or necessary escape hatch), then we are >> better off not making a RECOMMEDATION that goes against collective practice. > > I think the standard is quite clear about this: > > Although a UTF-8 conversion process

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Markus Scherer via Unicode
ot making a RECOMMENDATION that goes against collective > practice. > I think the standard is quite clear about this: Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not other

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
+ the list, which somehow my reply seems to have lost. > I may have missed something, but I think nobody actually proposed to change > the recommendations into requirements No thanks, that would be a breaking change for some implementations (like mine) and force them to become non-complying or

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Asmus Freytag via Unicode
On 5/23/2017 1:24 AM, Martin J. Dürst via Unicode wrote: Hello Mark, On 2017/05/22 01:37, Mark Davis ☕️ via Unicode wrote: I actually didn't see any of this discussion until today. Many thanks for

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Alastair Houghton via Unicode
e decision complicates U+FFFD generation when validating UTF-8 by state >>> machine. >>> >> It *really* doesn’t. Even if you’re hell bent on using a pure state machine >> approach, you need to add maybe two additional error states >> (two-trailing-bytes-to-e

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Martin J. Dürst via Unicode
Hello Mark, On 2017/05/22 01:37, Mark Davis ☕️ via Unicode wrote: I actually didn't see any of this discussion until today. Many thanks for chiming in. ( unicode@unicode.org mail was going into my spam folder...) I started reading the thread, but it looks like a lot of it is OT, As is

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Jonathan Coxhead via Unicode
On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote: On 18 May 2017, at 07:18, Henri Sivonen via Unicode <unicode@unicode.org> wrote: the decision complicates U+FFFD generation when validating UTF-8 by state machine. It *really* doesn’t. Even if you’re hell bent on using a pure

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-21 Thread Mark Davis ☕️ via Unicode
I actually didn't see any of this discussion until today. ( unicode@unicode.org mail was going into my spam folder...) I started reading the thread, but it looks like a lot of it is OT, so just scanned some of them. A few brief points: 1. There is plenty of time for public comment, since it

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Richard Wordingham via Unicode
On Thu, 18 May 2017 09:58:43 +0100 Alastair Houghton via Unicode <unicode@unicode.org> wrote: > On 18 May 2017, at 07:18, Henri Sivonen via Unicode > <unicode@unicode.org> wrote: > > > > the decision complicates U+FFFD generation when validating UTF-8 by >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode
On 18 May 2017, at 07:18, Henri Sivonen via Unicode <unicode@unicode.org> wrote: > > the decision complicates U+FFFD generation when validating UTF-8 by state > machine. It *really* doesn’t. Even if you’re hell bent on using a pure state machine approach, you need to add maybe
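
For context, here is a generic sketch of what validating UTF-8 by state machine looks like (my own skeleton, tracking only the number of continuation bytes still expected and the allowed range of the next byte; it implements neither the extra error states Alastair mentions nor any particular U+FFFD-counting policy):

    #include <stdbool.h>
    #include <stdio.h>

    /* Validate a byte sequence as well-formed UTF-8 (Unicode Table 3-7). */
    static bool is_well_formed_utf8(const unsigned char *s, size_t n) {
        size_t remaining = 0;                /* continuation bytes still expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the next byte   */

        for (size_t i = 0; i < n; i++) {
            unsigned char b = s[i];
            if (remaining == 0) {
                if (b <= 0x7F) continue;
                else if (b >= 0xC2 && b <= 0xDF) { remaining = 1; lo = 0x80; hi = 0xBF; }
                else if (b == 0xE0)              { remaining = 2; lo = 0xA0; hi = 0xBF; }
                else if (b >= 0xE1 && b <= 0xEC) { remaining = 2; lo = 0x80; hi = 0xBF; }
                else if (b == 0xED)              { remaining = 2; lo = 0x80; hi = 0x9F; }
                else if (b >= 0xEE && b <= 0xEF) { remaining = 2; lo = 0x80; hi = 0xBF; }
                else if (b == 0xF0)              { remaining = 3; lo = 0x90; hi = 0xBF; }
                else if (b >= 0xF1 && b <= 0xF3) { remaining = 3; lo = 0x80; hi = 0xBF; }
                else if (b == 0xF4)              { remaining = 3; lo = 0x80; hi = 0x8F; }
                else return false;           /* 0x80-0xBF, 0xC0, 0xC1, 0xF5-0xFF */
            } else {
                if (b < lo || b > hi) return false;
                remaining--;
                lo = 0x80; hi = 0xBF;        /* later continuation bytes: full range */
            }
        }
        return remaining == 0;               /* reject a truncated final sequence */
    }

    int main(void) {
        const unsigned char ok[]  = {0xE2, 0x82, 0xAC};   /* U+20AC, well-formed   */
        const unsigned char bad[] = {0xED, 0xA0, 0x80};   /* surrogate, ill-formed */
        printf("%d %d\n", is_well_formed_utf8(ok, 3), is_well_formed_utf8(bad, 3));
        return 0;
    }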

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Hans Åberg via Unicode
<unicode@unicode.org> wrote: >> ... >>> I think Unicode should not adopt the proposed change. >> >> It would be useful, for use with filesystems, to have Unicode >> codepoint markers that indicate how UTF-8, including non-valid >> sequences, is tra

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode
>> of the **shortest** sequences, but now wants to treat **maximal >> sequences** as a single unit with arbitrary length. UTF-8 was >> designed to work only with some state machines that would NEVER need >> to parse more than 4 bytes. > > If you look at the sample code in >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode
On 18 May 2017, at 01:04, Philippe Verdy via Unicode <unicode@unicode.org> wrote: > > I find intriguing that the update intends to enforce the decoding of the > **shortest** sequences, but now wants to treat **maximal sequences** as a > single unit with arbitrary length.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Henri Sivonen via Unicode
least compared to other standards > organizations. The PRI process addresses that issue to some extent. What action should I take to make proposals to be considered by the UTC? I'd like to make two: 1) Substantive: Reverse the decision to modify U+FFFD best practice when decoding UTF-8.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Thu, 18 May 2017 02:04:55 +0200 Philippe Verdy via Unicode <unicode@unicode.org> wrote: > I find intriguing that the update intends to enforce the decoding > of the **shortest** sequences, but now wants to treat **maximal > sequences** as a single unit with arbitrar

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Richard Wordingham wrote: I'm afraid I don't get the analogy. You can't build a full Unicode system out of Unicode-compliant parts. Others will have to address Richard's point about canonical-equivalent sequences. However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Philippe Verdy via Unicode
I find intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with some state machines that would NEVER need to parse more than 4 bytes. For me

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Asmus Freytag via Unicode
On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: There's some sort of rule that proposals should be made seven days in advance of the meeting. I can't find it now, so I'm not sure whether the actual rule was followed, let alone what authority it has.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Wed, 17 May 2017 15:31:56 -0700 Doug Ewell via Unicode <unicode@unicode.org> wrote: > Richard Wordingham wrote: > > > So it was still a legal way for a non-UTF-8-compliant process! > > Anything is possible if you are non-compliant. You can encode U+263A > w
