Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 17:23, Hans Åberg wrote:
>
> HFS implements case insensitivity in a layer above the filesystem raw
> functions. So it is perfectly possible to have files that differ by case only
> in the same directory by using low level function calls. The Tenon MachTen
> did that on Mac OS 9 already.

You keep insisting on this, but it’s not true; I’m a disk utility developer, and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory data (a single one for the entire disk, not one per directory either), and that that tree is sorted by (CNID, filename) pairs. And since it’s case-preserving *and* case-insensitive, the comparisons it does to order its B+-Tree nodes *cannot* be raw. I should know - I’ve actually written the code for it!

Even legacy HFS, which didn’t store UTF-16 but rather a specified Mac legacy encoding (the encoding used is in the volume header), is case-insensitive, so the encoding matters.

I don’t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know how the filesystem works.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 18:13, Alastair Houghton wrote:
>
>> On 16 May 2017, at 17:07, Hans Åberg wrote:
>>
>>>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation
>>>>> on UCS-2/UTF-16. ...
>>>>
>>>> The filesystem directory is using octet sequences and does not bother
>>>> passing over an encoding, I am told. Someone could remember one that used
>>>> UTF-16 directly, but I think it may not be current.
>>>
>>> No, that’s not true. All three of those systems store UTF-16 on the disk
>>> (give or take).
>>
>> I am not speaking about what they store, but how the filesystem identifies
>> files.
>
> Well, quite clearly none of those systems treat the UTF-16 strings as binary
> either - they’re case insensitive, so how could they? HFS+ even normalises
> strings using a variant of a frozen version of the normalisation spec.

HFS implements case insensitivity in a layer above the filesystem raw functions. So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 17:07, Hans Åberg wrote:
>
>>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
>>>> UCS-2/UTF-16. ...
>>>
>>> The filesystem directory is using octet sequences and does not bother
>>> passing over an encoding, I am told. Someone could remember one that used
>>> UTF-16 directly, but I think it may not be current.
>>
>> No, that’s not true. All three of those systems store UTF-16 on the disk
>> (give or take).
>
> I am not speaking about what they store, but how the filesystem identifies
> files.

Well, quite clearly none of those systems treat the UTF-16 strings as binary either - they’re case insensitive, so how could they? HFS+ even normalises strings using a variant of a frozen version of the normalisation spec.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 17:52, Alastair Houghton wrote:
>
> On 16 May 2017, at 16:44, Hans Åberg wrote:
>>
>> On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote:
>>>
>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
>>> UCS-2/UTF-16. ...
>>
>> The filesystem directory is using octet sequences and does not bother
>> passing over an encoding, I am told. Someone could remember one that used
>> UTF-16 directly, but I think it may not be current.
>
> No, that’s not true. All three of those systems store UTF-16 on the disk
> (give or take).

I am not speaking about what they store, but how the filesystem identifies files.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 16:44, Hans Åberg wrote:
>
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote:
>>
>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
>> UCS-2/UTF-16. ...
>
> The filesystem directory is using octet sequences and does not bother passing
> over an encoding, I am told. Someone could remember one that used UTF-16
> directly, but I think it may not be current.

No, that’s not true. All three of those systems store UTF-16 on the disk (give or take). On Windows, the “ANSI” APIs convert the filenames to or from the appropriate Windows code page, while the “Wide” API works in UTF-16, which is the native encoding for VFAT long filenames and NTFS filenames.

And, as I said, on Mac OS X and iOS, the kernel expects filenames to be encoded as UTF-8 at the BSD API, regardless of what encoding you might be using in your Terminal (this is different to traditional UNIX behaviour, where how you interpret your filenames is entirely up to you - usually you’d use the same encoding you were using on your tty).

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote:
>
> On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote:
>>
>> You don't. You have a filename, which is an octet sequence of unknown
>> encoding, and want to deal with it. Therefore, valid Unicode transformations
>> of the filename may result in it not being reachable.
>>
>> It only matters that the correct octet sequence is handed back to the
>> filesystem. All current filesystems, as far as experts could recall, use
>> octet sequences at the lowest level; whatever encoding is used is built in a
>> layer above.
>
> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on
> UCS-2/UTF-16. ...

The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that used UTF-16 directly, but I think it may not be current.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote:
>
> You don't. You have a filename, which is an octet sequence of unknown
> encoding, and want to deal with it. Therefore, valid Unicode transformations
> of the filename may result in it not being reachable.
>
> It only matters that the correct octet sequence is handed back to the
> filesystem. All current filesystems, as far as experts could recall, use
> octet sequences at the lowest level; whatever encoding is used is built in a
> layer above.

HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. FAT 8.3 names are also encoded, but the encoding isn’t specified (more specifically, MS-DOS and Windows assume an encoding based on your locale, which could cause all kinds of fun if you swapped disks with someone from a different country, and IIRC there are some shenanigans for Japan because of the use of 0xe5 as a deleted file marker). There are some less widely used filesystems that require a particular encoding too (BeOS’ BFS used UTF-8, for instance).

Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use whose names can’t be converted to UTF-8, the Darwin kernel uses a percent encoding scheme(!) It looks like Apple has changed its mind for APFS and is going with the “bag of bytes” approach that’s typical of other systems; at least, that’s what it appears to have done on iOS.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 15:23 GMT+02:00 Hans Åberg:

> All current filesystems, as far as experts could recall, use octet
> sequences at the lowest level; whatever encoding is used is built in a
> layer above.

Not NTFS (on Windows), which uses sequences of 16-bit units. The same goes for FAT32/exFAT "Long File Names" (the legacy 8.3 short filenames use legacy 8-bit codepages, but those are alternate filenames used when long filenames are not found, working mostly like aliasing physical links on Unix filesystems, as if they were separate directory entries, except that they are hidden by default when their matching LFN is already shown).
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 15:00, Philippe Verdy wrote:
>
> 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode:
>
>>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote:
>>> ...
>>> I think Unicode should not adopt the proposed change.
>>
>> It would be useful, for use with filesystems, to have Unicode codepoint
>> markers that indicate how UTF-8, including non-valid sequences, is
>> translated into UTF-32 in a way that the original octet sequence can be
>> restored.
>
> Why just UTF-32?

Synonym for codepoint numbers. It would suffice to add markers for how it is translated. For example, codepoints meaning "overlong of length <n>", "byte", or whatever is useful.

> How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid
> UTF-8/UTF-16/UTF-32?

You don't. You have a filename, which is an octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in it not being reachable.

It only matters that the correct octet sequence is handed back to the filesystem. All current filesystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 14:44:44 +0200
Hans Åberg via Unicode wrote:

>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote:
>> ...
>> I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original octet
> sequence can be restored.

Escape sequences for the inappropriate bytes are the natural technique. Your problem is smoothly transitioning so that the escape character is always escaped when it means itself. Strictly, it can't be done. Of course, some sequences of escaped characters should be prohibited. Checking could be fiddly.

Richard.
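An existing instance of the escaping technique Richard describes is Python 3's "surrogateescape" error handler (PEP 383), which maps each un-decodable byte in 0x80..0xFF to a lone surrogate in U+DC80..U+DCFF so that the original octet sequence can be restored on re-encoding; os.fsencode()/os.fsdecode() wrap this convention for POSIX filenames. A minimal sketch for illustration:

    # Round-trip arbitrary (possibly ill-formed UTF-8) filename bytes
    # through a Unicode string using PEP 383's surrogateescape handler.
    raw = b"caf\xe9.txt"   # Latin-1 bytes; ill-formed as UTF-8

    name = raw.decode("utf-8", errors="surrogateescape")
    print(ascii(name))     # 'caf\udce9.txt' - 0xE9 became lone surrogate U+DCE9

    back = name.encode("utf-8", errors="surrogateescape")
    assert back == raw     # the original octet sequence is restored

Note that, exactly as Richard says, the scheme is only safe if such lone surrogates can never arrive by any other path; strings containing them are not valid Unicode and cannot be encoded with errors="strict".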
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 20:08:52 +0900
"Martin J. Dürst via Unicode" wrote:

> I agree with others that ICU should not be considered to have a
> special status, it should be just one implementation among others.

> [The next point is a side issue, please don't spend too much time on
> it.] I find it particularly strange that at a time when UTF-8 is
> firmly defined as up to 4 bytes, never including any bytes above
> 0xF4, the Unicode consortium would want to consider recommending that
> <FD 81 82 83 84 85> be converted to a single U+FFFD. I note with
> agreement that Markus seems to have thoughts in the same direction,
> because the proposal (17168-utf-8-recommend.pdf) says "(I suppose
> that lead bytes above F4 could be somewhat debatable.)".

The undesirable sidetrack, I suppose, is worrying about how many planes will be required for emoji. However, it does make for the point that, while some practices may be better than others, there isn't necessarily a best practice.

The English of the proposal is unclear - the text would benefit from showing some maximal subsequences (poor terminology - some of us are used to non-contiguous subsequences). When he writes, "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF", I am pretty sure he means "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, with the only restriction on trailing bytes beyond the number of them being that they must be in the range 80..BF".

Thus Philippe's example of "E0 E0 C3 89" would be converted, with an error flagged, to the sequence of scalar values FFFD FFFD C9. This may make a UTF-8 system usable if it tries to use something like non-characters as understood before CLDR was caught publishing them as an essential part of text files.

Richard.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode:

>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote:
>> ...
>> I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, including non-valid sequences, is
> translated into UTF-32 in a way that the original octet sequence can be
> restored.

Why just UTF-32? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid UTF-8/UTF-16/UTF-32?

In all cases this would require extensions to the three standards (which MUST be interoperable); then you'll choke on new validation rules for these three standards for these extensions, and on new ill-formed sequences that you won't be able to convert interoperably. Given the most restrictive condition in UTF-16 (which is still the most widely used internal representation), such extensions would be very complex to manage.

There's no solution: such extensions in any one of them are then undesirable and can only be used privately (without interoperating with the other two representations), so it's impossible to make sure the original octet sequences can be restored. Any deviation from UTF-8/16/32 will be bounded to the same UTF. It cannot be part of the three standard UTFs, but may be part of a distinct encoding, not fully compatible with the three standards.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote:
> ...
> I think Unicode should not adopt the proposed change.

It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated into UTF-32 in a way that the original octet sequence can be restored.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode:

>> One additional note: the standard codifies this behaviour as a
>> *recommendation*, not a requirement.
>
> This is an odd argument in favor of changing it. If the argument is
> that it's just a recommendation that you don't need to adhere to,
> surely then the people who don't like the current recommendation
> should choose not to adhere to it instead of advocating changing it.

I also agree. The internet is full of RFC specifications that are also "best practices", and even in this case, changing them must be extensively documented, including discussing new compatibility/interoperability problems and new security risks.

The case of random access in substrings is significant, because what was once valid UTF-8 could become invalid if the best recommendation is not followed, and then could cause unexpected failures - uncaught exceptions causing software to suddenly fail and become subject to possible attacks due to this new failure. (This is mostly a problem for implementations that do not use "safe" U+FFFD replacements but throw exceptions on ill-formed input: we should not change the cases where these exceptions may occur by adding new cases caused by a change of implementation based on a change of best practice.)

The consideration of trying to reduce the number of U+FFFDs is not relevant, and purely aesthetic, because some people would like to compact the decoded result in memory. What is really important is to not silently ignore these ill-formed sequences, and to properly track that there was some data loss. The number of U+FFFDs inserted (only one, or as many as there are invalid code units in the input before the first resynchronization point) is not so important.

As well, whether implementations use an accumulator or just a single state (where each state knows how many code units have been parsed without emitting an output code point, so that these code units can be decoded by relative indexed accesses) is not relevant; it is just a very minor optimization case. In my opinion, using an accumulator that can live in a CPU register is faster than using relative indexed accesses: all modern CPUs have enough registers to store that accumulator plus the input and output pointers, and a finite state number is not needed when the state can be tracked by the executable instruction position, where you don't necessarily need to loop for each code unit but can easily write your decoder so that each iteration processes a full code point or emits a single U+FFFD before adjusting the input pointer. UTF-8 and UTF-16 are simple enough that unrolling such loops to process full code points instead of single code units is easy to implement: the code will still remain very small (fitting fully in the instruction cache), and it will be faster because it avoids several conditional branches and saves one register (for the finite state number) that would otherwise need to be slowly saved on a stack. Two pointer registers (or two access function/method addresses), two data registers, and the PC instruction counter are enough.
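For illustration only, here is a compact sketch of the loop shape Philippe describes: each iteration either consumes one full code point or emits a single U+FFFD and resynchronizes at the next byte. It follows his per-code-unit resynchronization scheme, so the U+FFFD counts it produces can differ from the maximal-subsequence best practice (e.g. for truncated sequences at end of input).

    def decode_utf8_resync(data: bytes) -> str:
        # One full code point (or one U+FFFD) per iteration; on error,
        # emit a single U+FFFD and restart scanning at the next byte.
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                       # ASCII fast path
                out.append(chr(b)); i += 1; continue
            # Lead byte table: number of trail bytes and the restricted
            # range for the first trail byte (excludes overlongs,
            # surrogates and values above U+10FFFF).
            if 0xC2 <= b <= 0xDF:   need, lo, hi = 1, 0x80, 0xBF
            elif b == 0xE0:         need, lo, hi = 2, 0xA0, 0xBF
            elif b == 0xED:         need, lo, hi = 2, 0x80, 0x9F
            elif 0xE1 <= b <= 0xEF: need, lo, hi = 2, 0x80, 0xBF
            elif b == 0xF0:         need, lo, hi = 3, 0x90, 0xBF
            elif 0xF1 <= b <= 0xF3: need, lo, hi = 3, 0x80, 0xBF
            elif b == 0xF4:         need, lo, hi = 3, 0x80, 0x8F
            else:                              # invalid lead (80..C1, F5..FF)
                out.append('\uFFFD'); i += 1; continue
            if i + need >= n or not (lo <= data[i+1] <= hi) or \
               any(not (0x80 <= data[i+1+k] <= 0xBF) for k in range(1, need)):
                out.append('\uFFFD'); i += 1; continue
            cp = b & (0x7F >> (need + 1))      # lead mask: 0x1F/0x0F/0x07
            for k in range(1, need + 1):
                cp = (cp << 6) | (data[i+k] & 0x3F)
            out.append(chr(cp)); i += need + 1
        return ''.join(out)

The point of the shape is that the "state" lives entirely in the instruction position and two locals; there is no separate state variable to carry between iterations.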
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Hello everybody,

[using this mail to in effect reply to different mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote:
>> Under what circumstance would it matter how many U+FFFDs you see?
>
> Maybe it doesn't, but I don't think the burden of proof should be on the
> person advocating keeping the spec and major implementations as they are.
> If anything, I think those arguing for a change of the spec in the face of
> browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current
> spec should show why it's important to have a different number of U+FFFDs
> than the spec's "best practice" calls for now.

I have just checked (the programming language) Ruby. Some background: As you might know, Ruby is (at least in theory) pretty encoding-independent, meaning you can run scripts in iso-8859-1, in Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, without any conversion. However, in practice, incl. Ruby on Rails, Ruby is very much using UTF-8 internally, and is optimized to work well that way. Character encoding conversion also works with UTF-8 as the pivot encoding.

As far as I understand, Ruby does the same as all of the above software, based (among else) on the fact that we followed the recommendation in the standard. Here are a few examples (sorry for the linebreaks introduced by mail software):

$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD"

$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", invalid: :replace).inspect'
#=> "A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html as noted at https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516&view=markup#l1507 (for those having a look at these tests: in Ruby's version of assert_equal, the expected value comes first; not sure whether this is called little-endian or big-endian :-), but this is a decision where the various test frameworks are virtually split 50/50 :-(.)

Even if the above examples and the tests use conversion to UTF-16 (in particular the BE variant, for better readability), what happens internally is that the input is analyzed byte-by-byte. In this case, it is easiest to just stop as soon as something is found that is clearly invalid (be this a single byte or something longer). This makes a data-driven implementation (such as the Ruby transcoder) or one based on a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) more compact. In other words, because we never know whether the next byte is a valid one such as 0x41, it's easier to just handle one byte at a time if this way we can avoid lookahead (which is always a good idea when parsing).

I agree with Henri and others that there is no need at all to change the recommendation in the standard that has been stable for so long (close to 9 years). Because the original was done on a PR (http://www.unicode.org/review/pr-121.html), I think this should at least also be handled as a PR (if it's not dropped based on the discussion here).
I think changing the current definition of "maximal subsequence" is a bad idea, because it would mean that one wouldn't know what one was speaking about over the years. If necessary, new definitions should be introduced for other variants.

I agree with others that ICU should not be considered to have a special status; it should be just one implementation among others.

[The next point is a side issue, please don't spend too much time on it.] I find it particularly strange that at a time when UTF-8 is firmly defined as up to 4 bytes, never including any bytes above 0xF4, the Unicode consortium would want to consider recommending that <FD 81 82 83 84 85> be converted to a single U+FFFD. I note with agreement that Markus seems to have thoughts in the same direction, because the proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes above F4 could be somewhat debatable.)".

Regards, Martin.
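Python 3 - one of the implementations cited in this thread as following the current recommendation - reports the same U+FFFD counts for these test bytes; a quick check for comparison (output as observed on CPython 3.5):

    # Python 3 equivalents of the Ruby checks above; the "replace" error
    # handler substitutes U+FFFD per the current TUS recommendation.
    tests = [
        b"\xF0\xAF",                          # truncated four-byte sequence
        b"\xE0\x80\x80",                      # E0 requires A0..BF as its second byte
        b"\xF4\x90\x80\x80",                  # would encode above U+10FFFF
        b"\xFD\x81\x82\x83\x84\x85",          # invalid lead byte FD
        b"\x41\xC0\xAF\x41\xF4\x80\x80\x41",  # valid and ill-formed bytes mixed
    ]
    for t in tests:
        print(ascii(t.decode("utf-8", errors="replace")))
    # '\ufffd'
    # '\ufffd\ufffd\ufffd'
    # '\ufffd\ufffd\ufffd\ufffd'
    # '\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
    # 'A\ufffd\ufffdA\ufffdA'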
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
> (again, both reasonable). However, I’m not entirely certain about things
> like
>
>   e0 e0 c3 89
>
> which the proposal would appear to decode as
>
>   U+FFFD U+FFFD U+FFFD U+FFFD (3)
>
> instead of a perhaps more reasonable
>
>   U+FFFD U+FFFD U+00C9 (4)
>
> (the key part is the “without ever restricting trail bytes to less than
> 80..BF”)

I also agree with that, due to access in strings from random positions: if you access the string at byte 0x89, you can assume it's a trailing byte and will want to look backward; you'll see 0xC3 0x89, which decodes correctly as U+00C9 without any error detected. So the wrong bytes are only the initial two occurrences of 0xE0, which are individually converted to U+FFFD.

In summary: when you detect an ill-formed sequence, only replace the first code unit by U+FFFD and restart scanning from the next code unit, without skipping over multiple bytes. This means that emitting multiple occurrences of U+FFFD is not only the best practice; it also matches the intended design of UTF-8 to allow access from random positions.
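A sketch of the backward scan Philippe appeals to - the self-synchronization property of UTF-8, where trail bytes, and only trail bytes, match 10xxxxxx - using a hypothetical helper name for illustration:

    def codepoint_start(data: bytes, pos: int) -> int:
        # Back up from an arbitrary byte offset to the start of the
        # enclosing sequence: skip backward over trail bytes (10xxxxxx),
        # at most three of them.
        steps = 0
        while pos > 0 and (data[pos] & 0xC0) == 0x80 and steps < 3:
            pos -= 1
            steps += 1
        return pos

    data = b"\xe0\xe0\xc3\x89"        # Philippe's example
    print(codepoint_start(data, 3))   # 2: 0x89 is a trail byte, back up to 0xC3
    print(data[2:].decode("utf-8"))   # 'É' (U+00C9) decodes cleanly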
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton wrote:
> On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote:
>>
>> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton wrote:
>>> That would be true if the in-memory representation had any effect on what
>>> we’re talking about, but it really doesn’t.
>>
>> If the internal representation is UTF-16 (or UTF-32), it is a likely
>> design that there is a variable into which the scalar value of the
>> current code point is accumulated during UTF-8 decoding.
>
> That’s quite a likely design with a UTF-8 internal representation too; it’s
> just that you’d only decode during processing, as opposed to immediately at
> input.

The time to generate the U+FFFDs is at input time, which is what's at issue here. The later processing, which may then involve iterating by code point and computing the scalar values, is a different step that should be able to assume valid UTF-8 and not be concerned with invalid UTF-8. (To what extent different programming languages and frameworks allow confident maintenance of the invariant that after input all in-RAM UTF-8 can be treated as valid varies.)

>> When the internal representation is UTF-8, only UTF-8 validation is
>> needed, and it's natural to have a fail-fast validator, which *doesn't
>> necessarily need such a scalar value accumulator at all*.
>
> Sure. But a state machine can still contain appropriate error states without
> needing an accumulator.

As I said upthread, it could, but it seems inappropriate to ask implementations to take on that extra complexity on as weak grounds as "ICU does it" or "feels right" when the current recommendation doesn't call for those extra states and the current spec is consistent with a number of prominent non-ICU implementations, including Web browsers.

>>> In what sense is this “interop”?
>>
>> In the sense that prominent independent implementations do the same
>> externally observable thing.
>
> The argument is, I think, that in this case the thing they are doing is the
> *wrong* thing.

It seems weird to characterize following the currently-specced "best practice" as "wrong" without showing a compelling fundamental flaw (such as a genuine security problem) in the currently-specced "best practice". With implementations of the currently-specced "best practice" already shipped, I don't think aesthetic preferences should be considered enough of a reason to proclaim behavior adhering to the currently-specced "best practice" as "wrong".

> That many of them do it would only be an argument if there was some reason
> that it was desirable that they did it. There doesn’t appear to be such a
> reason, unless you can think of something that hasn’t been mentioned thus far?

I've already given a reason: UTF-8 validation code not needing to have extra states catering to aesthetic considerations of U+FFFD consolidation.

> The only reason you’ve given, to date, is that they currently do that, so
> that should be the recommended behaviour (which is little different from the
> argument - which nobody deployed - that ICU currently does the other thing,
> so *that* should be the recommended behaviour; the only difference is that
> *you* care about browsers and don’t care about ICU, whereas you yourself
> suggested that some of us might be advocating this decision because we care
> about ICU and not about e.g. browsers).

Not just browsers. Also OpenJDK and Python 3.
Do I really need to test the standard libraries of more languages/systems to more strongly make the case that the ICU behavior (according to the proposal PDF) is not the norm and what the spec currently says is?

> I’ll add also that even among the implementations you cite, some of them
> permit surrogates in their UTF-8 input (i.e. they’re actually processing
> CESU-8, not UTF-8 anyway). Python, for example, certainly accepts the
> sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail”
> implementation that conformed literally to the recommendation, as you seem to
> want, should instead replace it with *four* U+FFFDs (I think), no?

I see that behavior in Python 2. Earlier, I said that Python 3 agrees with the current spec for my test case. The Python 2 behavior I see is not just against "best practice" but obviously noncompliant. (For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.)

> One additional note: the standard codifies this behaviour as a
> *recommendation*, not a requirement.

This is an odd argument in favor of changing it. If the argument is that it's just a recommendation that you don't need to adhere to, surely then the people who don't like the current recommendation should choose not to adhere to it instead of advocating changing it.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
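For reference, here is how that surrogate-pair sequence behaves on CPython 3 (a quick check; output as observed on 3.5):

    # CESU-8-style surrogate pair for U+1F600; ill-formed as UTF-8 because
    # a lead byte of ED may only be followed by a trail byte in 80..9F.
    seq = b"\xed\xa0\xbd\xed\xb8\x80"

    try:
        seq.decode("utf-8")            # strict mode rejects it outright
    except UnicodeDecodeError as e:
        print(e)                       # "invalid continuation byte"

    # With errors="replace", each of the six bytes ends up as its own
    # maximal ill-formed subsequence, i.e. six U+FFFDs here:
    print(ascii(seq.decode("utf-8", errors="replace")))
    # '\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'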
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote:
>
> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton wrote:
>> That would be true if the in-memory representation had any effect on what
>> we’re talking about, but it really doesn’t.
>
> If the internal representation is UTF-16 (or UTF-32), it is a likely
> design that there is a variable into which the scalar value of the
> current code point is accumulated during UTF-8 decoding.

That’s quite a likely design with a UTF-8 internal representation too; it’s just that you’d only decode during processing, as opposed to immediately at input.

> When the internal representation is UTF-8, only UTF-8 validation is
> needed, and it's natural to have a fail-fast validator, which *doesn't
> necessarily need such a scalar value accumulator at all*.

Sure. But a state machine can still contain appropriate error states without needing an accumulator. That the ones you care about currently don’t is readily apparent, but there’s nothing stopping them from doing so.

I don’t see this as an argument about implementations, since it really makes very little difference to the implementation which approach is taken; in both internal representations, the question is whether you generate U+FFFD immediately on detection of the first incorrect *byte*, or whether you do so after reading a complete sequence. UTF-8 sequences are bounded anyway, so it isn’t as if failing early gives you any significant performance benefit.

>> In what sense is this “interop”?
>
> In the sense that prominent independent implementations do the same
> externally observable thing.

The argument is, I think, that in this case the thing they are doing is the *wrong* thing. That many of them do it would only be an argument if there was some reason that it was desirable that they did it. There doesn’t appear to be such a reason, unless you can think of something that hasn’t been mentioned thus far?

The only reason you’ve given, to date, is that they currently do that, so that should be the recommended behaviour (which is little different from the argument - which nobody deployed - that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don’t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers).

I’ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they’re actually processing CESU-8, not UTF-8 anyway). Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *four* U+FFFDs (I think), no?

One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 10:29, David Starner wrote:
>
> On Tue, May 16, 2017 at 1:45 AM Alastair Houghton wrote:
>> That’s true anyway; imagine the database holds raw bytes, that just happen to
>> decode to U+FFFD. There might seem to be *two* names that both contain
>> U+FFFD in the same place. How do you distinguish between them?
>
> If the database holds raw bytes, then the name is a byte string, not a
> Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule
> to make and enforce that a string in a database is a validly formatted
> string; I would hope that most SQL servers do in fact reject malformed UTF-8
> strings. On the other hand, I'd expect that an SQL server would accept
> U+FFFD in a Unicode string.

Databases typically separate the encoding in which strings are stored from the encoding in which an application connected to the database is operating. A database might well hold data in (say) ISO Latin 1, EUC-JP, or indeed any other character set, while presenting it to a client application as UTF-8 or UTF-16. Hence my comment - application software could very well see two names that are apparently identical and that include U+FFFDs in the same places, even though the database back-end actually has different strings. As I said, this is a problem we already have.

>> I don’t see a problem; the point is that where a structurally valid UTF-8
>> encoding has been used, albeit in an invalid manner (e.g. encoding a number
>> that is not a valid code point, or encoding a valid code point as an
>> over-long sequence), a single U+FFFD is appropriate. That seems a perfectly
>> sensible rule to adopt.
>
> It seems like a perfectly arbitrary rule to adopt; I'd like to assume that
> the only source of such UTF-8 data is willful attempts to break security, and
> in that case, how is this a win? Nonattack sources of broken data are much
> more likely to be the result of mixing UTF-8 with other character encodings
> or raw binary data.

I’d say there are three sources of UTF-8 data of that ilk:

(a) bugs,
(b) “Modified UTF-8” and “CESU-8” implementations,
(c) wilful attacks

(b) in particular is quite common, and the result of the presently recommended approach doesn’t make much sense there ([c0 80] will get replaced with *two* U+FFFDs, while [ed a0 bd ed b8 80] will be replaced by *four* U+FFFDs - surrogates aren’t supposed to be valid in UTF-8, right?)

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 1:45 AM Alastair Houghton <alast...@alastairs-place.net> wrote:

> That’s true anyway; imagine the database holds raw bytes, that just happen
> to decode to U+FFFD. There might seem to be *two* names that both contain
> U+FFFD in the same place. How do you distinguish between them?

If the database holds raw bytes, then the name is a byte string, not a Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule to make and enforce that a string in a database is a validly formatted string; I would hope that most SQL servers do in fact reject malformed UTF-8 strings. On the other hand, I'd expect that an SQL server would accept U+FFFD in a Unicode string.

> I don’t see a problem; the point is that where a structurally valid UTF-8
> encoding has been used, albeit in an invalid manner (e.g. encoding a number
> that is not a valid code point, or encoding a valid code point as an
> over-long sequence), a single U+FFFD is appropriate. That seems a
> perfectly sensible rule to adopt.

It seems like a perfectly arbitrary rule to adopt; I'd like to assume that the only source of such UTF-8 data is willful attempts to break security, and in that case, how is this a win? Nonattack sources of broken data are much more likely to be the result of mixing UTF-8 with other character encodings or raw binary data.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 09:18, David Starner wrote:
>
> On Tue, May 16, 2017 at 12:42 AM Alastair Houghton wrote:
>> If you’re about to mutter something about security, consider this: security
>> code *should* refuse to compare strings that contain U+FFFD (or at least
>> should never treat them as equal, even to themselves), because it has no way
>> to know what that code point represents.
>
> Which causes various other security problems; if an object (file, database
> element, etc.) gets a name with a FFFD in it, it becomes impossible to
> reference. That an IEEE 754 float may not equal itself is a perpetual source
> of confusion for programmers.

That’s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them?

Clearly if you are holding Unicode code points that you know are validly encoded somehow, you may want to be able to match U+FFFDs, but that’s a special case where you have extra knowledge.

> In this case, it's pretty clear, but I don't see it as a general rule. Any
> rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake
> or random binary data.

I don’t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt.

The proposal actually does cover things that aren’t structurally valid, like your e0 e0 e0 example, which it suggests should be a single U+FFFD because the initial e0 denotes a three-byte sequence, and your 80 80 80 example, which it proposes should constitute three illegal subsequences (again, both reasonable). However, I’m not entirely certain about things like

  e0 e0 c3 89

which the proposal would appear to decode as

  U+FFFD U+FFFD U+FFFD U+FFFD (3)

instead of a perhaps more reasonable

  U+FFFD U+FFFD U+00C9 (4)

(the key part is the “without ever restricting trail bytes to less than 80..BF”), and if Markus or others could explain why they chose (3) over (4) I’d be quite interested to hear the explanation.

Kind regards,

Alastair.

--
http://alastairs-place.net
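For what it's worth, implementations following the current recommendation produce option (4); a quick check on CPython 3 (output as observed):

    # Alastair's example: two stray E0 lead bytes, then a well-formed
    # two-byte sequence C3 89 (U+00C9).
    data = b"\xe0\xe0\xc3\x89"

    # Under the current best practice each E0 is a maximal ill-formed
    # subsequence of length one (E0 cannot be followed by E0 or C3),
    # and C3 89 decodes normally - option (4) above.
    print(ascii(data.decode("utf-8", errors="replace")))
    # '\ufffd\ufffd\xc9'  ->  U+FFFD U+FFFD U+00C9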
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote:
> but I think the way he raises this point is needlessly antagonistic.

I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton wrote:
> That would be true if the in-memory representation had any effect on what
> we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), it is a likely design that there is a variable into which the scalar value of the current code point is accumulated during UTF-8 decoding. In such a scenario, it can be argued as "natural" to first operate according to the general structure of UTF-8 and then inspect what you got in the accumulation variable (ruling out non-shortest forms, values above the Unicode range and surrogate values after the fact).

When the internal representation is UTF-8, only UTF-8 validation is needed, and it's natural to have a fail-fast validator, which *doesn't necessarily need such a scalar value accumulator at all*. The construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/, when used as a UTF-8 validator, is the best illustration of a UTF-8 validator not necessarily looking like a "natural" UTF-8 to UTF-16 converter at all.

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>>> test with three major browsers that use UTF-16 internally and have
>>> independent (of each other) implementations of UTF-8 decoding
>>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>>> Unicode standard away from that kind of interop needs *way* better
>>> rationale than "feels right”.
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same externally observable thing.

> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on the person advocating keeping the spec and major implementations as they are. If anything, I think those arguing for a change of the spec in the face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current spec should show why it's important to have a different number of U+FFFDs than the spec's "best practice" calls for now.

> If you’re about to mutter something about security, consider this: security
> code *should* refuse to compare strings that contain U+FFFD (or at least
> should never treat them as equal, even to themselves), because it has no way
> to know what that code point represents.

In practice, e.g. the Web Platform doesn't allow for stopping operating on input that contains a U+FFFD, so the focus is mainly on making sure that U+FFFDs are placed well enough to prevent bad stuff under normal operations. At least typically, the number of U+FFFDs doesn't matter for that purpose, but when browsers agree on the number of U+FFFDs, changing that number should have an overwhelmingly strong rationale. A security reason could be a strong reason, but such a security motivation for fewer U+FFFDs has not been shown, to my knowledge.
> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD (2)

I advocate (1), most simply because that's what Firefox, Edge and Chrome do *in accordance with the currently-recommended best practice* and, less simply, because it makes sense in the presence of a fail-fast UTF-8 validator. I think the burden of proof to show an overwhelmingly good reason to change should, at this point, be on whoever proposes doing it differently than what the current widely-implemented spec says.

> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in insisting
> that it must be decoded to *three* code points when it clearly only
> represented one in the input.

As noted previously, the logic is that you generate a U+FFFD whenever a fail-fast validator fails.

> This isn’t just a matter of “feels nicer”. (1) is simply illogical
> behaviour, and since behaviours (1) and (2) are both clearly out there today,
> it makes sense to pick the more logical alternative as the official
> recommendation.

Again, the current best practice makes perfect logical sense in the context of a fail-fast UTF-8 validator. Moreover, it doesn't look like both are "out there" equally when major browsers, OpenJDK and Python 3 agree. (I expect I could find more prominent implementations that implement the currently-stated best practice, but I feel I shouldn't have to.) From my experience of working on Web standards and implementing them, I think it's a bad idea to change something to be "more logical"
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <alast...@alastairs-place.net> wrote:

> If you’re about to mutter something about security, consider this:
> security code *should* refuse to compare strings that contain U+FFFD (or at
> least should never treat them as equal, even to themselves), because it has
> no way to know what that code point represents.

Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers.

> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
>   U+FFFD (2)
>
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in
> insisting that it must be decoded to *three* code points when it clearly
> only represented one in the input.

In this case, it's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data. 88 A0 8B D4 is UTF-16 Chinese, but I'm not going to insist that it get replaced with U+FFFD U+FFFD because it's clear (to me) it was meant as two characters.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 10:01:03 +0300
Henri Sivonen via Unicode wrote:

> Even so, I think even changing a recommendation of "best practice"
> needs way better rationale than "feels right" or "ICU already does it"
> when a) major browsers (which operate in the most prominent
> environment of broken and hostile UTF-8) agree with the
> currently-recommended best practice and b) the currently-recommended
> best practice makes more sense for implementations where "UTF-8
> decoding" is actually mere "UTF-8 validation".

There was originally an attempt to prescribe rather than to recommend the interpretation of ill-formed 8-bit Unicode strings. It may even briefly have been an issued prescription, until common sense prevailed. I do remember a sinking feeling when I thought I would have to change my own handling of bogus UTF-8, only to be relieved later when it became mere best practice. However, it is not uncommon for coding standards to prescribe 'best practice'.

Richard.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <unicode@unicode.org> wrote:

> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote:
>> I’m not sure how the discussion of “which is better” relates to the
>> discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, UTF-8 as the internal memory representation
> is *such a good design* (when legacy constraints permit) that, *despite
> it not being the current dominant design*, I think the Unicode
> Consortium should be fully supportive of UTF-8 as the internal memory
> representation and not treat UTF-16 as the internal representation as
> the one true way of doing things that gets considered when speccing
> stuff.
>
> I.e. I wasn't arguing against UTF-16 as the internal memory
> representation (for the purposes of this thread) but trying to motivate
> why the Consortium should consider "UTF-8 internally" equally despite
> it not being the dominant design.
>
> So: When a decision could go either way from the "UTF-16 internally"
> perspective, but one way clearly makes more sense from the "UTF-8
> internally" perspective, the "UTF-8 internally" perspective should be
> decisive in *such a case*. (I think the matter at hand is such a case.)
>
> At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.
>
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would be
> appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome)

Something I've learned through working with Node (V8, the JavaScript engine from Chrome): V8 stores strings either as UTF-16 OR UTF-8, interchangeably, and is not one OR the other...
https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY

And I wouldn't really assume UTF-16 is a 'majority'; Go is UTF-8, for instance.

> shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 08:22, Asmus Freytag via Unicode wrote:
> I therefore think that Henri has a point when he's concerned about tacit
> assumptions favoring one memory representation over another, but I think the
> way he raises this point is needlessly antagonistic.

That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t. (The only time I can think of that the in-memory representation has a significant effect is where you’re talking about default binary ordering of string data, in which case, in the presence of non-BMP characters, UTF-8 and UCS-4 sort the same way, but because the surrogates are “in the wrong place”, UTF-16 doesn’t. I think everyone is well aware of that, no?)

>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>> test with three major browsers that use UTF-16 internally and have
>> independent (of each other) implementations of UTF-8 decoding
>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>> Unicode standard away from that kind of interop needs *way* better
>> rationale than "feels right”.

In what sense is this “interop”? Under what circumstance would it matter how many U+FFFDs you see?

If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.

Would you advocate replacing

  e0 80 80

with

  U+FFFD U+FFFD U+FFFD (1)

rather than

  U+FFFD (2)

It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input.

This isn’t just a matter of “feels nicer”. (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 15 May 2017, at 23:43, Richard Wordingham via Unicode wrote:
>
> The problem with surrogates is inadequate testing. They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered. It's not always obvious that code is designed for UCS-2
> rather than UTF-16.

While I don’t think we should spend too long debating the relative merits of UTF-8 versus UTF-16, I’ll note that that argument applies equally to both combining characters and indeed the underlying UTF-8 encoding in the first place, and that mistakes in handling both are not exactly uncommon. There are advantages to UTF-8 and advantages to UTF-16.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen wrote:
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".

Testing with that file, Python 3 and OpenJDK 8 agree with the currently-specced best practice, too. I expect there to be other well-known implementations that comply with the currently-specced best practice, so the rationale to change the stated best practice would have to be very strong (as in: a security problem with the currently-stated best practice) for a change to be appropriate.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote:
>> I’m not sure how the discussion of “which is better” relates to the
>> discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the underlying
> issue. I'll clarify what I meant on that point and then move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, UTF-8 as the internal memory representation is
> *such a good design* (when legacy constraints permit) that, *despite it
> not being the current dominant design*, I think the Unicode Consortium
> should be fully supportive of UTF-8 as the internal memory representation
> and not treat UTF-16 as the internal representation as the one true way
> of doing things that gets considered when speccing stuff.

There are cases where it is prohibitive to transcode external data from UTF-8 to any other format as a precondition to doing any work. In these situations processing has to be done in UTF-8, effectively making that the in-memory representation. I've encountered this issue on separate occasions, both for my own code as well as code I reviewed for clients.

I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic.

> At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.

This is a key point. It may not be directly relevant to any other modifications to the standard, but the larger point is to not make assumptions about how people implement the standard (or any of the algorithms).

> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a standard
> that has many widely-deployed implementations. Even when observing only
> "UTF-16 internally" implementations, I think it would be appropriate for
> the proposal to include a review of what existing implementations, beyond
> ICU, do.

I would like to second this as well. The level of documented review of existing implementation practices tends to be thin (at least thinner than should be required for changing long-established edge cases or recommendations, let alone core conformance requirements).

> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".

It would be good if the UTC could work out some minimal requirements for evaluating proposals for changes to properties and algorithms, much like the criteria for encoding new code points.

A./
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 15 May 2017, at 23:16, Shawn Steele via Unicode wrote:
>
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

It doesn’t, which is a point I made in my original reply to Henri. The only reason I answered his anti-UTF-16 rant at all was to point out that some of us don’t think UTF-16 is a mistake, and in fact can see various benefits (*particularly* as an in-memory representation).

> And to the last, saying “you cannot process UTF-16 without handling
> surrogates” seems to me to be the equivalent of saying “you cannot process
> UTF-8 without handling lead & trail bytes”. That’s how the respective
> encodings work.

Quite.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson wrote:
> On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>>
>> The proposal is to make ICU's spec violation conforming. I think there
>> is both a technical and a political reason why the proposal is a bad
>> idea.
>
> Henri's claim that "The proposal is to make ICU's spec violation
> conforming" is a false statement, and hence all further commentary based
> on this false premise is irrelevant.
>
> I believe that ICU is actually currently conforming to TUS.

Do you mean that ICU's behavior differs from what the PDF claims (I didn't test and took the assertion in the PDF about behavior at face value) or do you mean that despite deviating from the currently-recommended best practice the behavior is conforming, because the relevant part of the spec is mere best practice and not a requirement?

> TUS has certain requirements for UTF-8 handling, and it has certain other
> "Best Practices" as detailed in 3.9. The proposal involves changing those
> recommendations. It does not involve changing any requirements.

Even so, I think even changing a recommendation of "best practice" needs way better rationale than "feels right" or "ICU already does it" when a) major browsers (which operate in the most prominent environment of broken and hostile UTF-8) agree with the currently-recommended best practice and b) the currently-recommended best practice makes more sense for implementations where "UTF-8 decoding" is actually mere "UTF-8 validation".

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
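[For concreteness, a minimal sketch (not from the thread; it relies on Python 3, whose strict decoder reports the offset and extent of the first ill-formed subpart) of what "UTF-8 decoding as mere validation" can look like. The validator leaves the bytes in place and only records error spans, resuming right after each reported subpart, which is exactly the resynchronisation the current best practice assumes.]

    # Sketch: "decoding" as in-place validation plus error spans.
    # Function name is illustrative, not from any real library.
    def utf8_error_spans(data: bytes):
        spans, i = [], 0
        while i < len(data):
            try:
                data[i:].decode('utf-8')   # strict decode: fails at the first error
                break                      # no (further) errors: the rest is valid
            except UnicodeDecodeError as e:
                # e.start/e.end are relative to data[i:], so rebase them.
                spans.append((i + e.start, i + e.end))
                i += e.end                 # resume right after the reported subpart
        return spans

[Quadratic in the worst case; a real validator would use a state machine, but the per-subpart resynchronisation points are the same.]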
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote:
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I meant on that point and then move on: I acknowledge that UTF-16 as the internal memory representation is the dominant design. However, UTF-8 as the internal memory representation is *such a good design* (when legacy constraints permit) that, *despite it not being the current dominant design*, I think the Unicode Consortium should be fully supportive of UTF-8 as the internal memory representation and not treat UTF-16 as the internal representation as the one true way of doing things that gets considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory representation (for the purposes of this thread) but trying to motivate why the Consortium should consider "UTF-8 internally" equally despite it not being the dominant design. So: When a decision could go either way from the "UTF-16 internally" perspective, but one way clearly makes more sense from the "UTF-8 internally" perspective, the "UTF-8 internally" perspective should be decisive in *such a case*. (I think the matter at hand is such a case.) At the very least a proposal should discuss the impact on the "UTF-8 internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of implementations) issue to be decided but a proposed change to a standard that has many widely-deployed implementations. Even when observing only "UTF-16 internally" implementations, I think it would be appropriate for the proposal to include a review of what existing implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right".

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
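[The behaviour the browsers agree on can be reproduced outside a browser: recent Python 3 happens to implement the currently-recommended best practice in its 'replace' error handler. The example bytes below are made up for illustration; they are not the exact input on the test page.]

    # One U+FFFD per maximal subpart, as currently recommended.
    assert b'\xe0\x80\x80'.decode('utf-8', 'replace') == '\ufffd' * 3
    # An overlong encoding: E0 cannot be followed by 80, so each byte is
    # its own maximal subpart -> three U+FFFDs under the current spec
    # (the proposal would instead produce a single U+FFFD here).

    assert b'\xf0\x9f\x92'.decode('utf-8', 'replace') == '\ufffd'
    # A truncated but valid-so-far 4-byte prefix is one maximal subpart
    # -> a single U+FFFD under both the current spec and the proposal.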
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:
> In reference to:
> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>
> I think Unicode should not adopt the proposed change.
>
> The proposal is to make ICU's spec violation conforming. I think there
> is both a technical and a political reason why the proposal is a bad
> idea.

Henri's claim that "The proposal is to make ICU's spec violation conforming" is a false statement, and hence all further commentary based on this false premise is irrelevant.

I believe that ICU is actually currently conforming to TUS. The proposal reads: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8..." There is nothing in here that requires any implementation to be changed. The word "recommend" does not mean the same as "require". Have you guys been so caught up in the current international political situation that you have lost the ability to read straight?

TUS has certain requirements for UTF-8 handling, and it has certain other "Best Practices" as detailed in 3.9. The proposal involves changing those recommendations. It does not involve changing any requirements.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Software designed for UCS-2 only, without real UTF-16 support, is still in use today.

For example, MySQL with its broken "UTF-8" encoding, which in fact encodes supplementary characters as two separate 16-bit code units for surrogates, each one blindly encoded as a 3-byte sequence that would be ill-formed in standard UTF-8. It also does not distinguish invalid pairs of surrogates, and offers no collation support for supplementary characters.

In this case some other software will break silently on these sequences. For example, MediaWiki, when installed with a MySQL backend server whose datastore was created with this broken "UTF-8", will silently discard any text starting at the first supplementary character found in the wikitext. This is not a problem in MediaWiki itself: MediaWiki does NOT support a MySQL server installed with its "UTF-8" datastore, and only supports MySQL if the storage encoding declared for the database was "binary" (but in that case there's no collation support in MySQL; texts are just arbitrary sequences of bytes, and internationalization is then done in the client software, here MediaWiki and its PHP, ICU, or Lua libraries, and other tools written in Perl and other languages).

Note that this does not affect Wikimedia in its wikis, because they were initially installed correctly with the binary encoding in MySQL; besides, Wikimedia wikis now use another database engine with native UTF-8 support and full coverage of the UCS. Other wikis using MediaWiki will need to upgrade their MySQL version if they want to keep it for administrative reasons (and not convert their datastore to the binary encoding again).

Software running with only UCS-2 is exposed to risks similar to the one seen in MediaWiki on incorrect MySQL installations: any user may edit a page to insert a supplementary character (supplementary sinograms, emojis, Gothic letters, supplementary symbols...) which will look correct when previewing, and correct when it is parsed, and is accepted silently by MySQL, but is then silently truncated because of the encoding error: when reloading the data from MySQL, there will effectively be unexpectedly discarded data.

How to react to the risks of data loss or truncation? Throwing an exception or just returning an error is in fact more dangerous than replacing the ill-formed sequences by one or more U+FFFD: we preserve as much as possible. In any case, software should be able to perform some tests on its datastore to see whether it handles the encoding correctly. This could be done when starting the software, emitting log messages when the backend does not support the encoding: all that is needed is to send a single supplementary character to the remote datastore, in a junk table or field, and then retrieve it immediately in another transaction to make sure it is preserved. (See the sketch after this message.)

Similar tests can be done to see whether the remote datastore also preserves the encoding form, or normalizes it, or alters it (such alteration could happen with a leading BOM, and other silent alterations could be made on NULL and trailing spaces if the datastore uses fixed-length rather than varying-length text fields). Similar tests could be done to check the maximum length accepted: a VARCHAR(256) on a binary-encoded database will not always store 256 Unicode characters, but in a database encoded with non-broken UTF-8 it should store 256 code points independently of their values, even if their UTF-8 encoding takes up to 1024 bytes.
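[The start-up self-test described above could look something like this. A sketch only, assuming a PEP 249-style database driver; the connection object, the %s placeholder style and the table name are hypothetical, not tied to any particular backend.]

    PROBE = '\U0001F4A9'  # any supplementary-plane character will do

    def datastore_preserves_supplementary(conn) -> bool:
        """Round-trip one supplementary character through the datastore."""
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS _encoding_probe (v TEXT)")
        cur.execute("DELETE FROM _encoding_probe")
        cur.execute("INSERT INTO _encoding_probe (v) VALUES (%s)", (PROBE,))
        conn.commit()                      # write in one transaction...
        cur.execute("SELECT v FROM _encoding_probe")
        (value,) = cur.fetchone()          # ...read back in another
        return value == PROBE              # False: log it and refuse to run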
2017-05-16 0:43 GMT+02:00 Richard Wordingham via Unicode <unicode@unicode.org>:
> On Mon, 15 May 2017 21:38:26 +0000
> David Starner via Unicode wrote:
>
>>> and the fact is that handling surrogates (which is what proponents
>>> of UTF-8 or UCS-4 usually focus on) is no more complicated than
>>> handling combining characters, which you have to do anyway.
>
>> Not necessarily; you can legally process Unicode text without worrying
>> about combining characters, whereas you cannot process UTF-16 without
>> handling surrogates.
>
> The problem with surrogates is inadequate testing. They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered. It's not always obvious that code is designed for UCS-2
> rather than UTF-16.
>
> Richard.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-15 19:54 GMT+02:00 Asmus Freytag via Unicode:
> I think this political reason should be taken very seriously. There are
> already too many instances where ICU can be seen "driving" the
> development of properties and algorithms.
>
> Those involved in the ICU project may not see the problem, but I agree
> with Henri that it requires a bit more sensitivity from the UTC.

I don't think that the fact that ICU was originally using UTF-16 internally has ANY effect on the decision to represent ill-formed sequences as single or multiple U+FFFD. The internal encoding has nothing in common with the external encoding used when processing input data (which may be UTF-8, UTF-16, or UTF-32, and could in every case present ill-formed sequences). The internal encoding plays no role here in how to convert the ill-formed input, or in whether it will be converted at all.

So yes, independently of the internal encoding, we'll still have to choose between:
- not converting the input, and returning an error or throwing an exception;
- converting the input using a single U+FFFD (in its internal representation, this does not matter) to replace the complete sequence of ill-formed code units in the input data, and preferably returning an error status;
- converting the input using as many U+FFFD (in its internal representation, this does not matter) as needed to replace every occurrence of ill-formed code units in the input data, and preferably returning an error status.
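[The three options can be lined up concretely. A sketch: Python's built-in error handlers happen to cover the first and third options, while the second (a single U+FFFD per ill-formed run) is essentially what the proposal recommends and has no built-in equivalent.]

    data = b'abc\xe0\x80\x80def'   # contains an overlong, ill-formed sequence

    # Option 1: don't convert; report an error / raise an exception.
    try:
        data.decode('utf-8')                   # strict mode is the default
    except UnicodeDecodeError as e:
        print('ill-formed input at byte', e.start)

    # Option 3: one U+FFFD per ill-formed subpart (the current best
    # practice, which recent Python 3 implements as the 'replace' handler).
    assert data.decode('utf-8', 'replace') == 'abc\ufffd\ufffd\ufffddef'

    # Option 2 (a single U+FFFD for the whole ill-formed run) would yield
    # 'abc\ufffddef'; it has no built-in handler and needs custom code.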
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Mon, 15 May 2017 21:38:26 +0000
David Starner via Unicode wrote:

>> and the fact is that handling surrogates (which is what proponents
>> of UTF-8 or UCS-4 usually focus on) is no more complicated than
>> handling combining characters, which you have to do anyway.

> Not necessarily; you can legally process Unicode text without worrying
> about combining characters, whereas you cannot process UTF-16 without
> handling surrogates.

The problem with surrogates is inadequate testing. They're sufficiently rare for many users that it may be a long time before an error is discovered. It's not always obvious that code is designed for UCS-2 rather than UTF-16.

Richard.
RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
I’m not sure how the discussion of “which is better” relates to the discussion of ill-formed UTF-8 at all.

And to the last, saying “you cannot process UTF-16 without handling surrogates” seems to me to be the equivalent of saying “you cannot process UTF-8 without handling lead & trail bytes”. That’s how the respective encodings work. One could look at it and think “there are 128 Unicode characters that have the same value in UTF-8 as UTF-32,” and “there are xx thousand Unicode characters that have the same value in UTF-16 and UTF-32.”

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner via Unicode
Sent: Monday, May 15, 2017 2:38 PM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode <unicode@unicode.org> wrote:
> Yes, UTF-8 is more efficient for primarily ASCII text, but that is not
> the case for other situations

UTF-8 is clearly more efficient space-wise for anything that includes more ASCII characters than characters between U+0800 and U+FFFF. Given the prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic will pretty much always be smaller in UTF-8. Even for scripts that go from 2 bytes to 3, webpages can get much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a factor of 1.8). The max change in reverse is 1.5, as two bytes go to three.

> and the fact is that handling surrogates (which is what proponents of
> UTF-8 or UCS-4 usually focus on) is no more complicated than handling
> combining characters, which you have to do anyway.

Not necessarily; you can legally process Unicode text without worrying about combining characters, whereas you cannot process UTF-16 without handling surrogates.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode <unicode@unicode.org> wrote:
> Yes, UTF-8 is more efficient for primarily ASCII text, but that is not
> the case for other situations

UTF-8 is clearly more efficient space-wise for anything that includes more ASCII characters than characters between U+0800 and U+FFFF. Given the prevalence of spaces and ASCII punctuation, Latin, Greek, Cyrillic, Hebrew and Arabic will pretty much always be smaller in UTF-8. Even for scripts that go from 2 bytes to 3, webpages can get much smaller in UTF-8 (http://www.gov.cn/ goes from 63k in UTF-8 to 116k in UTF-16, a factor of 1.8). The max change in reverse is 1.5, as two bytes go to three.

> and the fact is that handling surrogates (which is what proponents of
> UTF-8 or UCS-4 usually focus on) is no more complicated than handling
> combining characters, which you have to do anyway.

Not necessarily; you can legally process Unicode text without worrying about combining characters, whereas you cannot process UTF-16 without handling surrogates.
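[The size claims are easy to check. A rough illustration; the sample strings below are made up, not taken from the pages measured above.]

    # Cyrillic letters are 2 bytes in both UTF-8 and UTF-16, but the
    # spaces and punctuation are 1 byte in UTF-8, so UTF-8 wins overall.
    ru = 'Пример текста с пробелами и пунктуацией.'
    print(len(ru.encode('utf-8')), len(ru.encode('utf-16-le')))

    # BMP CJK is the worst case in reverse: 3 bytes in UTF-8 vs 2 in
    # UTF-16, i.e. the factor of 1.5 mentioned above.
    zh = '中文网页正文'
    print(len(zh.encode('utf-8')), len(zh.encode('utf-16-le')))  # 18 vs 12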
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote:
>>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>>> representative of implementation concerns of implementations that use
>>> UTF-8 as their in-memory Unicode representation.
>>>
>>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>>> representation, which makes concerns of such implementations very
>>> relevant, I think the Unicode Consortium should acknowledge that
>>> UTF-16 was, in retrospect, a mistake
>>
>> You may think that. There are those of us who do not.
>
> My point is: The proposal seems to arise from the "UTF-16 as the
> in-memory representation" mindset. While I don't expect that case in any
> way to go away, I think the Unicode Consortium should recognize the
> serious technical merit of the "UTF-8 as the in-memory representation"
> case as having significant enough merit that proposals like this should
> consider impact to both cases equally despite the "UTF-8 as the
> in-memory representation" case at present appearing to be the minority
> case. That is, I think it's wrong to view things only or even primarily
> through the lens of the "UTF-16 as the in-memory representation" case
> that ICU represents.

UTF-16 has some nice properties, and there's no need to brand it a "mistake". UTF-8 has different nice properties, but there's equally no reason to treat it as more special than UTF-16. The UTC should adopt a position of perfect neutrality when it comes to assuming in-memory representation; in other words, it should not make assumptions that optimizing for any one encoding form will benefit implementers.

UTC, where ICU is strongly represented, needs to guard against basing encoding/properties/algorithm decisions (edge cases mostly) solely or primarily on the needs of a particular implementation that happens to be chosen by the ICU project.

A./
RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
>> Disagree. An over-long UTF-8 sequence is clearly a single error.
>> Emitting multiple errors there makes no sense.
>
> Changing a specification as fundamental as this is something that should
> not be undertaken lightly.

IMO, the only thing that can be agreed upon is that "something's bad with this UTF-8 data". I think that whether it's treated as a single group of corrupt bytes, or each individual byte is considered a problem, should be up to the implementation.

#1 - This data should "never happen". In a system behaving normally, this condition should never be encountered.
* At this point the data is "bad" and all bets are off.
* Some applications may have a clue how the bad data could have happened and want to do something in particular.
* It seems odd to me to spend much effort standardizing a scenario that should be impossible.

#2 - Depending on implementation, either behavior, or some combination, may be more efficient. I'd rather allow apps to optimize for the common case, not the case-that-shouldn't-ever-happen.

#3 - We have no clue if this "maximal" sequence was a single error, 2 errors, or even more. The lead byte says how many trail bytes should follow, and those should be in a certain range. Values outside of those conditions are illegal, so we shouldn't ever encounter them. So if we did, then something really weird happened.
* Did a single character get misencoded?
* Was an illegal sequence illegally encoded?
* Perhaps a byte got corrupted in transmission?
* Maybe we dropped a packet/block, so this is really the beginning of a valid sequence and the tail of another completely valid sequence?

In practice, all that most apps would be able to do would be to say "You have bad data; how bad, I have no clue, but it's not right". A single bit could've flipped, or you could have only 3 pages of a 4000-page document. No clue at all. At that point it doesn't really matter how many FFFDs the error(s) are replaced with, and no assumptions should be made about the severity of the error.

-Shawn
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>
> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting
> multiple errors there makes no sense.

The currently-specced behavior makes perfect sense when you add error emission on top of a fail-fast UTF-8 validation state machine.

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementations very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
>
> You may think that. There are those of us who do not.

My point is: The proposal seems to arise from the "UTF-16 as the in-memory representation" mindset. While I don't expect that case in any way to go away, I think the Unicode Consortium should recognize the serious technical merit of the "UTF-8 as the in-memory representation" case as having significant enough merit that proposals like this should consider impact to both cases equally despite the "UTF-8 as the in-memory representation" case at present appearing to be the minority case. That is, I think it's wrong to view things only or even primarily through the lens of the "UTF-16 as the in-memory representation" case that ICU represents.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
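[To make the fail-fast point concrete, a sketch (illustrative only, not any browser's actual code) of such a state machine with error emission bolted on: on the first byte that cannot continue the current sequence, it emits one U+FFFD, resets, and reconsiders that byte as a potential lead, which yields exactly the per-maximal-subpart U+FFFDs of the current best practice.]

    # The second byte's valid range depends on the lead byte (TUS Table
    # 3-7); all later continuation bytes are plain 0x80-0xBF.
    SECOND = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
              0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}

    def decode_utf8_replacing(data: bytes) -> str:
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                       # ASCII fast path
                out.append(chr(b)); i += 1; continue
            if 0xC2 <= b <= 0xDF:
                need, first = 1, (0x80, 0xBF)
            elif 0xE0 <= b <= 0xEF:
                need, first = 2, SECOND.get(b, (0x80, 0xBF))
            elif 0xF0 <= b <= 0xF4:
                need, first = 3, SECOND.get(b, (0x80, 0xBF))
            else:                              # byte can never start a sequence
                out.append('\ufffd'); i += 1; continue
            cp, j, ok = b & (0x3F >> need), i + 1, True
            for k in range(need):
                lo, hi = first if k == 0 else (0x80, 0xBF)
                if j + k >= len(data) or not lo <= data[j + k] <= hi:
                    out.append('\ufffd')       # fail fast: one U+FFFD...
                    i = j + k                  # ...then retry at this byte
                    ok = False
                    break
                cp = (cp << 6) | (data[j + k] & 0x3F)
            if ok:
                out.append(chr(cp)); i = j + need
        return ''.join(out)

[On the overlong sequence E0 80 80 this produces three U+FFFDs (E0 fails fast, then each 80 is a bogus lead), matching the browser behaviour discussed earlier; no extra states are needed beyond the validator's own.]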
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 15 May 2017, at 18:52, Asmus Freytag wrote:
>
> On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
>> On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote:
>>> In reference to:
>>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>>
>>> I think Unicode should not adopt the proposed change.
>>
>> Disagree. An over-long UTF-8 sequence is clearly a single error.
>> Emitting multiple errors there makes no sense.
>
> Changing a specification as fundamental as this is something that should
> not be undertaken lightly.

Agreed.

> Apparently we have a situation where implementations disagree, and have
> done so for a while. This normally means not only that the
> implementations differ, but that data exists in both formats.
>
> Even if it were true that all data is only stored in UTF-8, any data
> converted from UTF-8 back to UTF-8 going through an interim stage that
> requires UTF-8 conversion would then be different based on which
> converter is used.
>
> Implementations working in UTF-8 natively would potentially see three
> formats:
> 1) the original ill-formed data
> 2) data converted with single FFFD
> 3) data converted with multiple FFFD
>
> These forms cannot be compared for equality by binary matching.

But that was always true, if you were under the impression that only one of (2) and (3) existed, and indeed claiming equality between two instances of U+FFFD might be problematic itself in some circumstances (you don’t know why the U+FFFDs were inserted - they may not replace the same original data).

> The best that can be done is to convert (1) into one of the other forms
> and then compare treating any run of FFFD code points as equal to any
> other run, irrespective of length.

It’s probably safer, actually, to refuse to compare U+FFFD as equal to anything (even itself) unless a special flag is passed. For “general purpose” applications, you could set that flag and then a single U+FFFD would compare equal to another single U+FFFD; no need for the complicated “any string of U+FFFD” logic (which in any case makes little sense - it could just as easily generate erroneous comparisons as fix the case we’re worrying about here).

> Because we've had years of multiple implementations, it would be
> expected that copious data exists in all three formats, and that data
> will not go away. Changing the specification to pick one of these
> formats as solely conformant is IMHO too late.

I don’t think so. Even if we acknowledge the possibility of data in the other form, I think it’s useful guidance to implementers, both now and in the future. One might even imagine that the other, non-favoured form would eventually fall out of use.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 5/15/2017 3:21 AM, Henri Sivonen via Unicode wrote:
> Second, the political reason:
>
> Now that ICU is a Unicode Consortium project, I think the Unicode
> Consortium should be particularly sensitive to biases arising from being
> both the source of the spec and the source of a popular implementation.
> It looks *really bad*, both in terms of the equal footing of ICU vs.
> other implementations for the purpose of how the standard is developed,
> and in terms of the reliability of the standard text vs. ICU source code
> as the source of truth that other implementors need to pay attention to,
> if the way the Unicode Consortium resolves a discrepancy between ICU
> behavior and a well-known spec provision (this isn't some ill-known
> corner case, after all) is by changing the spec instead of changing ICU,
> *especially* when the change is not neutral for implementations that
> have made different but completely valid per then-existing spec and, in
> the absence of legacy constraints, superior architectural choices
> compared to ICU (i.e. UTF-8 internally instead of UTF-16 internally).
>
> I can see the irony of this viewpoint coming from a WHATWG-aligned
> browser developer, but I note that even browsers that use ICU for legacy
> encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior isn't, in
> fact, the dominant browser UTF-8 behavior. That is, even Blink and
> WebKit use their own non-ICU UTF-8 decoder. The Web is the environment
> that's the most sensitive to how issues like this are handled, so it
> would be appropriate for the proposal to survey current browser behavior
> instead of just saying that ICU "feels right" or is "natural".

I think this political reason should be taken very seriously. There are already too many instances where ICU can be seen "driving" the development of properties and algorithms.

Those involved in the ICU project may not see the problem, but I agree with Henri that it requires a bit more sensitivity from the UTC.

A./
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote:
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>
> Disagree. An over-long UTF-8 sequence is clearly a single error.
> Emitting multiple errors there makes no sense.

Changing a specification as fundamental as this is something that should not be undertaken lightly.

Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats.

Even if it were true that all data is only stored in UTF-8, any data converted from UTF-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used.

Implementations working in UTF-8 natively would potentially see three formats:
1) the original ill-formed data
2) data converted with single FFFD
3) data converted with multiple FFFD

These forms cannot be compared for equality by binary matching. The best that can be done is to convert (1) into one of the other forms and then compare, treating any run of FFFD code points as equal to any other run, irrespective of length. (For security-critical applications, the presence of any FFFD should render the data invalid, so the comparisons we'd be talking about here would be for general purposes, like search.)

Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late.

A./

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementations very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
>
> You may think that. There are those of us who do not. The fact is that
> UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is
> more efficient for primarily ASCII text, but that is not the case for
> other situations, and the fact is that handling surrogates (which is
> what proponents of UTF-8 or UCS-4 usually focus on) is no more
> complicated than handling combining characters, which you have to do
> anyway.
>
>> Therefore, despite UTF-16 being widely used as an in-memory
>> representation of Unicode and in no way going away, I think the
>> Unicode Consortium should be *very* sympathetic to technical
>> considerations for implementations that use UTF-8 as the in-memory
>> representation of Unicode.
>
> I don’t think the Unicode Consortium should be unsympathetic to people
> who use UTF-8 internally, for sure, but I don’t see what that has to do
> with either the original proposal or with your criticism of UTF-16.
>
> [snip]
>
>> If the proposed change was adopted, while Draconian decoders (that
>> fail upon first error) could retain their current state machine,
>> implementations that emit U+FFFD for errors and continue would have to
>> add more state machine states (i.e. more complexity) to consolidate
>> more input bytes into a single U+FFFD even after a valid sequence is
>> obviously impossible.
>
> “Impossible”? Why? You just need to add some error states (or *an*
> error state and a counter); it isn’t exactly difficult, and I’m sure
> ICU isn’t the only library that already did just that *because it’s
> clearly the right thing to do*.
>
> Kind regards,
>
> Alastair.
>
> --
> http://alastairs-place.net
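[The run-insensitive comparison Asmus describes is simple to realize. A sketch; whether one *should* compare this way is exactly what is being debated in these two messages.]

    import re

    _FFFD_RUN = re.compile('\ufffd+')

    def lenient_equal(a: str, b: str) -> bool:
        # Collapse any run of U+FFFD to a single one so that data decoded
        # with "single FFFD" and "multiple FFFD" converters can still match.
        return _FFFD_RUN.sub('\ufffd', a) == _FFFD_RUN.sub('\ufffd', b)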
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote:
>
> In reference to:
> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>
> I think Unicode should not adopt the proposed change.

Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense.

> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
>
> Even though there are notable systems (Win32, Java, C#, JavaScript, ICU,
> etc.) that are stuck with UTF-16 as their in-memory representation,
> which makes concerns of such implementations very relevant, I think the
> Unicode Consortium should acknowledge that UTF-16 was, in retrospect, a
> mistake

You may think that. There are those of us who do not. The fact is that UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations, and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway.

> Therefore, despite UTF-16 being widely used as an in-memory
> representation of Unicode and in no way going away, I think the Unicode
> Consortium should be *very* sympathetic to technical considerations for
> implementations that use UTF-8 as the in-memory representation of
> Unicode.

I don’t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don’t see what that has to do with either the original proposal or with your criticism of UTF-16.

[snip]

> If the proposed change was adopted, while Draconian decoders (that fail
> upon first error) could retain their current state machine,
> implementations that emit U+FFFD for errors and continue would have to
> add more state machine states (i.e. more complexity) to consolidate more
> input bytes into a single U+FFFD even after a valid sequence is
> obviously impossible.

“Impossible”? Why? You just need to add some error states (or *an* error state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the only library that already did just that *because it’s clearly the right thing to do*.

Kind regards,

Alastair.

--
http://alastairs-place.net
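[The "error state and a counter" Alastair mentions can be sketched as follows (illustrative only, not ICU's actual code): when a sequence turns out to be ill-formed, remember how many bytes the lead byte announced and keep absorbing continuation bytes up to that count, so the whole sequence becomes a single U+FFFD, which is the behaviour the proposal recommends.]

    def skip_ill_formed(data: bytes, i: int, announced: int) -> int:
        """Return the index just past the ill-formed sequence starting at i.

        `announced` is the total length the lead byte promised (2, 3 or 4).
        Continuation bytes (0x80-0xBF) are absorbed up to that count, so
        e.g. the overlong E0 80 80 is consumed as one error, not three.
        The caller emits exactly one U+FFFD for data[i:returned_index].
        """
        end = min(i + announced, len(data))
        j = i + 1
        while j < end and 0x80 <= data[j] <= 0xBF:
            j += 1
        return j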