Lars Kristan wrote:
> Now, an opposite example. You execute ls > ls.out, in a directory that
> has some filenames (say, old files) in ISO and many others in UTF-8.
> What format is the resulting file in? Well, since this is happening in
> the year 2016, the editor will assume it's in UTF-8. We already agreed
> there are no BOMs in files unless they are UTF-16, so the file must be
> UTF-8 just like (almost) everything else is. Even if BOMs were used,
> should this file have one? Anyway, some invalid sequences will be
> encountered by the editor, but then hopefully it will simply display
> some replacement characters (or ask if it can do so). Hopefully it will
> allow me to save the file, with invalid sequences intact. Editing
> invalid sequences (or inserting new ones) would be too much to ask,
> right?
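[For concreteness, the behaviour being asked for here - load a file that
is assumed to be UTF-8 but may contain invalid sequences, keep those
sequences, and write them back out byte-for-byte - can be sketched as
follows. This is only an illustration, not part of any proposal: it is
Python, it uses the "surrogateescape" error handler as the internal
marker for invalid bytes (essentially the UTF-8B trick discussed below),
and "ls.out" is just the filename from the quoted example.

    # Read: each invalid byte 0xNN becomes an internal marker (the lone
    # surrogate U+DCNN) instead of causing the whole file to be rejected.
    # newline="" avoids newline translation, so the round trip is exact.
    with open("ls.out", encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        text = f.read()

    # ... display/edit the text; the markers can be shown as replacement
    # characters without being destroyed ...

    # Write: the same handler turns the markers back into the original
    # bytes, so an unedited file is saved byte-for-byte identical.
    with open("ls.out", "w", encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        f.write(text)

    # Saving as strict UTF-16 is refused if any markers are present,
    # because they are not valid scalar values:
    try:
        text.encode("utf-16")
    except UnicodeEncodeError:
        print("refusing to save as UTF-16: ill-formed sequences present")

Whether that last step *should* be refused is exactly the question
discussed below.]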
I tried to address this in my Unicode 3.2 comments (part 2); I think it's
an important issue, given the tightening of the specification of UTF-8.
Among other changes (which these partly depend on), I suggested adding
the following notes to C12:

>  - The error condition that results from an ill-formed code sequence
>    need not cause all of the input in which the error occurs to be
>    rejected. It is permitted to store text in a form that allows
>    ill-formed code sequences to be regenerated when the text is output,
>    but only if this output is in the same Unicode Transformation
>    Format as the original ill-formed input.

The reason for requiring that ill-formed sequences only be regenerated if
the output UTF is the same as the input UTF is that otherwise we would
effectively be endorsing non-standard UTFs that are not bijective.

For example, consider the "UTF-8B" proposal for round-trip conversion of
UTF-8 -> UTF-16 -> UTF-8 (do a search in the archive of this list). That
method is fine *provided that the non-standard UTF-16 that it produces
can only appear internally*. If it can appear externally and is
interpreted by general-purpose UTF-16 -> UTF-8 encoders, then that would
create the same multiple-representation problem for UTF-16 that Unicode
3.2 is trying to fix for UTF-8, because the UTF-16 -> UTF-8B conversion
is not one-to-one.

The above text tries to prohibit this, while not prohibiting, say, a
UTF-16-based text editor that uses UTF-8B internally in order to read and
write UTF-8 files without destroying ill-formed sequences. The latter is
harmless and does not create any multiple-representation problems.

>    It is also permitted to
>    replace an ill-formed code sequence by a code reserved by the
>    implementation for that purpose, for example by a noncharacter code.

Should a specific code be reserved for this? It is not the same thing as
U+FFFD REPLACEMENT CHARACTER, even though that is what some transcoders
use. Plan 9 calls it "rune_error" and uses U+0080, IIRC. I suggest
U+FDEF.

[I've thought about this a bit more, and I'm now convinced that it's
useful to have a separate, standardised code for this - say, U+FDEF
ILL-FORMED INPUT MARKER. (Can noncharacters have names?) It's desirable
that this be a noncharacter, in order to reflect the fact that it
indicates an error condition rather than a "real" character, and to
preserve the property that, given the other changes I suggested, every
real character corresponds to exactly one code sequence in each
fixed-byte-order UTF.]

>    Ill-formed sequences should not be deleted, however, since that
>    introduces similar security concerns to those described for
>    noncharacters in the notes to clause C10.

[Those notes said:

>    For example, suppose that a security check is performed that
>    involves testing for the substring ".." (<U+002E, U+002E>) in a
>    file name. If it is left ambiguous whether or not noncharacters are
>    to be deleted, then the string <U+002E, U+FFFE, U+002E> could
>    potentially pass this check, but still be treated as equivalent to
>    ".." by the filesystem. Accordingly, the required behaviour for
>    processes that receive noncharacters in input has been changed (see
>    C5 above): this should either cause an error, or the noncharacters
>    should be treated as unassigned; they must not be automatically
>    deleted. This does not preclude a higher-level protocol from
>    specifying explicitly that a string should be modified by deleting
>    noncharacters at a well-defined stage of its processing.
]
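[A toy illustration of that concern, again only a sketch in Python; the
check and the "sanitising" step are made up, but the shape of the bug is
the point:

    def contains_dotdot(name: str) -> bool:
        # Security check performed on the name as received.
        return ".." in name

    def strip_noncharacters(name: str) -> str:
        # A later layer that silently deletes noncharacters such as
        # U+FFFE (the behaviour being argued against).
        return name.replace("\ufffe", "")

    attacker_supplied = ".\ufffe."   # <U+002E, U+FFFE, U+002E>

    assert not contains_dotdot(attacker_supplied)
    assert strip_noncharacters(attacker_supplied) == ".."
    # The check passed, yet ".." is what reaches the filesystem.

If the check and the deletion happen in different processes or libraries,
neither looks obviously wrong on its own; that is why the requirement has
to be "raise an error or treat as unassigned, never silently delete".]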
>  - Transformations between the Unicode 3.2 versions of UTF-8, UTF-16
>    and UTF-32 are bijections between the corresponding sets of valid
>    (i.e. not ill-formed) code sequences.

[This should just say "well-formed code sequences", where "well-formed"
is defined somewhere as "not ill-formed".]

>    Ill-formed code sequences detected during transformation are
>    treated as error conditions as described above.

> What bothers me a little bit is that I would not be able to save such
> a file as UTF-16 because of the invalid sequences in it.

Right - the above proposal forbids using ill-formed UTF-16 to represent
the invalid UTF-8 sequences externally, and explains why. (Would it be
helpful for the rationale to be included in the standard itself?)

> Why would I? Well, Windows has more and more support for UTF-8, so
> maybe I don't really need to. I still wish I had an option, though.

If a file (or any string) has ill-formed sequences that have come from
some non-UTF charset, retaining them is normally only useful if the file
is written out exactly as it was read. If you load it as UTF-8[B] but try
to save it as UTF-16, you're asserting that the output really is UTF-16;
if it is not, then IMHO the program should refuse to do that. (The text I
proposed does allow the file to be saved in a different UTF with the
ill-formed sequences replaced by an ILL-FORMED INPUT MARKER, though.)

Doug Ewell wrote:
> "Lars Kristan" <[EMAIL PROTECTED]> wrote:
> > Why not have a BOM in UTF-8? Probably because of the applications
> > that don't really need to know that a file is in UTF-8, especially
> > since it may be pure ASCII in many cases (e.g. system configuration
> > files). And if Unicode is THE codeset to be used in the future, then
> > at some point in time all files would begin with a UTF-8 BOM. Quite
> > unnecessary. Further problems arise when you concatenate files or
> > start reading in the middle.
>
> That's why U+2060 WORD JOINER is being introduced in Unicode 3.2.
> Hopefully it will take over the ZWNBSP semantics from U+FEFF, which can
> then be used *solely* as a BOM. Eventually, if this happens, it will
> become safe to strip BOMs as they appear.

No, it won't: silently stripping characters, without considering that to
be a change to the string, is a potential security problem. It's unlikely
that this would be a problem at the start of a *file*, but "UTF-16" in
the sense of the IANA-registered charset of that name (i.e. swap byte
order every time you see U+FFFE, and strip U+FEFF anywhere it appears) is
simply a bad idea IMHO.
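[The same shape of problem as the ".." example above, again only a
sketch in Python; the blacklist check is made up:

    FORBIDDEN = "delete"

    def is_allowed(command: str) -> bool:
        # Check performed on the command as received.
        return FORBIDDEN not in command

    def strip_feff(command: str) -> str:
        # A later layer that treats every U+FEFF as an ignorable
        # signature and silently removes it.
        return command.replace("\ufeff", "")

    cmd = "del\ufeffete"
    assert is_allowed(cmd)               # the check passes ...
    assert strip_feff(cmd) == "delete"   # ... but "delete" gets through

The problem is not what U+FEFF means; it is that the checking layer and
the stripping layer disagree about what the string is.]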
--
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has
been seized under the Regulation of Investigatory Powers Act; see
www.fipr.org/rip
