Lars Kristan wrote:
> Now, an opposite example. You execute ls > ls.out, in a directory that
> has some filenames (say, old files) in ISO and many others in UTF-8.
> What format is the resulting file in? Well, since this is happening in
> the year 2016, the editor will assume it's in UTF-8. We already agreed
> there are no BOMs in files unless they are UTF-16, so the file must be
> UTF-8 just like (almost) everything else is. Even if BOMs were used,
> should this file have one? Anyway, some invalid sequences will be
> encountered by the editor, but then hopefully it will simply display
> some replacement characters (or ask if it can do so). Hopefully it will
> allow me to save the file, with invalid sequences intact. Editing
> invalid sequences (or inserting new ones) would be too much to ask,
> right?
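[For concreteness, the behaviour being asked for here - load a file that
is assumed to be UTF-8 but may contain invalid sequences, keep those
sequences, and write them back out byte-for-byte - can be sketched as
follows. This is only an illustration, not part of any proposal: it is
Python, it uses the "surrogateescape" error handler as the internal
marker for invalid bytes (essentially the UTF-8B trick discussed below),
and "ls.out" is just the filename from the quoted example.

    # Read: each invalid byte 0xNN becomes an internal marker (the lone
    # surrogate U+DCNN) instead of causing the whole file to be rejected.
    # newline="" avoids newline translation, so the round trip is exact.
    with open("ls.out", encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        text = f.read()

    # ... display/edit the text; the markers can be shown as replacement
    # characters without being destroyed ...

    # Write: the same handler turns the markers back into the original
    # bytes, so an unedited file is saved byte-for-byte identical.
    with open("ls.out", "w", encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        f.write(text)

    # Saving as strict UTF-16 is refused if any markers are present,
    # because they are not valid scalar values:
    try:
        text.encode("utf-16")
    except UnicodeEncodeError:
        print("refusing to save as UTF-16: ill-formed sequences present")

Whether that last step *should* be refused is exactly the question
discussed below.]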
I tried to address this in my Unicode 3.2 comments (part 2); I think it's
an important issue, given the tightening of the specification of UTF-8.
Among other changes (which these partly depend on), I suggested adding
the following notes to C12:

>  - The error condition that results from an ill-formed code sequence
>    need not cause all of the input in which the error occurs to be
>    rejected. It is permitted to store text in a form that allows
>    ill-formed code sequences to be regenerated when the text is output,
>    but only if this output is in the same Unicode Transformation
>    Format as the original ill-formed input.

The reason for requiring that ill-formed sequences only be regenerated if
the output UTF is the same as the input UTF is that otherwise we would
effectively be endorsing non-standard UTFs that are not bijective.

For example, consider the "UTF-8B" proposal for round-trip conversion of
UTF-8 -> UTF-16 -> UTF-8 (do a search in the archive of this list). That
method is fine *provided that the non-standard UTF-16 that it produces
can only appear internally*. If it can appear externally and is
interpreted by general-purpose UTF-16 -> UTF-8 encoders, then that would
create the same multiple-representation problem for UTF-16 that Unicode
3.2 is trying to fix for UTF-8, because the UTF-16 -> UTF-8B conversion
is not one-to-one.

The above text tries to prohibit this, while not prohibiting, say, a
UTF-16-based text editor that uses UTF-8B internally in order to read and
write UTF-8 files without destroying ill-formed sequences. The latter is
harmless and does not create any multiple-representation problems.

>    It is also permitted to
>    replace an ill-formed code sequence by a code reserved by the
>    implementation for that purpose, for example by a noncharacter code.

Should a specific code be reserved for this? It is not the same thing as
U+FFFD REPLACEMENT CHARACTER, even though that is what some transcoders
use. Plan 9 calls it "rune_error" and uses U+0080, IIRC. I suggest
U+FDEF.

[I've thought about this a bit more, and I'm now convinced that it's
useful to have a separate, standardised code for this - say, U+FDEF
ILL-FORMED INPUT MARKER. (Can noncharacters have names?) It's desirable
that this be a noncharacter, in order to reflect the fact that it
indicates an error condition rather than a "real" character, and to
preserve the property that, given the other changes I suggested, every
real character corresponds to exactly one code sequence in each
fixed-byte-order UTF.]

>    Ill-formed sequences should not be deleted, however, since that
>    introduces similar security concerns to those described for
>    noncharacters in the notes to clause C10.

[Those notes said:

>    For example, suppose that a security check is performed that
>    involves testing for the substring ".." (<U+002E, U+002E>) in a
>    file name. If it is left ambiguous whether or not noncharacters are
>    to be deleted, then the string <U+002E, U+FFFE, U+002E> could
>    potentially pass this check, but still be treated as equivalent to
>    ".." by the filesystem. Accordingly, the required behaviour for
>    processes that receive noncharacters in input has been changed (see
>    C5 above): this should either cause an error, or the noncharacters
>    should be treated as unassigned; they must not be automatically
>    deleted. This does not preclude a higher-level protocol from
>    specifying explicitly that a string should be modified by deleting
>    noncharacters at a well-defined stage of its processing.
]
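[A toy illustration of that concern, again only a sketch in Python; the
check and the "sanitising" step are made up, but the shape of the bug is
the point:

    def contains_dotdot(name: str) -> bool:
        # Security check performed on the name as received.
        return ".." in name

    def strip_noncharacters(name: str) -> str:
        # A later layer that silently deletes noncharacters such as
        # U+FFFE (the behaviour being argued against).
        return name.replace("\ufffe", "")

    attacker_supplied = ".\ufffe."   # <U+002E, U+FFFE, U+002E>

    assert not contains_dotdot(attacker_supplied)
    assert strip_noncharacters(attacker_supplied) == ".."
    # The check passed, yet ".." is what reaches the filesystem.

If the check and the deletion happen in different processes or libraries,
neither looks obviously wrong on its own; that is why the requirement has
to be "raise an error or treat as unassigned, never silently delete".]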
>  - Transformations between the Unicode 3.2 versions of UTF-8, UTF-16
>    and UTF-32 are bijections between the corresponding sets of valid
>    (i.e. not ill-formed) code sequences.

[This should just say "well-formed code sequences", where "well-formed"
is defined somewhere as "not ill-formed".]

>    Ill-formed code sequences detected during transformation are
>    treated as error conditions as described above.

> What bothers me a little bit is that I would not be able to save such
> a file as UTF-16 because of the invalid sequences in it.

Right - the above proposal forbids using ill-formed UTF-16 to represent
the invalid UTF-8 sequences externally, and explains why. (Would it be
helpful for the rationale to be included in the standard itself?)

> Why would I? Well, Windows has more and more support for UTF-8, so
> maybe I don't really need to. I still wish I had an option, though.

If a file (or any string) has ill-formed sequences that have come from
some non-UTF charset, retaining them is normally only useful if the file
is written out exactly as it was read. If you load it as UTF-8[B] but try
to save it as UTF-16, you're asserting that the output really is UTF-16;
if it is not, then IMHO the program should refuse to do that. (The text I
proposed does allow the file to be saved in a different UTF with the
ill-formed sequences replaced by an ILL-FORMED INPUT MARKER, though.)

Doug Ewell wrote:
> "Lars Kristan" <[EMAIL PROTECTED]> wrote:
> > Why not have a BOM in UTF-8? Probably because of the applications
> > that don't really need to know that a file is in UTF-8, especially
> > since it may be pure ASCII in many cases (e.g. system configuration
> > files). And if Unicode is THE codeset to be used in the future, then
> > at some point in time all files would begin with a UTF-8 BOM. Quite
> > unnecessary. Further problems arise when you concatenate files or
> > start reading in the middle.
>
> That's why U+2060 WORD JOINER is being introduced in Unicode 3.2.
> Hopefully it will take over the ZWNBSP semantics from U+FEFF, which can
> then be used *solely* as a BOM. Eventually, if this happens, it will
> become safe to strip BOMs as they appear.

No, it won't: silently stripping characters, without considering that to
be a change to the string, is a potential security problem. It's unlikely
that this would be a problem at the start of a *file*, but "UTF-16" in
the sense of the IANA-registered charset of that name (i.e. swap byte
order every time you see U+FFFE, and strip U+FEFF anywhere it appears) is
simply a bad idea IMHO.
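[The same shape of problem as the ".." example above, again only a
sketch in Python; the blacklist check is made up:

    FORBIDDEN = "delete"

    def is_allowed(command: str) -> bool:
        # Check performed on the command as received.
        return FORBIDDEN not in command

    def strip_feff(command: str) -> str:
        # A later layer that treats every U+FEFF as an ignorable
        # signature and silently removes it.
        return command.replace("\ufeff", "")

    cmd = "del\ufeffete"
    assert is_allowed(cmd)               # the check passes ...
    assert strip_feff(cmd) == "delete"   # ... but "delete" gets through

The problem is not what U+FEFF means; it is that the checking layer and
the stripping layer disagree about what the string is.]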
--
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has
been seized under the Regulation of Investigatory Powers Act; see
www.fipr.org/rip
