Martin Kochanski wrote:
> Are there, in fact, many circumstances in which it is
> necessary for an end user to create files that do *not* have
> a BOM at the beginning?
AFAIK, UTF-8 files are NOT supposed to have a BOM in them. Why is UTF-16 perceived as Unicode? Well, we all know it's because UCS-2 used to be the ONLY implementation of Unicode. But there is another important difference between UTF-16 and UTF-8: it is barely possible to misinterpret UTF-16, because it uses shorts and not bytes, whereas UTF-8 and ASCII are, in extreme cases, identical.

Why not have a BOM in UTF-8? Probably because of the applications that don't really need to know that a file is in UTF-8, especially since it may be pure ASCII in many cases (e.g. system configuration files). And if Unicode is THE codeset to be used in the future, then at some point in time all files would begin with a UTF-8 BOM. Quite unnecessary. Further problems arise when you concatenate files or start reading in the middle.

To be honest, "Unicode" meaning UTF-16 and "UTF-8" are fine with me; it's what I am used to. For UNIX users, UTF-8 is just like EUC or ISO-8859-x: another codeset. The fact that it is universal does not mean it has to be called Unicode; I think "UTF-8" is just fine and equally (or more) useful. And on UNIX, it is essential that the user is aware of the codeset that is being used.

I keep seeing files being used as examples. Think filesystems, file names. File names would surely not start with a BOM, even if files did. Suppose you have a script that will create some files; it is published on the web, and you want to save it so you can run it. Now it is up to you how to save it. If you use UTF-8 filenames, you do not want to save it as some ISO encoding, nor as just any Unicode, but precisely as UTF-8. The shell will execute the script and use byte sequences from the file to create filenames.

Now, an opposite example. You execute ls > ls.out in a directory that has some filenames (say, of old files) in ISO and many others in UTF-8. What format is the resulting file in? Well, since this is happening in the year 2016, the editor will assume it's in UTF-8.
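The "UTF-8 and ASCII are in extreme cases identical" point is easy to demonstrate. A minimal sketch in Python (my illustration, not part of the discussion; the byte values come straight from the UTF-8 definition):

```python
# Pure ASCII content is byte-for-byte identical whether you call it
# ASCII or UTF-8; on disk there is nothing to tell them apart.
text = "system configuration file"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
assert ascii_bytes == utf8_bytes

# The optional UTF-8 signature some tools prepend is the three bytes
# EF BB BF (U+FEFF encoded in UTF-8). Its presence changes the bytes,
# which is exactly what breaks concatenation and mid-file reads.
utf8_with_bom = text.encode("utf-8-sig")
assert utf8_with_bom == b"\xef\xbb\xbf" + utf8_bytes
print(utf8_with_bom[:3].hex())  # efbbbf
```

So a BOM is the only in-band marker that would distinguish the two, and that is precisely the marker the configuration-file and concatenation cases cannot tolerate.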
We already agreed there are no BOMs in files unless they are UTF-16, so the file must be UTF-8, just like (almost) everything else is. Even if BOMs were used, should this file have one? Anyway, some invalid sequences will be encountered by the editor, but then hopefully it will simply display some replacement characters (or ask if it may do so). Hopefully it will allow me to save the file with the invalid sequences intact. Editing invalid sequences (or inserting new ones) would be too much to ask, right?

What bothers me a little is that I would not be able to save such a file as UTF-16, because of the invalid sequences in it. Why would I want to? Well, Windows has more and more support for UTF-8, so maybe I don't really need to. I still wish I had the option, though. This again makes me think that UTF-8 and UTF-16 are not both Unicode. Maybe UTF-16 is 'more' Unicode right now, because of the past. But maybe UTF-8 will be 'more' Unicode in the future, because it can contain invalid sequences, and these can be properly interpreted by someone at a later time. Unless UTF-16 gains that same ability, it will lose the battle of being an 'equally good Unicode format'.

And why do I keep this in the "Unicode and end users" thread? Because invalid sequences (and old filenames) are a fact that users WILL experience, and pretending that this is just a case of non-conformance is not in the best interest of the users.

Lars Kristan
Storage & Data Management Lab
HERMES SoftLab
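P.S. The asymmetry claimed above, that a UTF-8 file with invalid sequences can survive a round trip while UTF-16 has no place to put them, can be sketched with modern tools. A minimal sketch, assuming Python 3's surrogateescape error handler as the preservation mechanism (an illustration with today's machinery, not something available at the time):

```python
# An old ISO-8859-1 filename byte (0xE9, 'e' with acute) embedded in
# otherwise-ASCII text: not a valid UTF-8 sequence.
raw = b"old-\xe9-name"

# Strict decoding rejects it outright.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # expected: 0xE9 is an invalid start byte here

# surrogateescape smuggles the bad byte through as a lone surrogate
# (U+DCE9), so an editor could load the file...
s = raw.decode("utf-8", errors="surrogateescape")

# ...and save it back with the invalid sequence intact.
assert s.encode("utf-8", errors="surrogateescape") == raw

# But the same text cannot be saved as UTF-16: a lone surrogate is
# not encodable there, so the invalid bytes have nowhere to go.
try:
    s.encode("utf-16")
    print("saved as UTF-16")
except UnicodeEncodeError:
    print("cannot save as UTF-16")
```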

