Martin Kochanski wrote:
> Are there, in fact, many circumstances in which it is
> necessary for an end user to create files that do *not* have
> a BOM at the beginning?
AFAIK, UTF-8 files are NOT supposed to have a BOM in them. Why is UTF-16 perceived as Unicode? Well, we all know it's because UCS-2 used to be the ONLY implementation of Unicode. But there is another important difference between UTF-16 and UTF-8: it is barely possible to misinterpret UTF-16, because it uses shorts and not bytes, whereas UTF-8 and ASCII are, in extreme cases, identical.

Why not have a BOM in UTF-8? Probably because of the applications that don't really need to know that a file is in UTF-8, especially since it may be pure ASCII in many cases (e.g. system configuration files). And if Unicode is THE codeset to be used in the future, then at some point in time all files would begin with a UTF-8 BOM. Quite unnecessary. Further problems arise when you concatenate files or start reading in the middle.

To be honest, "Unicode" meaning UTF-16 and "UTF-8" are fine with me; it's what I am used to. For UNIX users, UTF-8 is just like EUC or ISO-8859-x: another codeset. The fact that it is universal does not mean it has to be called Unicode; I think "UTF-8" is just fine and equally (or more) useful. And on UNIX, it is essential that the user is aware of the codeset that is being used.

I keep seeing files being used as examples. Think filesystems, file names. File names would surely not start with a BOM, even if files did. Suppose you have a script that will create some files; it is published on the web, and you want to save it so you can run it. Now it is up to you how to save it. If you use UTF-8 filenames, you do not want to save it as some ISO encoding, nor as just any Unicode, but precisely as UTF-8. The shell will execute the script and use byte sequences from the file to create filenames.

Now, an opposite example. You execute ls > ls.out in a directory that has some filenames (say, of old files) in ISO and many others in UTF-8. What format is the resulting file in? Well, since this is happening in the year 2016, the editor will assume it's in UTF-8.
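The "UTF-8 and ASCII are in extreme cases identical" point is easy to demonstrate. A minimal sketch in Python (my illustration, not part of the discussion; the byte values come straight from the UTF-8 definition):

```python
# Pure ASCII content is byte-for-byte identical whether you call it
# ASCII or UTF-8; on disk there is nothing to tell them apart.
text = "system configuration file"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
assert ascii_bytes == utf8_bytes

# The optional UTF-8 signature some tools prepend is the three bytes
# EF BB BF (U+FEFF encoded in UTF-8). Its presence changes the bytes,
# which is exactly what breaks concatenation and mid-file reads.
utf8_with_bom = text.encode("utf-8-sig")
assert utf8_with_bom == b"\xef\xbb\xbf" + utf8_bytes
print(utf8_with_bom[:3].hex())  # efbbbf
```

So a BOM is the only in-band marker that would distinguish the two, and that is precisely the marker the configuration-file and concatenation cases cannot tolerate.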
We already agreed there are no BOMs in files unless they are UTF-16, so the file must be UTF-8, just like (almost) everything else is. Even if BOMs were used, should this file have one? Anyway, some invalid sequences will be encountered by the editor, but then hopefully it will simply display some replacement characters (or ask if it may do so). Hopefully it will allow me to save the file with the invalid sequences intact. Editing invalid sequences (or inserting new ones) would be too much to ask, right?

What bothers me a little is that I would not be able to save such a file as UTF-16, because of the invalid sequences in it. Why would I want to? Well, Windows has more and more support for UTF-8, so maybe I don't really need to. I still wish I had the option, though. This again makes me think that UTF-8 and UTF-16 are not both Unicode. Maybe UTF-16 is 'more' Unicode right now, because of the past. But maybe UTF-8 will be 'more' Unicode in the future, because it can contain invalid sequences, and these can be properly interpreted by someone at a later time. Unless UTF-16 gains that same ability, it will lose the battle of being an 'equally good Unicode format'.

And why do I keep this in the "Unicode and end users" thread? Because invalid sequences (and old filenames) are a fact that users WILL experience, and pretending that this is just a case of non-conformance is not in the best interest of the users.

Lars Kristan
Storage & Data Management Lab
HERMES SoftLab
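P.S. The asymmetry claimed above, that a UTF-8 file with invalid sequences can survive a round trip while UTF-16 has no place to put them, can be sketched with modern tools. A minimal sketch, assuming Python 3's surrogateescape error handler as the preservation mechanism (an illustration with today's machinery, not something available at the time):

```python
# An old ISO-8859-1 filename byte (0xE9, 'e' with acute) embedded in
# otherwise-ASCII text: not a valid UTF-8 sequence.
raw = b"old-\xe9-name"

# Strict decoding rejects it outright.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # expected: 0xE9 is an invalid start byte here

# surrogateescape smuggles the bad byte through as a lone surrogate
# (U+DCE9), so an editor could load the file...
s = raw.decode("utf-8", errors="surrogateescape")

# ...and save it back with the invalid sequence intact.
assert s.encode("utf-8", errors="surrogateescape") == raw

# But the same text cannot be saved as UTF-16: a lone surrogate is
# not encodable there, so the invalid bytes have nowhere to go.
try:
    s.encode("utf-16")
    print("saved as UTF-16")
except UnicodeEncodeError:
    print("cannot save as UTF-16")
```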

