On 19/10/09 08:29, pansz wrote:
>
> Tony Mechelynck wrote:
>> - If you want one particular file to be recognized as UTF-8 not only by
>> Vim but also by other programs (let's say by other Windows editors such
>> as WordPad; or by browsers if the files are in HTML, CSS or even
>> plaintext) it helps if you use ":setlocal bomb" (or maybe ":setlocal
>> fenc=utf-8 bomb") before saving the file. Note that the BOM consists of
>> bytes with the high bit set, so the following paragraph never applies to
>>
>> Best regards,
>> Tony.
>
> Note this does not work if you're a programmer. utf-8 files should *not*
> contain the BOM, otherwise, it may not compile with gcc.
>
> By definition and by original design, utf-8 files should not have BOM,
> you can use utf-8 BOM only if you view your file with your eye and do
> not process the file with any program (such as compiler or lex parser, etc.)

As I said, modern browsers recognise the BOM, even in UTF-8. IIUC, the
current HTML specifications mention it (to be used at the very start,
before <!DOCTYPE and before <html>) as one of the ways to recognise that
a page is in UTF-8 (the other two are the Content-Type header and the
<meta http-equiv="Content-Type" ...> tag). I'm not sure which one wins
when they disagree, but IIUC that case is foreseen too.

When I was on Win XP, the only way to have WordPad recognise a UTF-8
file as UTF-8 was to have a BOM at its start (and when saving as
"Unicode text", it would produce UTF-16le, or maybe UCS-2le, but never
UTF-8; however it had a BOM and Vim read it with no problem).

I'm also quoting one Q&A from the FAQ found on the unicode.org site:
http://www.unicode.org/faq/utf_bom.html#bom5

----8<----
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to
the endianness of the byte stream. UTF-8 always has the same byte order.
An initial BOM is only used as a signature -- an indication that an
otherwise unmarked text file is in UTF-8. Note that some recipients of
UTF-8 encoded data do not expect a BOM. Where UTF-8 is used
transparently in 8-bit environments, the use of a BOM will interfere
with any protocol or file format that expects specific ASCII characters
at the beginning, such as the use of "#!" at the beginning of Unix
shell scripts. [AF] & [MD]
---->8----

Maybe I ought to have mentioned the fact that bash indeed doesn't "see"
the #! shebang when there is a BOM before it; but OTOH I have also
noticed that (for instance) SeaMonkey 2 (or Firefox 3.5) displays a
*.txt file perfectly as UTF-8 when it has a BOM.
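
To make the two points above concrete (the BOM as a signature, and the
BOM hiding the "#!"), here is a minimal sketch in Python rather than
anything Vim-specific. The helper names has_utf8_bom and
starts_with_shebang and the sample script text are made up for
illustration; "utf-8-sig" is the name of the Python codec which writes
the BOM on encoding and strips it on decoding.

```python
import codecs

# The UTF-8 signature: U+FEFF encoded as the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature."""
    return data.startswith(BOM)


def starts_with_shebang(data: bytes) -> bool:
    """A "#!" line only counts if those are the very first two bytes."""
    return data.startswith(b"#!")


# Hypothetical sample: a shell script with one non-ASCII character.
script = "#!/bin/sh\necho h\u00e9llo\n"

plain = script.encode("utf-8")       # no BOM
signed = script.encode("utf-8-sig")  # same text, BOM prepended

print(has_utf8_bom(plain), starts_with_shebang(plain))
print(has_utf8_bom(signed), starts_with_shebang(signed))

# Every byte of the BOM has its high bit set, so it is never plain ASCII.
print(all(b & 0x80 for b in BOM))

# A reader that knows about the signature gets the original text back.
print(signed.decode("utf-8-sig") == script)
```

The signed copy is recognisable as UTF-8 from its first three bytes, but
for the same reason its first two bytes are no longer "#!", which is
exactly the interference the FAQ warns about.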

In fact, when I want to print a text file which contains non-Latin1 text
(for instance, a text in French with one word in Hebrew and one sentence
in Greek) I pass it as UTF-8 with BOM to my browser (which prints it
flawlessly) because AFAICT Vim's ":hardcopy" command doesn't work in
that case.

@bill lam: I am not at all convinced that "everyone on Linux uses UTF-8"
and also not that "no one uses it with a BOM". These assertions sound to
me like wishful thinking, overgeneralizations, and the same sort of
"that's what I do therefore no one does otherwise" which led to (for
instance) the disappearance of the throbber link in Firefox and
Thunderbird a few versions ago. I am convinced that many people on Linux
still occasionally use the "vim-minimal" Tiny version of Vim distributed
under the executable name "vi" by RedHat, SuSE, and maybe others: that
editor is compiled with -multi_byte and cannot handle UTF-8 (nor can it
use CJK scripts, which may perhaps be why you aren't aware of it).

As for using UTF-8 with BOM, I have no statistics about what other
people do, but I have found it to be (as the FAQ quoted above says) an
excellent signature to mean that a file is in UTF-8. This ought not to
conflict with shell scripts, which cannot have any BOM but are
(normally) in 7-bit ASCII. The only place where I can imagine using
UTF-8 in a shell script is in a text literal passed to a command (most
typically to the "echo" command) but I'm aware that some people (though
maybe it was on Windows) use Russian or Chinese in filenames (I prefer
to stay with ASCII).

Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
163. You go outside for the fresh air (at -30 degrees) but open the
window first to hear new mail arrive.

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
