2012/7/10 Naena Guru <naenag...@gmail.com>

> I wanted to see how hard it is to edit a page in Notepad. So I made a copy
> of my LIYANNA page and replaced the character entities I used for Unicode
> Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced
> me to save the file in UTF-8 format. I ran it through W3C Validator. It
> passed HTML5 test with the following warning:
>
> [image: Warning] Byte-Order Mark found in UTF-8 File.
>
> The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause
> problems for some text editors and older browsers. You may want to consider
> avoiding its use until it is better supported.
>
> The BOM is the first character of the file. There are myriad hoops that
> non-Latin users go through to do things that we routinely do. This problem
> I saw right at the inception. I already know why romanizing is so good.
> Don't you?

You should probably ignore this non-critical warning now; it only matters for extremely strict compatibility with deprecated software that should have been updated long ago for obvious security and performance reasons. Those old browsers are disappearing fast, because security attacks spread massively and quickly, and automatic security updates now close the holes completely (instead of relying only on preventive virus detection based on code behavior or code patterns, which will never be complete or fast enough to react to these extremely frequent attacks).

Older editors do not have the comfort that newer editors have. The memory usage of these newer editors is no longer a problem (notably for web developers, whose systems are largely above what their average users have), and systems capable of running them have never been so cheap. In addition, memory and storage costs have dramatically decreased.

We are more concerned about bandwidth usage, so your web editing platform should include an optimisation process and converters that will automatically use a compact representation: numeric character references, for example, can be sent by your server as raw UTF-8, and the server can now also support on-the-fly data compression over HTTP sessions. There also exist frontend proxies that will do that for you without requiring you to change the development/editing methods you use.

Most text editors, even on Linux, can now successfully open UTF-8 files starting with a BOM without complaining, just as Notepad has done for a long time. And they allow you to change this mode before saving.

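To make the point about compact representation concrete, here is a minimal Python sketch of what such a converter could do: it turns numeric character references into raw characters and compares the byte counts. The sample string is a made-up illustration (the word "Sinhala" spelled in Sinhala letters), not taken from the LIYANNA page, and a real converter would have to leave markup-significant references such as &lt; and &amp; alone.

```python
import html

# Hypothetical sample: five Sinhala characters written as numeric
# character references, the way an ASCII-only page would carry them.
ncr_text = "&#3523;&#3538;&#3458;&#3524;&#3517;"

# html.unescape replaces each reference with the character it names.
raw_text = html.unescape(ncr_text)

ncr_bytes = ncr_text.encode("ascii")    # what the page costs as references
utf8_bytes = raw_text.encode("utf-8")   # what it costs as raw UTF-8

print(len(raw_text), "characters")
print(len(ncr_bytes), "bytes as numeric character references")
print(len(utf8_bytes), "bytes as raw UTF-8")
```

Each decimal reference here takes 7 bytes, while the same Sinhala character takes 3 bytes in UTF-8, so the raw form is less than half the size even before any HTTP compression is applied.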
Most text processors will silently discard the U+FEFF character (it should be safe to do that everywhere, given that U+FEFF should no longer be used for anything other than a BOM).

[side note]
But Notepad has had another problem for a long time: it cannot correctly open a text file whose lines are terminated by LF only; it absolutely wants them to be converted to CR+LF sequences. This problem is much more severe than the use of a leading BOM.

Likewise, Excel cannot successfully decode a UTF-8 encoded CSV file when opening it directly, but it can automatically recognize the encoding if you use the "import data" function instead. This is inconsistent. (Also, on a non-English user locale, it still does not allow specifying how to convert numbers that use dots instead of commas, so you need to manually use a search/replace function; it does not allow selecting the date format for CSV file imports, and search/replace operations are not trivial on date fields; no question is asked to the user, it only uses implicit defaults even when they are wrong, which is most of the time for actual CSV files.)
[/side note]

But this has nothing to do with your problem of romanization or behavior with Latin. BOMs are only absent from old 8-bit character sets that are no longer recommended in any modern Internet protocol, and from 7-bit ASCII, which is used only for internal technical data but not for any text intended to be read and translated. Only UTF-8 support is mandatory now. And that's fine.

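As an illustration of how cheaply a program can tolerate both the leading BOM and the LF-only line endings mentioned above, here is a small Python sketch. The helper names decode_utf8 and to_crlf are invented for this example; the "utf-8-sig" codec is part of the Python standard library and drops a leading BOM when one is present.

```python
import codecs

def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, discarding a leading BOM if there is one."""
    return data.decode("utf-8-sig")

def to_crlf(text: str) -> str:
    """Normalize LF and CR+LF line endings to CR+LF (lone CRs are untouched)."""
    return text.replace("\r\n", "\n").replace("\n", "\r\n")

# The same text with and without a BOM decodes to the same string.
with_bom = codecs.BOM_UTF8 + "Pāli\nසිංහල\n".encode("utf-8")
without_bom = "Pāli\nසිංහල\n".encode("utf-8")

same = decode_utf8(with_bom) == decode_utf8(without_bom)
print("BOM makes no difference:", same)

# LF-only text converted for a tool that insists on CR+LF.
crlf_text = to_crlf(decode_utf8(with_bom))
print(repr(crlf_text))
```

A reader written this way accepts files saved by Notepad and files saved without a BOM alike, so the warning from the validator does not require any change on the authoring side.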
HTTP headers or URLs require a specific encoding, but web servers and design tools can take care of that. Everything else is optional and will require explicit metadata. The exceptions are UTF-16 and UTF-32, which are not well suited for interchange across heterogeneous networks and independent realms, but are used mostly for internal processes, for which you absolutely don't need any byte order change, and so don't even need any BOM. If there is one, you can safely discard it from the input strings, adjusting the length and offset positions in the source if that source is randomly seekable. You don't need to adjust these lengths and/or positions if the source is a serial input stream that is not seekable backward, or that cannot be sought forward in a fast, direct manner without reading all intermediate positions.
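For the seekable case, discarding a BOM and adjusting the offset could look like the Python sketch below. The function skip_bom and the BOMS table are names made up for this example; the BOM constants themselves come from the standard codecs module. Defaulting to UTF-8 when no BOM is found is an assumption of the sketch, not a rule.

```python
import codecs
import io

# Check the 4-byte BOMs first: the UTF-32 LE BOM begins with the same
# two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def skip_bom(stream, default="utf-8"):
    """Leave a seekable binary stream positioned just past any BOM.

    Returns (encoding, bom_length); bom_length is the offset adjustment
    to apply to positions measured from the start of the source.
    """
    start = stream.tell()
    head = stream.read(4)
    for bom, encoding in BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return encoding, len(bom)
    stream.seek(start)  # no BOM: rewind and assume the default encoding
    return default, 0

src = io.BytesIO(codecs.BOM_UTF16_LE + "abc".encode("utf-16-le"))
encoding, offset = skip_bom(src)
text = src.read().decode(encoding)
print(encoding, offset, text)
```

On a non-seekable serial stream the same check would simply consume the BOM bytes from the front of the input, with no positions to adjust afterwards.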