Re: Unicode and Encoding Problems in Browsers
I'd like to mention that this problem which Muhammad Asif brings up is a live one in my line of work. I work as a PC technician, and one complaint I often get in tech support calls is that the user is unable to type Hebrew in the Search box of the MSN Israel website (msn.co.il) under Windows XP.

At first I told the user to set the Language for Non-Unicode Programs (known as the System Locale in Windows 2000, which sets the emulated ANSI codepage), but it didn't help: the user still complained of seeing boxes instead of proper Hebrew letters. The encoding of msn.co.il is Hebrew (Windows).

It doesn't happen on all machines. Mine at home runs XP too, but I don't have that problem. I suspect it's not related to Unicode/encoding issues at all. The fact that it appears only under XP (and not 2000 or 98, for instance) leads me to believe it may have something to do with the Java VM (which is lacking by default in XP and updates browser components when installed).

I hope that is of some enlightenment.

ST
OT: Haikus for Unicode-Haters
Unicode is shit!
What a dreadful encoding.
Who thought up this crap?

UTF-16
Has those pesky surrogates
Very bad design.

Arabic shaping
Difficult to implement
It's a complex script.

One should circumvent
Endian related issues.
UTF-8 does.
Re: OT: Haikus for Unicode-Haters
You're right, but neither Mongolian nor Indic fits the 5-7-5 syllable constraint of haiku. Ben-ga-li-Sha-ping maybe? :-)

But anyway, as I've been reading in Thomas Milo's (DecoType) paper on Arabic, recently referred to here, Arabic typography isn't so simple once you get out of the simplified printing-Arabic paradigm. I have been using Arabic on computers since 1993, on Accent Software's word processor Dagesh (a multiscript word processor for Windows 3.x). The shaping mechanism for Arabic hasn't changed since. And I read that this implementation goes back to the Apple Mac Arabic word processor Al-Kaatib Ad-Dawli, in the late 1980s.

ST
Re: Arabic Presentation Forms
> Do you have any suggestions on how I could convert a piece of Unicode
> text in this manner? Are there any programs that could do this?

Roman Czyborra's arabjoin (a Perl script): http://czyborra.com/arabjoin/

It does the conversion to Arabic Presentation Forms. It also converts logically-ordered Arabic to visual order (which may not be what you need); this is for display on systems that support neither BiDi nor Arabic shaping.

ST
Q: Any Unicode Qur'an extant?
Hello Unicoders. I'd like to know if there is any text version (Unicode) of the Arabic Qur'an. I don't expect there to be an exact book-copy version with all the cantillation frills; what I'm asking is whether a Qur'an of Arabic letters and Arabic tashkeel (vowel-pointing) alone is available. Has this important project already been carried out? Thanks in advance.

-- Shlomi Tal שלומי טל (my name in UTF-8 encoded Hebrew)
Re: Any Unicode Qur'an extant?
Thank you. I'd be glad to know when it's finished.

-- Shlomi Tal שלומי טל (my name in UTF-8 encoded Hebrew)
Re: Teletext
Teletext uses VERY old encoding technology in general. I don't know if it's true for other languages, but Hebrew teletext encodes the Hebrew letters using the 7-bit SI-960, which maps the Hebrew letters in place of the lowercase Latin letters (positions 0x60 to 0x7A). In Hebrew teletext you get the following unmodern practices:

1. 7-bit encoding, which allows only uppercase Latin letters to be used in the mixed Hebrew/English mode. Compare Russian KOI-7 and Greek ELOT 927, which are like Hebrew SI-960 in mapping the non-Latin alphabet on top of the lowercase letters.

2. No bidirectional algorithm. The display mechanism is limited to monodirectional LTR, necessitating the use of visually encoded Hebrew (that is, Hebrew written monodirectionally LTR; see my Hebrew FAQ for a longer explanation). This needs to be inverted to logical order when converting to Unicode.

-- Shlomi Tal שלומי טל
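As a rough illustration (not an official converter), that letter mapping can be sketched in Python. I'm assuming the 27 Hebrew letter forms run alef-to-tav at 0x60-0x7A, in the same order as the Unicode Hebrew block; the function name is mine:

```python
def si960_to_unicode(data: bytes) -> str:
    """Decode 7-bit SI-960-style teletext bytes to Unicode.

    Assumption: positions 0x60-0x7A carry the 27 Hebrew letter forms
    (alef..tav, including finals) in place of the lowercase Latin
    letters, in the same order as U+05D0..U+05EA.
    """
    out = []
    for b in data:
        if 0x60 <= b <= 0x7A:
            out.append(chr(0x05D0 + (b - 0x60)))  # Hebrew letter
        else:
            out.append(chr(b))  # 7-bit ASCII passes through
    return "".join(out)
```

The result is still in visual order, so it would then have to be inverted to logical order, as noted above.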
Re: Teletext
Lars Marius Garshol [EMAIL PROTECTED] wrote:

> This reminds me: does anyone have any pointers to information on how
> to convert visually encoded text (especially HTML, but also other
> formats) to Unicode?

There are programs that do it on the fly for Hebrew. The best, which I have used myself, is HebTML, available for free download from http://www.billy.co.il . The author has been working with me on testing a new version that supports Unicode. However, I use this app much less than before, because the Hebrew Internet is rapidly making the transition from visual to logical ordering. With IE 5.x and Mozilla supporting logical Hebrew, the years-old visual order is on the way out.

The conversion of visual to logical text in BiDi scripts is straightforward: check the BiDi property of each character, and if it is RTL then reverse. That means Hebrew letters reverse their order, while digits and Latin letters stay the same. Things get more complicated, however, when hyphens, paired punctuation and telephone numbers appear. You need a smart converter for that.

In essence, visually ordered Hebrew is a kludge for supporting Hebrew on platforms that weren't designed for it. In other words, it is an adaptation of Hebrew text to monodirectional LTR platforms. In modern software the onus of directionality passes to the software.

-- Shlomi Tal שלומי טל
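A minimal Python sketch of that rule (the function name is mine): reverse the visual line, then restore the original order of each non-RTL run so digits and Latin text read correctly. It deliberately ignores the hyphen and paired-punctuation cases that need a smarter converter:

```python
import unicodedata

def visual_to_logical(line: str) -> str:
    """Naive one-line visual-to-logical conversion: reverse the whole
    line, then re-reverse every run of non-RTL characters so that
    digits and Latin letters keep their order. Neutrals (spaces,
    punctuation) are lumped into the non-RTL runs, which is exactly
    the simplification a smart converter must improve on."""
    def is_rtl(ch: str) -> bool:
        return unicodedata.bidirectional(ch) in ("R", "AL")

    out, run = [], []
    for ch in line[::-1]:
        if is_rtl(ch):
            if run:
                out.append("".join(run)[::-1])  # restore LTR run order
                run = []
            out.append(ch)
        else:
            run.append(ch)
    if run:
        out.append("".join(run)[::-1])
    return "".join(out)
```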
Re: chcp 10000 (was: Filesystems)
Markus Scherer wrote:

> Hi Shlomi, [sending to the list] The number 1 in chcp 1 on Windows
> is, I assume, a magic number. It switches the command prompt into
> 16-bit-Unicode mode (= UTF-16 encoding form). All I can say is that
> this works, and has worked at least since NT 4.

Not in my case, it doesn't - neither in Windows 2000 in the past, nor now in XP. chcp 1 definitely switches me to the Macintosh Roman charset, perhaps because I have all the codepage conversion tables installed. Look in Regional Options, Advanced: 1 is explicitly Mac-Roman.

I do manage to work in UTF-16 through the command line, though; not by chcp, but by launching the command line in UTF-16 mode: cmd /u (without the /u it is in ANSI mode). Plus, UTF-8 is available by doing chcp 65001. Strange.

-- Shlomi Tal שלומי טל
Q: Filesystem Encoding
Hello Unicoders, I have a question about filesystems. I never use anything but ASCII characters in filenames, and I would like to know if it is still justified. Of the various filesystems in use, I know only that the Joliet CDFS uses UCS-2BE. What about FAT16, FAT32, NTFS and Linux Ext2? In short: should I still stick to ASCII alone in filenames, or are there filesystems where I really don't have to anymore? Thanks in advance.
The irony of it (was Re: Can browsers show text? I don't think so!)
The irony of it: Linux users are much better organized, font-wise, than Windows users, thanks to Markus Kuhn's ISO 10646 X11 fonts, which come with the XFree86 4.0 distribution. I have yet to find Ethiopic or Cherokee anywhere on a default Win2000/XP install. So Mozilla on Linux displays all characters fine - except those which need complex rendering that X11 can't support: Arabic and Indic.

  Marvelst thou not how matter combineth
  And assembles itself in wonderful shapes?
  Protons, electrons move of their own accord:
  The atoms are arranged at no-one's behest!

  http://www.geocities.com/stmetanat/
Vim 6 - int'l support on any Windows platform!
Hello Unicoders! I've just done a test run of Vim (vi improved) version 6.1 on a localized Hebrew MS-Windows 98 Second Edition. I use Vim on my own Win2K machine, but it was no surprise that it worked there, because Win2K supports Unicode throughout. It was on the Hebrew Win98 that I got a real uplift: by setting the encoding to UTF-8 and changing the keymap, I could write not just English, and not only Hebrew, but also languages unsupported on that platform, such as Greek and Russian! Saving to a file naturally wrote the international characters in UTF-8. What a good way to get more international support on a system when you need it. Kudos to Bram Moolenaar and all the other Vim programmers.
XTF-3 Description, Advantages/Drawbacks
OK, the eXperimental Transformation Format goes thus (I didn't make it clear enough before): C0, G0, G1 and NBSP (0xA0) stay the same: a single byte. All Unicode characters from U+00A1 onwards are encoded in three bytes, the first of which is in the range C2..FE, the other two in A1..C1. Thus U+00A1 = 0xC2 0xA1 0xA1.

Advantages:
1. ASCII compatibility.
2. C1 compatibility.
3. Can be reduced to a 7-bit SI/SO scheme with no control-code overlap, thus being a UTF-7 without the real UTF-7's chief disadvantage of no synchronization.

Disadvantages:
1. No simple way of filling bits like UTF-8's 110x 10xx. I suppose this brings us back to UTF-1's modulo complexities...
2. 3 bytes for all Unicode characters above U+00A0.
3. UTF-16 surrogate piggybacking - 6 bytes per outside-BMP codepoint. Really yucky, but those characters are rare.
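To check that the ranges actually cover the BMP, here is a sketch encoder in Python. The byte ranges and the U+00A1 = C2 A1 A1 example come from the description above; the exact base-33/base-61 value assignment is my own guess at one consistent layout:

```python
def xtf3_encode(text: str) -> bytes:
    """Sketch of the 'XTF-3' scheme described above. Bytes 0x00-0xA0
    pass through unchanged; everything from U+00A1 up becomes three
    bytes: a lead byte in 0xC2-0xFE (61 values) and two trail bytes in
    0xA1-0xC1 (33 values each). The value-to-byte assignment here is
    an assumption; the post only fixes the ranges and the U+00A1 case."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xA0:
            out.append(cp)  # C0, G0, G1 and NBSP stay single bytes
        else:
            v = cp - 0xA1
            out += bytes([0xC2 + v // (33 * 33),
                          0xA1 + (v // 33) % 33,
                          0xA1 + v % 33])
    return bytes(out)
```

The lead byte range gives 61 values and each trail byte 33, so 61 x 33 x 33 = 66,429 sequences, enough for the 65,375 code points from U+00A1 to U+FFFF.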
Re: Lost in translation
Surprisingly to some, Unicode won't do much to solve this problem. It will make it much easier to store, exchange, and query Arabic-script text. But people who can't read the Arabic script will continue to need Latin transcriptions.

However, Unicode does make transcription much easier, if you have an implementation that supports combining marks. Finally I can distinguish between front Teh and back (velarized) Tah by putting a dot under the latter, write pharyngeal h with a dot below, use glottal marks for the two glottal consonants, and so forth. Pity I only have them in Lucida Sans Unicode and Arial Unicode MS - Times New Roman lacks some of the combining marks.

(By the way, re: Uniconv - I recall Roman Czyborra mentioning it as the charset conversion module in Gaspar Sinai's Yudit editor.)
Re: How is UTF8, UTF16 and UTF32 encoded?
The best non-technical introduction I've seen for UTF-8 is "The Properties and Promizes (sic) of UTF-8" by Martin Dürst, here:

http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

And another good easy introduction is Richard Gillam's Unicode Demystified, here:

http://www.concentric.net/~rtgillam/pubs/unibook/

Look into chapter 6, Encoding Forms, which has useful illustrations of the UTFs.
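For a hands-on feel of the bit layout those references describe, here is a small Python sketch (mine, not taken from either source) that packs one code point into UTF-8 by hand: the lead byte marks the sequence length, and each trail byte carries six payload bits under 10xxxxxx:

```python
def utf8_encode_cp(cp: int) -> bytes:
    """Encode a single code point as UTF-8, following the standard
    bit layout: 0xxxxxxx / 110xxxxx 10xxxxxx / 1110xxxx 10xxxxxx
    10xxxxxx / 11110xxx plus three trail bytes."""
    if cp < 0x80:
        return bytes([cp])                       # 1 byte, plain ASCII
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])       # 2 bytes
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])       # 3 bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])           # 4 bytes, outside BMP
```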
(informative) Explanation of Microsoft Windows Text-File Modes
Another FAQ-like essay of mine. Requests for corrections welcome.

-

Explanation of Microsoft Windows Text-File Modes
by Shlomi Tal ([EMAIL PROTECTED])

Contents
1. Concepts
2. ANSI Mode
3. Unicode Mode
4. UTF-8 Mode

--

Preliminary note: Windows 9x is shorthand for Microsoft Windows 95, 98 and ME; Windows XP is shorthand for Microsoft Windows NT 4.0, 2000 and XP.

1. Concepts
^^^

The more legacy-free line of Microsoft Windows operating systems is designed to use Unicode for all text internally, with provision of other representation modes for interoperability with other environments. The modes are specifically those that appear in the Windows XP text editor (Notepad), but they apply as general concepts.

Text files can be divided according to the bit-stream representation they have, and according to the repertoire of characters they can potentially hold. Bit-stream representation is the number and order of bits and bytes encoding the text. Repertoire determines what characters are legal to use in a text file. Bit-stream and repertoire are closely linked, though the relations are not always straightforward.

Microsoft Windows can handle text in at least one of three modes:

1. 8-bit stream with a 256-character repertoire
2. 16-bit stream with a 65536-character repertoire
3. 8-bit stream with a 65536-character repertoire

The first is the only option for Windows 9x, and the second is the native internal mode of Windows XP. The first involves switching the repertoire by changing 8-bit codepages, whereas the second has a fixed 16-bit repertoire. The third mode is a hybrid, combining the 65536-character repertoire with a single extended 8-bit codepage.

2. ANSI Mode
^^^

The oldest mode for text files in Microsoft Windows, and the only option for the Windows 9x family, is ANSI mode, in which the system recognizes 256 characters. Half of these (the ASCII range, 00 to 7F) are constant, and the other half (80 to FF) change according to the particular language version of the system.
ANSI mode enables the use of only two scripts: Basic Latin plus one more codeset. Other codesets cannot be used in ANSI mode without changing the codepage (which, as regards Windows 9x, means installing a different version of the operating system). In this area there is a notable difference between the enabled and the localized versions of Windows 9x. Enabled means supporting a codepage and input methods that make it possible to write in a particular language. For example, the US version of Windows 9x is also French-enabled, for it has characters for French in the second half of its codepage (CP1252 in this case). Localized means that the whole interface has been translated into a different language. A localized version is inherently enabled, and there are more different localized versions than enabled versions.

The practical consequence of ANSI mode is that text files are not viewed uniformly across operating system versions when characters from the second half of the codepage are used. For example, German o-umlaut (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) will appear as Hebrew Tsadi (U+05E6 HEBREW LETTER TSADI) when the text file containing it is opened on an enabled or localized Hebrew Windows 9x system. This is because German o-umlaut occupies the same integer in the codepage map of CP1252 as Hebrew Tsadi does in CP1255 (the MS-Windows Latin/Hebrew codepage). There is no way of entering o-umlaut in a Hebrew Windows 9x version except through special applications.

Windows XP abandons ANSI mode and uses Unicode mode instead (see next), but for compatibility with Windows 9x and other codepage-based environments it emulates ANSI mode for one codepage at a time. That is, a system locale or system default language option is chosen to determine which one of the 8-bit codepages Windows XP supports.
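The o-umlaut/Tsadi collision is easy to verify with any codepage-aware tool; for instance, with Python's built-in codec tables (a quick check of mine, not part of the FAQ):

```python
# The same byte decodes to different letters under the two ANSI codepages:
b = b"\xf6"
assert b.decode("cp1252") == "\u00f6"  # LATIN SMALL LETTER O WITH DIAERESIS
assert b.decode("cp1255") == "\u05e6"  # HEBREW LETTER TSADI
```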
This has the consequence that, for example, German o-umlaut will appear as Hebrew Tsadi if it is in an ANSI-mode text file and the system default language is set to Hebrew (more exactly, to CP1255). All Windows 9x applications running on Windows XP will exhibit such behaviour. This applies mainly to the interface (menus, captions) of applications.

Windows XP does not use ANSI mode internally, but it can save an external representation in a text file by saving it as ANSI. The file will be saved intact only on condition that it does not contain any character outside the system's default ANSI codepage. If it does, then Notepad will trigger a warning to save as Unicode instead, and saving anyway will corrupt the original data (by transcoding or conversion to question marks).

3. Unicode Mode
^^^

Windows XP handles text internally as UTF-16 (16 bits per character, plus support for surrogates from Windows 2000 onwards), and can store text as UTF-16 in either little-endian or big-endian byte order. The native byte order for the Intel x86
Re: Emoticons
Doug Ewell wrote:

> The smiling face and frowning face have fairly obvious value as
> emoticons. I use U+263A (in its UCN form, \u263a) sometimes when
> posting to this list. A winking face and a surprised or shocked face
> could arguably be useful as well. But once you get past those four,
> there's not much left except glyph variants

In fact, of the three emoticons now extant, I use only the white smiling face. I don't see any special point in using the black smiling face (it's there because of CP437, I believe), and as for the white frowning face, it isn't in Times New Roman and other common WGL4 fontsets. Lucida Sans Unicode and Arial Unicode MS are not universal enough to trust. The only Unicode symbols I trust are those of the Times/WGL4 set, such as the male sign, female sign, card signs and so forth. Astrological symbols, for example, are out - I can print them on my laser printer from the Lucida Sans Unicode font, but I can't expect them to appear properly in everyone's browser.

> Some people believe that encoding certain entities (Klingon comes to
> mind) would bring great embarrassment to Unicode and cause people not
> to respect it or take it seriously. That's how I feel about encoding
> additional smileys.

True, and yet there are so many symbols included for compatibility purposes which are otherwise not useful. The box shades (dark, medium, light etc.) make sense when you think of compatibility with CP437 and other terminal implementations, but I can't think of another situation where they might be useful. (The box-drawing lines, on the other hand, are still useful even outside the terminal-graphics context.)

I think of the benefits of Unicode in terms of what more characters are available. I remember how hard it was, back in 1993 or so, to transliterate phonetic writing with all the macrons and combining dots, and now it's much easier.
And :-) is just like "ae" in that respect: just as the latter used to be a hack for when you weren't sure you could get the diaeresis through, the former is the hack for those systems where you couldn't rely on universal CP437 display. Hacks such as those are where humans begin serving the machines instead of the other way round. I find it detestable.
Emoticons
Branching off from the subject of symbol encodings, I wondered about the application of emoticons in the Miscellaneous Symbols block. Even though I know characters such as the white smiling face were included for compatibility with DOS CP437 and its offshoots, the white frowning face wasn't in CP437. Pike and Thompson's paper on the Plan 9 Unicode conversion ("Hello World or Kalimera Kosme or Konichiwa Sekai") says this:

--- QUOTE ---
Although we converted Plan 9 in the altruistic interests of serving foreign languages, we have found the large character set attractive for other reasons. The Unicode Standard includes many characters — mathematical symbols, scientific notation, more general punctuation, and more — that we now use daily in our work. We no longer test our imaginations to find ways to include non-ASCII symbols in our text; why type :-) when you can use the character ☺?
--- UNQUOTE ---

So that emoticon, far from its original use as a compatibility character for CP437 (much like the box-drawing symbols), is actually useful for regular, everyday application. And since emoticons are very useful, and not just compatibility hacks, why not add a few more to the Misc Symbols set? A white winking face, for example? I already use the white smiling face on discussion boards, as an HTML NCR, and it's smashing. Wouldn't a few more be useful?

Just my thoughts...

Shlomi Tal
Author of The Guide To Hebrew Computing
http://www.pcphobia.co.il/hebcomp/
Re: Welcome to list 'unicode'
deepak wrote:

> i have a question.. i have a word editor (say MS Word).. does MS Word
> have unicode compatibility... if not then how do i make it compatible
> to the unicode standard??
>
> regards, deepak

MS Word 2000 and upwards uses Unicode (to be more specific, UTF-16 little-endian). Earlier versions (97 and downwards) are still based on codepages, according to their language versions (for example, Hebrew MS Word 97 stores strings in the Windows-1255 codepage; other languages can be typed and saved if you have the input method, as I did when I had Word 97 on my Win2K system, but they are stored in an easily-corrupted extended encoding. I upgraded to Word 2000 after Word 97 had corrupted all my Arabic text).
Re: Variations of UTF-16
> But a BOM in every UTF-16 plain text file would make this completely
> hopeless. If we ever think we might want to do UNIX-style text
> processing on UTF-16, we have to resist that!

If you're going to take the trouble of making text tools 16-bit aware, then you can afford to make them BOM-aware too.

  type a.txt b.txt c.txt > d.txt

on Windows 2000, assuming they are all UTF-16 (with an FF FE at the beginning of each, as is usual in MS-Windows Unicode files), strips every BOM except the first, so that d.txt has only the usual single initial FF FE. So it's not an immovable obstacle.

Concerning text files: nearly all of the plain-text Unicode I've ever seen is in UTF-8. However, the ubiquitous MS-Office documents, from Office 2000 onwards, are all in UTF-16 (little-endian, without a BOM).
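A BOM-aware concatenation of that kind is simple to write; here is a Python sketch (the helper name and the assume-LE-when-unmarked policy are mine):

```python
import codecs

def cat_utf16(paths, out_path):
    """Concatenate UTF-16 text files into one UTF-16LE file carrying a
    single leading BOM, like a BOM-aware 'type a.txt b.txt > d.txt'."""
    chunks = []
    for p in paths:
        with open(p, "rb") as f:
            raw = f.read()
        if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
            # the 'utf-16' codec consumes the BOM and picks the byte order
            chunks.append(raw.decode("utf-16"))
        else:
            chunks.append(raw.decode("utf-16-le"))  # assume LE when unmarked
    with open(out_path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + "".join(chunks).encode("utf-16-le"))
```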
Re: Please help: Unicode sig in Hotmail
The sig is one of the situations where UTF-8 transfer hasn't worked for me. Normally I use UTF-8 in mails to transfer texts in Hebrew and Arabic, and it passes with no problem: I just switch IE to UTF-8 and it passes the bytes as they should be. Maybe those symbols are confusing the browser? As you know, the symbols (male sign, female sign, black heart) are mapped onto control characters in CP437. Could it be the browser is interpreting them as such for compatibility? Particularly intriguing is that the black heart causes problems every time, because it corresponds to Control-C (interrupt).

Pity about Hotmail's lack of support for specifying UTF-8 transfer. This quite diminishes the advantage of web-based mail (I can't use Outlook Express with my account set up everywhere...).
Please help: Unicode sig in Hotmail
I've built a UTF-8 sig for my outgoing messages:

|---------------------|
| a BOY     ♂  ...    |
| a GIRL    ♀  ...    |
| they MEET ♂♀ ...    |
| HERE WE GO!         |
|        ♂♥♀          |
|---------------------|

with Unicode symbols from the U+26xx block. However, it doesn't show up at all: neither in Compose, nor when I send a message to myself, nor when I send a message to someone else. Please tell me how I can put it right. Thanks in advance.
Hebrew Computing FAQ
I have a website at http://www.pcphobia.co.il/hebcomp/ called The Guide to Hebrew Computing, which is meant for native users of Hebrew and is therefore entirely in Hebrew (in two versions: UTF-8 encoded logical Hebrew and ISO-8859-8 encoded visual Hebrew). For the basic questions about Hebrew, especially about the difference between visual and logical which people have asked me about after seeing those options in Mozilla and Internet Explorer, I have this FAQ, in English. Criticism and pointing out of errors gladly accepted.

--- BEGIN ---

Hebrew Computing FAQ
by Shlomi Tal ([EMAIL PROTECTED])

Contents:
1. What is the difference between ISO-Visual and ISO-Logical?
2. How was Hebrew used on MS-DOS?
3. What is special about MS-Windows Hebrew (windows-1255) encoding?
4. Review of Standards

-

1. What is the difference between ISO-Visual and ISO-Logical?
^

This question needs a long explanation going down to the very rudiments of human handwriting. ISO is just an encoding scheme; the difference between visual order and logical order has nothing to do with the encoding itself (i.e. the numbers assigned to each letter), but with the storage order of the numbers.

Let us review the writing of English text by hand. The hand holds the pen near the top-left corner of the paper and then moves rightwards constantly. When there is no more room on the paper to the right, the hand moves back to the left edge and slides one row lower than before, and then begins the rightwards movement again.

Writing Hebrew (and Arabic and other Semitic languages) by hand is a different matter. The hand holds the pen near the top-right corner of the paper and then moves leftwards. However, it moves leftwards only as long as the text is in Hebrew. If numbers (or English text) are to be written, the hand will move rightwards for them and then resume the leftwards movement for Hebrew text again.
In other words, writing Hebrew involves bidirectional (left-to-right and right-to-left) movement of the hand, in contrast to monodirectional English writing. Finally, upon running out of room to move leftwards, the hand moves back to the right edge and slides one row lower.

So much for human handwriting. Computers, however, know nothing about directions. The numbers representing human letters are stored sequentially on the media. Making them flow from left to right and move on to the beginning of the next line is the job of software. Since computer systems were designed around English, the screen-handling routines have a uniform, clear rule for mimicking the handwriting process: if a byte follows another byte, it will be presented on the screen as a letter to the right of the letter that the previous byte represents:

Sequential bytes:             0x48 0x65 0x6C 0x6C 0x6F
Letters displayed on screen:  Hello

In addition, for word-wrapping applications (such as text editors) there is a routine for going to the beginning of the next line when the row is full.

When it comes to displaying Hebrew on the screen, there is great difficulty. The display mechanisms of computers were originally designed for English, and can easily be accommodated to other left-to-right scripts, or even to a monodirectional right-to-left script by employing a simple display inversion, but Hebrew is bidirectional and more complicated to display (Arabic is even more complicated than Hebrew, but that's another story). There are two options for dealing with Hebrew text display:

1) Forcing Hebrew to conform to the constraints of English text display (i.e. treating Hebrew like a monodirectional LTR script).

2) Updating the display software to handle bidirectional display of Hebrew text in a way akin to its flow in handwriting.

The first option is simple, easy to implement and does not require large computing resources by the standards of early computing (which for Hebrew means from the 1960s to the early 1980s).
It requires only an encoding and a font mapping: numbers assigned to Hebrew letters, and Hebrew fonts for their display. However, it requires an effort on the part of the writer, since all text, including Hebrew letters, is written from left to right. Hebrew text must be typed with the last letter first, so that the left-to-right display of the text can form the illusion of natural Hebrew flow. There were a few mechanisms to aid writers, such as "push" input methods for typing the Hebrew letters the natural way (from right to left), but editing, sorting, copying and any other kind of manipulation remained a painful task.

The second option, implemented for Arabic first and then for Hebrew, requires more intelligent software, and therefore more resources. The method assigns an implicit directionality to each character: LTR for English letters and numbers, RTL for Hebrew letters, and neutral for punctuation marks. The Hebrew text is stored in the same sequential
MS/Unix BOM FAQ again (small fix)
A small fix for the FAQ; specifically, a fix for the typo/braino of construing 0x071F as little-endian 1F 70 instead of (the now fixed) 1F 07. Thanks to Wladislaw Vaintroub for pointing it out to me.

--- BEGIN ---

Microsoft Unicode Text File Byte Order Mark (BOM) FAQ
by Shlomi Tal ([EMAIL PROTECTED])

Contents
1. What is a BOM?
2. Why does it matter?
3. Is the BOM mandatory or optional?

-

1. What is a BOM?
^

BOM, or Byte-Order Mark, is a signature at the beginning of a Unicode text file. Since different processors handle sequences of bytes in different ways, the BOM is used to mark which byte order the text file was written in. Processors are either big-endian or little-endian: the former put the most significant byte first, and the latter put the least significant byte first. Thus the 16-bit number 0x071F is serialized as:

Big-endian:    07 1F
Little-endian: 1F 07

Obviously a code with the value 0x071F will be interpreted as 0x1F07 if it passes to a processor of the other byte order without information about its original state. This is what the Unicode BOM seeks to avoid. The Unicode standard permits the character U+FEFF (ZERO WIDTH NO-BREAK SPACE) at the beginning of the file as a mark of the byte order of the file. A Unicode text file beginning with FE FF is big-endian, and a file beginning with FF FE (not a legal Unicode character for any other purpose) is little-endian.

All this is relevant to the 16-bit and 32-bit encodings of Unicode characters - UTF-16 and UTF-32 respectively. Thus:

FE FF        is UTF-16 Big-Endian
FF FE        is UTF-16 Little-Endian
00 00 FE FF  is UTF-32 Big-Endian
FF FE 00 00  is UTF-32 Little-Endian

There is another, very common Unicode encoding scheme called UTF-8, which maps the Unicode repertoire into sequences of bytes. Since the order of bytes (as opposed to words of more than one byte) is the same on all processors, UTF-8 does not require a BOM. It can have one, though.
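That signature table translates directly into a sniffing routine. Here is a Python sketch of mine (checking the longer UTF-32 signatures first, since FF FE is a prefix of FF FE 00 00):

```python
import codecs

def sniff_bom(raw: bytes):
    """Identify a Unicode signature at the start of a byte stream.
    UTF-32 signatures are tested before UTF-16 because the UTF-16
    patterns are prefixes of them. Returns None when no BOM is found
    (a UTF-16LE file starting with U+0000 is inherently ambiguous)."""
    signatures = [
        ("UTF-32-BE", codecs.BOM_UTF32_BE),  # 00 00 FE FF
        ("UTF-32-LE", codecs.BOM_UTF32_LE),  # FF FE 00 00
        ("UTF-8",     codecs.BOM_UTF8),      # EF BB BF
        ("UTF-16-BE", codecs.BOM_UTF16_BE),  # FE FF
        ("UTF-16-LE", codecs.BOM_UTF16_LE),  # FF FE
    ]
    for name, sig in signatures:
        if raw.startswith(sig):
            return name
    return None
```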
In addition, a Unicode encoding scheme named UTF-7, which was meant as a mail-safe encoding but is now nearly obsolete, can have a BOM as well. Here too the BOM is not mandatory.

2. Why does it matter?
^^

It matters because Microsoft tools (most prominently Windows Notepad) regularly prefix the BOM to Unicode text files, whereas other systems and environments (Unix, Linux, web pages) are better off without the BOM, especially in the case of UTF-8 text files. Unix systems, for example, look for an initial #! in a shell script file in order to determine the interpreter for it. An initial BOM coming before the #! can easily disrupt this convention. Also, and this applies particularly to databases, and not only on Unix, the BOM can cause disorder when files are merged. Web pages usually use UTF-8, and although they can handle the BOM, it may appear as a strange character (a blank square or a question mark) in a browser that doesn't recognize it, and may also cause the above troubles when the file is saved to the local disk.

Most of the Unicode text meant for open transfer between various systems (and the Web) is encoded in UTF-8. Unix systems regularly create UTF-8 text files without the BOM, but Windows systems prefix the BOM as usual. Here follows an explanation of when the Unicode BOM can or cannot be removed from text files on Microsoft Windows systems.

3. Is the BOM mandatory or optional?
^^

Microsoft Windows, beginning with the Unicode-supporting operating systems Windows 2000 and Windows XP, can handle UTF-16 Little-Endian, UTF-16 Big-Endian, UTF-8 and the old 8-bit ANSI (Microsoft's non-standard name for its 8-bit Windows codepages, consisting of the ASCII repertoire for the first 128 characters and varying characters for the other 128). The native encoding for these systems is UTF-16 Little-Endian, which Notepad's Save dialog calls "Unicode". UTF-16 Big-Endian is called "Unicode Big-Endian", and UTF-8 keeps its name.
Upon saving a Unicode text file in Notepad, the BOM is always prefixed. Thus, opening such a file with a text editor which is not Unicode-aware (such as edit.com) or doing a hexdump on it, you will see UTF-16 Little-Endian (Unicode) starting with FF FE, UTF-16 Big-Endian (Unicode Big-Endian) starting with FE FF, and UTF-8 starting with the UTF-8 encoding of the BOM: EF BB BF. For the first two encoding schemes (UTF-16), the user MUST NOT remove the BOM manually. Removing the BOM using an external tool (such as edit.com) and then opening the file with Notepad will reveal a pile of gibberish. Then, saving the file will corrupt it beyond recovery. This is because the BOM is necessary for the system to read the 16-bit values as they are and ignore their values as 8-bit sequences. Without the BOM, an 8-bit sequence forming part of a 16-bit Unicode character will be given its special ASCII value, which may be a control character. Many