Another FAQ-like essay of mine. Request for corrections.

---------
Explanation of Microsoft Windows Text-File Modes
by Shlomi Tal ([EMAIL PROTECTED])

Contents

1. Concepts
2. ANSI Mode
3. Unicode Mode
4. UTF-8 Mode

----------------------------------------------------------------------

Preliminary note: Windows 9x is shorthand for Microsoft Windows 95, 98
and ME; Windows XP is shorthand for Microsoft Windows NT 4.0, 2000 and
XP.

1. Concepts
^^^^^^^^^^^

The more legacy-free line of Microsoft Windows operating systems is
designed to use Unicode for all text internally, while providing other
text representation modes for interoperability with other
environments. The modes described here are specifically those that
appear in the Windows XP text editor (Notepad), but they apply as
general concepts.

Text files can be divided according to their bit-stream representation
and according to the repertoire of characters they can hold.
Bit-stream representation is the number and order of bits and bytes
used to encode the text. Repertoire determines which characters are
legal in a text file. Bit-stream and repertoire are closely linked,
though the relation is not always straightforward.

Microsoft Windows can handle text in at least one of three modes:

1. 8-bit stream with 256-character repertoire
2. 16-bit stream with 65536-character repertoire
3. 8-bit stream with 65536-character repertoire

The first is the only option for Windows 9x, and the second is the
native internal mode of Windows XP. The first involves switching the
repertoire by changing 8-bit codepages, whereas the second has a fixed
16-bit repertoire. The third mode is a hybrid, making the
65536-character repertoire available through a single extended 8-bit
codepage.

2. ANSI Mode
^^^^^^^^^^^^

The oldest mode for text files in Microsoft Windows, and the only
option for the Windows 9x family, is ANSI mode, in which the system
recognizes 256 characters.
Half of these (the ASCII range, 00 to 7F) are constant, and the other
half (80 to FF) change according to the particular language version of
the system. ANSI mode therefore enables the use of only two scripts at
a time: Basic Latin plus one more codeset. Other codesets cannot be
used in ANSI mode without changing the codepage (which, on Windows 9x,
means installing a different version of the operating system).

In this area there is a notable difference between the "enabled" and
the "localized" versions of Windows 9x. "Enabled" means supporting a
codepage and input methods that make it possible to write in a
particular language. For example, the US version of Windows 9x is also
French enabled, for it has the characters for French in the second
half of its codepage (CP1252 in this case). "Localized" means that the
whole interface has been translated into a different language. A
localized version is inherently enabled, and there are more localized
versions than enabled versions.

The practical consequence of ANSI mode is that text files are not
viewed uniformly across operating system versions when characters from
the second half of the codepage are used. For example, German o-umlaut
(U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) will appear as Hebrew
Tsadi (U+05E6 HEBREW LETTER TSADI) when the text file containing it is
opened on an enabled or localized Hebrew Windows 9x system. This is
because German o-umlaut occupies the same byte value (0xF6) in the
codepage map of CP1252 as Hebrew Tsadi does in CP1255 (the MS-Windows
Latin/Hebrew codepage). There is no way of entering o-umlaut in a
Hebrew Windows 9x version except through special applications.

Windows XP abandons ANSI mode and uses Unicode mode instead (see the
next section), but for compatibility with Windows 9x and other
codepage-based environments it emulates ANSI mode for one codepage at
a time. That is, a "system locale" or "system default language" option
determines which one of the 8-bit codepages supported by Windows XP is
in effect.
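
The o-umlaut/Tsadi collision can be reproduced outside Windows. The
following is a minimal sketch in Python, assuming that Python's
built-in "cp1252" and "cp1255" codecs match the Windows codepages of
the same numbers; the variable names are illustrative only.

```python
# The same byte value, 0xF6, read under two different ANSI codepages.
raw = bytes([0xF6])

as_latin = raw.decode("cp1252")   # Western European codepage
as_hebrew = raw.decode("cp1255")  # Latin/Hebrew codepage

print(f"CP1252: U+{ord(as_latin):04X}")   # o-umlaut
print(f"CP1255: U+{ord(as_hebrew):04X}")  # Tsadi

# Bytes in the ASCII range (00 to 7F) read the same in both codepages.
ascii_bytes = b"Windows"
same_in_both = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp1255")
print(same_in_both)
```

The byte on disk never changes; only the codepage chosen to interpret
it does, which is why the same file looks different on two systems.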
The consequence of this emulation is that, for example, German
o-umlaut will appear as Hebrew Tsadi if it is in an ANSI-mode text
file and the system default language is set to Hebrew (more exactly,
to CP1255). All Windows 9x applications running on Windows XP will
exhibit such behaviour. This applies mainly to the interface (menus,
captions) of applications.

Windows XP does not use ANSI mode internally, but it can write text
out in that representation when a file is saved as "ANSI". The file is
saved intact only if it contains no character outside the system's
default ANSI codepage. If it does, Notepad warns the user to save as
Unicode instead, and going ahead with the ANSI save corrupts the
original data (characters are transcoded or converted to question
marks).

3. Unicode Mode
^^^^^^^^^^^^^^^

Windows XP handles text internally as UTF-16 (16 bits per character,
plus support for surrogates from Windows 2000 onwards), and can store
text as UTF-16 in either little-endian or big-endian byte order. The
native byte order for the Intel x86 processors is little-endian.

Unicode mode is not a codepage, but a totally different stream method
for text in Windows XP. For example, the command cmd /u opens a
command prompt in which the output of internal commands to a pipe or
file is UTF-16 little-endian.

Text in Unicode mode can contain any character, and can be converted
to any 8-bit codepage that covers its characters. A few scripts, such
as those for Hindi and Georgian, have no 8-bit codepage and are
Unicode only.

The meanings of bytes change when using Unicode mode. For example, the
byte pair 0x03 0xA1 (in big-endian order) denotes a single Greek
letter (U+03A1 GREEK CAPITAL LETTER RHO) instead of a control
character followed by a symbol. For byte pairs to be identified as
having this meaning, text files in Unicode mode (either "Unicode",
which means UTF-16 little-endian, or "Unicode big endian", which means
UTF-16 big-endian) must have a byte order mark (U+FEFF) prefixed to
them.
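
The role of the byte order mark can be sketched in Python. This is an
illustration under stated assumptions, not a description of how
Windows or Notepad is implemented: decode_text_file is a hypothetical
helper, and CP1252 is assumed as the fallback ANSI codepage.

```python
import codecs

RHO = "\u03a1"  # GREEK CAPITAL LETTER RHO

big_endian = RHO.encode("utf-16-be")     # bytes 03 A1
little_endian = RHO.encode("utf-16-le")  # bytes A1 03


def decode_text_file(data: bytes, ansi_codepage: str = "cp1252") -> str:
    """Hypothetical helper: choose a decoding from the leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: "Unicode"
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: "Unicode big endian"
        return data[2:].decode("utf-16-be")
    # No BOM: treat the bytes as 8-bit codepage text (assumed fallback).
    return data.decode(ansi_codepage)


with_bom = decode_text_file(codecs.BOM_UTF16_BE + big_endian)
without_bom = decode_text_file(big_endian)

print(repr(with_bom))     # one Greek letter
print(repr(without_bom))  # a control character followed by a symbol
```

With the BOM in place the two bytes are read as one 16-bit unit;
without it, the same two bytes are read as two separate 8-bit
characters.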
Removing the BOM results in a return to interpretation of the bytes as
8-bit codepage byte sequences, and may lead to corruption (see the
author's Microsoft/Unix BOM FAQ for further details).

4. UTF-8 Mode
^^^^^^^^^^^^^

UTF-8 mode is a hybrid: Windows XP treats it as an 8-bit codepage,
converting it to UTF-16 little-endian internally just as in ANSI mode,
but it allows the whole repertoire of Unicode characters. UTF-8 is a
codepage into which Unicode-mode text can be converted; it is not a
stream method by itself. That is, Windows XP does not support "UTF-8
mode" as such for any application. For example, the command prompt is
either in 8-bit codepage mode (when started with "cmd") or in 16-bit
Unicode mode (when started with "cmd /u"), but there is no UTF-8 mode
for the command prompt, although UTF-8 display and input are supported
through codepage 65001.

Windows XP does not provide a way of manipulating UTF-8 strings
directly; it supports UTF-8 by storing it externally (on disk) and
converting it to UTF-16 little-endian for all other operations. This
should not be a problem for interoperability with other environments,
but Unicode-enabled programs for Windows must use UTF-16, not UTF-8
(unlike on Unix).

Old 8-bit text manipulation tools such as MS-DOS edit.com can handle
UTF-8 strings without corrupting the file. This is not so with UTF-16,
where just saving the file can change its control-character byte
values (such as 0x00 NUL to 0x20 SPACE).

Unlike UTF-16 text files, UTF-8 files do not require a byte order mark
in order to be identified as such. Although Windows XP does prefix the
BOM as a regular procedure, it can be safely removed without
corrupting the file. UTF-8 text is identified in Windows XP
heuristically, that is, by the presence of legal UTF-8 sequences and
the absence of illegal sequences.
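
The idea of identifying UTF-8 by legal and illegal sequences can be
sketched in Python. This is only an approximation of the principle,
not the detection algorithm Windows XP actually uses; looks_like_utf8
and strip_utf8_bom are hypothetical helper names, and a strict UTF-8
decode stands in for the heuristic.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes contain only legal UTF-8 sequences."""
    try:
        data.decode("utf-8")  # strict decoding rejects illegal sequences
    except UnicodeDecodeError:
        return False
    return True


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


sample = "Gr\u00f6\u00dfe \u05e6"  # Latin and Hebrew in one string
utf8_with_bom = codecs.BOM_UTF8 + sample.encode("utf-8")
utf8_without_bom = strip_utf8_bom(utf8_with_bom)
ansi_bytes = "Gr\u00f6\u00dfe".encode("cp1252")  # same word as ANSI text

print(looks_like_utf8(utf8_without_bom))  # still identifiable without BOM
print(looks_like_utf8(ansi_bytes))        # 0xF6 is not a legal UTF-8 lead
```

Note that pure ASCII text passes such a check as well, since every
ASCII byte sequence is also legal UTF-8.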

