I’m in the early stages of writing an article on developing multilingual Web sites, to be published by Microsoft. I want to include a brief overview of Unicode. I must say that I have read a lot of contradictory information on the topic online from various sources. I’ve done my best to separate fact from fiction so that I can provide readers with an accurate introduction to the topic.
I would really appreciate some review of the following segment of the article (in first draft form) for accuracy. Any technical corrections or general enhancements that anyone may wish to offer would be much appreciated. Please be gentle in dispensing criticism as this is just a starting point.
Feel free to respond directly to me at [EMAIL PROTECTED] (as this topic probably doesn’t warrant group discussion and bandwidth).
Thanks very much, John
Unicode Fundamentals

For our discussion, there are two fundamental terms with which you must be familiar. First, a character set is an organized collection of characters; when each character is assigned a unique number, the result is more precisely called a coded character set (the collection of characters by itself is the character repertoire). Second, an encoding scheme is a system for representing those characters in a computing environment. Distinguishing between these two terms is crucial to understanding how to leverage the benefits of Unicode.

Before Unicode, most character sets contained only those characters needed by a single language or a small group of related languages (such as ISO 8859-2, which covers a number of Central and Eastern European languages). The popularization of the Internet elevated the need for a more universal character set. In 1989, the International Organization for Standardization (ISO) published the first draft of a character set standard designed to support a broad range of languages: ISO/IEC 10646, the Universal Multiple-Octet Coded Character Set (commonly referred to as the Universal Character Set or UCS). Around the same time, a group of manufacturers in the U.S. formed the Unicode Consortium with a similar goal of creating a broad multilingual character set standard. The result of their work is the Unicode Standard. Since the early releases of these two standards, both groups have worked together closely to ensure compatibility between them. For details on the development of these standards, see http://www.unicode.org/versions/Unicode4.0.0/appC.pdf.

When someone refers to Unicode, they are usually discussing the collective offerings of these two standards bodies (whether they realize it or not). Technically, this isn’t accurate, but it certainly simplifies the discussion. In this article, I will sometimes use the term Unicode in a generic manner to refer to these collective standards (with apologies to those offended by this generalization) and, when applicable, I will make distinctions between them (referring to the Unicode Standard as Unicode and the ISO/IEC 10646 standard as UCS).
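To illustrate the distinction between the character set and the encoding scheme, here is a small sketch (shown in Python purely for illustration; the original draft contains no code, and any Unicode-aware environment behaves the same way). A character has one identity, its code point, while its byte representation depends entirely on the encoding scheme chosen.

    # One character, one code point - but many possible byte representations.
    ch = "é"                              # LATIN SMALL LETTER E WITH ACUTE
    print(hex(ord(ch)))                   # 0xe9 -> its code point, U+00E9 (character set level)
    utf8_bytes = ch.encode("utf-8")       # b'\xc3\xa9'          (2 bytes)
    utf16_bytes = ch.encode("utf-16-be")  # b'\x00\xe9'          (2 bytes, different bytes)
    # Decoding either byte sequence recovers the very same character:
    print(utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be"))   # True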
On Character Sets and Encoding Schemes

First, you should recognize that both of these standards separate the character repertoire from the encoding scheme. Many people blur this distinction and suggest that Unicode is a 16-bit character set. In neither standard is this accurate. The number of bits used is associated not with the character set but with the encoding scheme. Character sets are based on code points (numeric values, conventionally written in hexadecimal), not on bits and bytes. So, to say that Unicode is represented by any particular number of bits is not correct. If you want to talk about bits and bytes, you need to talk about encoding schemes.

Each character in Unicode is represented by a code point, usually written as U+ followed by the code point in hexadecimal: the U stands for Unicode, and the hexadecimal number identifies a given character in the Unicode character repertoire. For example, the English uppercase letter A is represented as U+0041. One way of encoding this character would be with UTF-8, which encodes it as the single byte 0x41. Encoding the same character using UCS-2 produces the two bytes 0x00, 0x41 (in big-endian byte order). You can run the Windows charmap utility (if you are running Windows 2000, XP or 2003) to see how characters are mapped in Unicode on your system.

Basically, the Unicode Standard and the UCS character repertoires are the same (for practical purposes). Whenever one group publishes a new version of their standard, the other eventually releases a corresponding one. For example, the Unicode Standard, Version 4.0 corresponds to ISO/IEC 10646:2003. Hence, the code points are synchronized between these two standards. So, the differences between the two standards lie not in the character sets themselves, but in the standards they offer for encoding and processing the characters contained therein.

Both standards provide multiple encoding schemes (each with its own characteristics). A term frequently used in encoding scheme definitions is octet, which simply means an 8-bit byte. UCS provides two encoding schemes: UCS-2 uses two octets (16 bits) and UCS-4 uses four octets (32 bits) to encode each character. Unicode has three primary encoding schemes: UTF-8, UTF-16 and UTF-32. UTF stands for Unicode (or UCS) Transformation Format. Although you will occasionally see references to UTF-7, it is a specialized derivative designed to pass safely through systems, such as older e-mail transports, that cannot handle non-ASCII data; as such, it is not part of the current definition of the Unicode Standard.

One of the differences between the Unicode and UCS encoding schemes is that the former includes variable-width encodings and the latter does not. UCS-2 always uses exactly 2 bytes per character and UCS-4 exactly 4 bytes. Based on the naming convention, some people assume that UTF-8 is a single-byte encoding scheme, but this isn’t the case: UTF-8 uses variable lengths from 1 to 4 octets per character. Additionally, UTF-16 encodes characters in either 2-octet or 4-octet lengths, while UTF-32 always uses four octets. Also, you should be aware that the byte order can differ with UCS-2, UTF-16, UCS-4 and UTF-32. The two variations are known as big-endian (BE), in which the most significant byte comes first, and little-endian (LE), in which the least significant byte comes first (the names are borrowed from Gulliver’s Travels). See Figure 1 for a synopsis of these encoding schemes.
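To make the U+0041 example and the byte-order variations concrete, here is another short sketch (again in Python, chosen only for brevity). Note that Python exposes UTF-16 but no separate UCS-2 codec; for characters in the first 65,536 code points the two produce identical bytes.

    ch = "A"                        # U+0041 LATIN CAPITAL LETTER A
    print(hex(ord(ch)))             # 0x41 -> the code point
    print(ch.encode("utf-8"))       # b'A'              -> 0x41 (one octet)
    print(ch.encode("utf-16-be"))   # b'\x00A'          -> 0x00 0x41 (big-endian; same as UCS-2 here)
    print(ch.encode("utf-16-le"))   # b'A\x00'          -> 0x41 0x00 (little-endian)
    print(ch.encode("utf-32-be"))   # b'\x00\x00\x00A'  -> four octets
    # Plain "utf-16" prepends a byte order mark (BOM) so a reader can tell
    # which byte order the writer used.
    print(ch.encode("utf-16"))      # e.g. b'\xff\xfeA\x00' on a little-endian system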
Choosing a Unicode Encoding Scheme

In developing for the Web, most of your choices for Unicode encoding schemes will have already been made for you when you select a protocol or technology. Yet, you may find instances in which you have the freedom to select which scheme to use for your application (especially in customized desktop applications). In such cases, there are several factors that will influence your decision.

First, it should be stated that there isn’t necessarily a right or wrong choice when it comes to a Unicode encoding scheme. In general, when choosing between the UCS and Unicode standards, the former tends to provide more generalized parameters for encoding and decoding, whereas Unicode tends to provide more granularity, precision, and restriction in its specifications. So, which you choose may depend on how much precise definition you want versus freedom in tailoring the standard to your application.

The variable-length capability of UTF-8 may give you greater flexibility, since the number of octets used varies with the characters actually stored. Yet, if you are designing an application that needs to parse Unicode at the byte level, the variable length of UTF-8 will require much more complex algorithms than the fixed-length encoding schemes of UCS (granted, you could use the fixed-length UTF-32, but if you don’t need to encode more than 65,536 characters you would be using twice as much space as the fixed-length UCS-2 scheme). Because UTF-8 encodes the first 128 code points, which match ASCII exactly, as single bytes, UTF-8 affords you Unicode and ASCII compatibility at the same time (talk about having your cake and eating it too). So, for cases in which maintaining ASCII compatibility is highly valued, UTF-8 is an obvious choice. This is one of the primary reasons that Active Server Pages and Internet Explorer use UTF-8 encoding for Unicode.

Yet, if you are working with an application that must parse and manipulate text at the byte level, the cost of variable-length encoding will probably outweigh the benefits of ASCII compatibility. In such cases the fixed length of UCS-2 will usually prove the better choice. This is why Windows NT and subsequent Microsoft operating systems, SQL Server 7.0 (and later versions), Java, COM, ODBC, OLE DB and the .NET Framework all represent text internally with a 16-bit encoding (originally UCS-2, now extended to UTF-16). The uniform length of UCS provides a good foundation when it comes to complex data manipulation.

If, on the other hand, you are creating an application that needs characters beyond the first 65,536 code points (the Basic Multilingual Plane), such as some of the rarer Han ideographs used in Asian languages, UCS-2 cannot represent them at all; you will need UTF-16 (which encodes them as surrogate pairs), UTF-32, or UCS-4. There are other technical differences between these standards that you may want to consider that are beyond the scope of this article (such as how UTF-16 supports surrogate pairs but UCS-2 does not). For a more detailed explanation of Unicode, see the Unicode Consortium’s article The Unicode® Standard: A Technical Introduction (http://www.unicode.org/standard/principles.html) as well as Chapter 2 of the Unicode Consortium’s The Unicode Standard, Version 4.0 (http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G11178).
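The space and representability trade-offs described above can be seen directly by comparing encoded sizes (one more Python sketch, used only for illustration; the sample strings are arbitrary):

    samples = {
        "English":        "Hello",
        "Central Europe": "zażółć",        # Polish letters outside ASCII
        "Japanese":       "日本語",
        "Beyond the BMP": "\U00020000",    # a Han ideograph above U+FFFF
    }
    for label, text in samples.items():
        print(label,
              len(text.encode("utf-8")),      # 1-4 octets per character
              len(text.encode("utf-16-le")),  # 2 octets, or 4 via a surrogate pair
              len(text.encode("utf-32-le")))  # always 4 octets per character
    # U+20000 cannot be expressed in UCS-2; UTF-16 uses a surrogate pair for it.
    print("\U00020000".encode("utf-16-be").hex())   # 'd840dc00'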
The separation that Unicode provides between the character set and the encoding scheme allows you to choose the smallest and most appropriate encoding scheme for representing all of the characters you need in a given application (thus providing considerable power and flexibility). Unicode is an evolving standard that continues to be refined and extended.

