Thanks everyone for your helpful feedback on the first draft of the MSDN
article. I couldn't fit in all of the suggestions as the Unicode portion is
only a small piece of my article. The following is the second draft based on
the corrections, additional information and resources provided.

Also, I would like feedback on the most accurate/appropriate term(s) for
describing the CCS, CEF and CES (layers, levels, components, etc.).

I am under a tight deadline and need to collect any final feedback rather
quickly before producing the final version.

Special thanks to Asmus for investing a lot of his time to help.

Thanks, John

--

Unicode Fundamentals
A considerable amount of misinformation on Unicode has proliferated among
developers, and the problem has only compounded over time. Deploying an
effective Unicode solution begins with a solid understanding of the
fundamentals. Unicode is far too complex a topic to cover in any depth here,
so additional resources are provided to take the reader beyond the scope of
this article.
Early character sets were very limited in scope. ASCII required only 7 bits
to represent its repertoire of 128 characters. The 8-bit character sets that
followed (commonly, if loosely, called ANSI character sets) represented 256
characters while providing backward compatibility with ASCII. Countless other
character sets emerged to represent the characters needed by various
languages and language groups. The growing complexity of managing numerous
international character sets escalated the need for a much broader solution
that represented the characters of virtually all written languages in a
single character set.
Two standards emerged at about the same time to address this demand. The
Unicode Consortium published the Unicode Standard and the International
Organization for Standardization (ISO) offered the ISO/IEC 10646 standard.
Fortunately, these two standards bodies synchronized their character sets
some years ago and continue to do so as new characters are added.
Yet, although the character sets are mapped identically, the standards for
encoding them vary in many ways (which are beyond the scope of this
article). It should be noted that if you implement Unicode you have fully
implemented ISO/IEC 10646, but the inverse isn't necessarily the case, as
Unicode adds further requirements and restrictions (e.g., character
semantics, normalization and bidirectional behavior).
When someone refers to Unicode, they are usually discussing the collective
offerings of these two standards bodies (whether they realize it or not).
Formally, these are distinct standards, but the differences are not relevant
for the purposes of this article. So, I will use the term Unicode in a
generic manner to refer to these collective standards.
The design constraints for Unicode were demanding. Consider that if all of
the world's characters were placed into a single repertoire, encoding them
with a single fixed-width unit could have required 32 bits per character.
Yet such a requirement would have made the solution impractical for most
computing applications. The solution had to provide broad character support
while offering considerable flexibility for encoding its characters in
different environments and applications.
To meet this challenge Unicode was designed with three unique layers or
components. Understanding the distinctions of these components is critical
to leveraging Unicode and deploying effective solutions. They are the coded
character set (CCS), character encoding forms (CEF) and character encoding
schemes (CES).
In brief, the coded character set contains all of the characters in Unicode
along with a corresponding integer by which each is referenced. Unicode
provides three character encoding forms for transforming those character
references into sequences of code units that computers can process. The
character encoding schemes establish how the code units produced by an
encoding form are serialized into bytes so that they can be transmitted and
stored.
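To make the three layers concrete, here is a rough sketch in Python (chosen
only because its standard codecs keep the example short; it is not part of
the article's Windows-centric toolset, and the character is arbitrary):

    ch = "\u20AC"                        # EURO SIGN

    # CCS: the abstract character maps to a code point (an integer)
    code_point = ord(ch)                 # 0x20AC

    # CEF: UTF-16 represents this code point as one 16-bit code unit;
    # CES: that code unit is then serialized to bytes in a given byte order
    print(f"U+{code_point:04X}")              # U+20AC
    print(ch.encode("utf-16-be").hex())       # 20ac (big-endian bytes)
    print(ch.encode("utf-16-le").hex())       # ac20 (little-endian bytes)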


Coded Character Sets
The fact that so many developers suggest that Unicode is a 16-bit character
set illustrates how widely Unicode is misunderstood (and how often these
three layers are not differentiated). The truth is that the Unicode character
set can be encoded using 8-, 16- or 32-bit code units (if you get nothing
else out of this article, at least get that point straight, and help put this
misconception to rest by passing it on to others).
A coded character set is a mapping from a set of abstract characters (the
repertoire) to a set of nonnegative integers (in the range 0 to 1,114,111)
called code points. The Unicode Standard contains one and only one coded
character set (which is precisely synchronized with the ISO/IEC 10646
character set). This character set contains most characters in use by most
written languages (including some dead languages) along with special
characters used for mathematical and other specialized applications.
Each character in Unicode is represented by a code point. These integer
values are typically written as U+ followed by the code point in hexadecimal.
Each code point represents a given character in the Unicode character
repertoire. For example, the English uppercase letter A is represented as
U+0041.
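As a quick illustration (again a Python sketch, with arbitrary sample
characters), the U+ notation can be produced directly from a character's code
point:

    clef = "\U0001D11E"                    # MUSICAL SYMBOL G CLEF
    print(f"U+{ord('A'):04X}")             # U+0041
    print(f"U+{ord(clef):04X}")            # U+1D11E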
If you are using Windows 2000, XP or 2003, you can run the charmap utility
to see how characters are mapped in Unicode on your system. These operating
systems are built on UTF-16 encoded Unicode.

Character Encoding Forms
The second component in Unicode is character encoding forms. Their purpose
is to map sets of code points contained in the character repertoire to
sequences of code units that can be represented in a computing environment
(using fixed-width or variable-width coding).
The Unicode Standard provides three forms for encoding its repertoire
(UTF-8, UTF-16 and UTF-32). You will often find references to UCS-2 and
UCS-4. These are earlier encoding forms defined by ISO/IEC 10646 (UCS-2 is an
older, fixed-width 16-bit form that covers only the first 65,536 code points,
while UCS-4 is equivalent to UTF-32). I will not discuss the distinctions and
merits of each and will simply suggest that most implementations today use
UTF-16 and UTF-32 (even though some are occasionally mislabeled as UCS-2 and
UCS-4).
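For orientation, the following sketch (Python again, characters chosen
arbitrarily) shows how many bytes one character occupies in each of the three
encoding forms; the -le codecs are used so no byte order mark is added:

    for ch in ("A", "\u00E9", "\u20AC", "\U0001D11E"):
        sizes = {form: len(ch.encode(form))
                 for form in ("utf-8", "utf-16-le", "utf-32-le")}
        print(f"U+{ord(ch):04X}", sizes)
    # U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
    # U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
    # U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
    # U+1D11E {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}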
As you might expect, UTF-32 is a 32-bit encoding form. That is, each
character is encoded using 4 bytes and 4 bytes only. Although this method
provides a fixed-width means of encoding, the overhead in terms of wasted
system resources (memory, disk space, transmission bandwidth) is significant
enough to limit its use, as at least half of the 32 bits will contain zeros
in the majority of applications. Except in some UNIX operating systems and
specialized applications with specific needs, UTF-32 is seldom implemented as
an end-to-end solution (yet it does have its strengths in certain
applications).
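One of those strengths is random access: because every code point takes
exactly four bytes, the n-th character of a UTF-32 string starts at byte
offset 4*n. A minimal Python sketch (the sample string and index are
illustrative only):

    import struct

    text = "A\u20AC\U0001D11E"              # three code points
    data = text.encode("utf-32-le")         # 12 bytes: 4 per code point

    n = 2                                   # index of the third character
    (cp,) = struct.unpack_from("<I", data, 4 * n)
    print(f"U+{cp:04X}")                    # U+1D11E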
UTF-16 is the default means of encoding the Unicode character repertoire
(which has perhaps played a role in the misnomer that Unicode is a 16-bit
character set). As you might expect, it is based on a 16-bit encoding form
(each code unit is represented by 2 bytes). But this isn't the whole story.
Since 16 bits can represent only 65,536 values, you might guess that if you
needed to access more characters than that you would be forced to use UTF-32.
But this isn't the case. UTF-16 has the ability to combine pairs of 16-bit
code units in cases in which 16 bits are inadequate. The code units used in
such a pair are called surrogates.
In most cases a single 16-bit code unit is adequate because the most
commonly used characters in the repertoire are placed in what is known as
the Basic Multilingual Plane (BMP), which is entirely accessible with a
single 16-bit code unit. When you need to access characters that are not in
the BMP, you represent them with a surrogate pair consisting of a high
surrogate and a low surrogate value. Unicode provides 1,024 unique high
surrogates and 1,024 unique (non-overlapping) low surrogates. Together, the
possible combinations of surrogates allow access to as many as 1,048,544
characters beyond the BMP using UTF-16 (in case you are doing the math, you
should know that 32 of the resulting code points are reserved as
noncharacters). So, UTF-16 is capable of representing the entire Unicode
character set given the extensibility that surrogates provide.
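The split into a surrogate pair is a simple bit-level calculation. A hedged
Python sketch (the code point is just a sample; the constants are the
standard surrogate bases):

    cp = 0x1D11E                               # a supplementary code point
    offset = cp - 0x10000                      # 20-bit offset above the BMP
    high = 0xD800 + (offset >> 10)             # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)            # low (trail) surrogate
    print(f"{high:04X} {low:04X}")             # D834 DD1E

    # The UTF-16 encoder produces exactly these two code units:
    print(chr(cp).encode("utf-16-be").hex())   # d834dd1e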
UTF-8 encoding was designed for environments built around 8-bit
(byte-oriented) data that need to support Unicode. Because the first 128
characters in the Unicode repertoire precisely match those in the ASCII
character set, and UTF-8 encodes each of them as the identical single byte,
UTF-8 affords the opportunity to maintain ASCII compatibility while
significantly extending its scope.
UTF-8 is a variable-width encoding form based on byte-sized code units; each
code point is encoded using one to four of these bytes. You will occasionally
run across the term octet in relation to Unicode. This is a term defined by
the ISO/IEC 10646 standard. It is synonymous with the term byte in the
Unicode Standard (an 8-bit byte).
In UTF-8, the high bits of each byte indicate where in the code unit
sequence that byte belongs: distinct ranges of 8-bit values are reserved for
lead bytes and for trailing bytes. By sequencing up to four bytes per code
point, UTF-8 is able to represent the entire Unicode character repertoire.
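The lead-byte and trailing-byte patterns are easy to see when the encoded
bytes are printed in binary. A short Python sketch with arbitrary sample
characters:

    for ch in ("A", "\u00E9", "\u20AC", "\U0001D11E"):
        encoded = ch.encode("utf-8")
        bits = " ".join(f"{b:08b}" for b in encoded)
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {bits}")
    # U+0041 -> 1 byte(s): 01000001
    # U+00E9 -> 2 byte(s): 11000011 10101001
    # U+20AC -> 3 byte(s): 11100010 10000010 10101100
    # U+1D11E -> 4 byte(s): 11110000 10011101 10000100 10011110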
You will occasionally see reference to an encoding labeled UTF-7. This is a
specialized format (more of a derivative) that keeps the encoded data
entirely within 7-bit ASCII for applications such as email systems that are
designed to handle only ASCII data (the eighth bit is always 0 so that no
data is lost in transit). As such, it is not part of the current definition
of the Unicode Standard. See Figure 1 for an illustration of how these
encoding forms represent data.
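If you want to see this in practice, Python happens to ship a utf-7 codec
(mentioned here purely as an illustration; it is not part of the article's
toolset):

    encoded = "\u00A3".encode("utf-7")      # POUND SIGN
    print(encoded)                          # b'+AKM-'
    print(all(b < 0x80 for b in encoded))   # True: every byte fits in 7 bits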

Character Encoding Schemes
Now that we know how the Unicode coded character set and the character
encoding forms are constructed, let's evaluate the final element of the
Unicode framework. A character encoding scheme provides reversible
transformations between sequences of code units and sequences of bytes. In
other words, the encoding scheme serializes the code units into sequences of
bytes that can be transmitted and stored as computer data.
When data types larger than a byte are represented in memory or in a file,
their bytes must be combined, and how they are combined varies depending on a
computer's architecture and the operating system running on top of it. There
are two primary methods used to establish the byte order: big-endian (BE,
most significant byte first) and little-endian (LE, least significant byte
first). See Figure 2 for a synopsis of these encoding schemes.
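The difference is simply the order in which the bytes of a multi-byte value
are written out. A minimal Python sketch using the struct module to make the
ordering explicit:

    import struct

    code_unit = 0x20AC                           # one 16-bit UTF-16 code unit
    print(struct.pack(">H", code_unit).hex())    # 20ac (big-endian: MSB first)
    print(struct.pack("<H", code_unit).hex())    # ac20 (little-endian: LSB first)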
Intel microprocessor architecture generally provides native support for
little-endian byte order (as do most Intel-compatible systems). Many
RISC-based processors natively support big-endian byte ordering. Some
processors, such as the PowerPC, are designed to support either method.
UTF-16 and UTF-32 provide the means to support either byte-sequencing
method. This issue is not relevant for UTF-8 because its code units are
individual bytes whose place in the sequence is indicated by the bytes
themselves (with bounded look-ahead). The byte order can be indicated with an
internal file signature using the byte order mark (BOM), U+FEFF. In UTF-8 a
BOM is not only unnecessary, it also destroys ASCII transparency (yet some
development tools automatically include a BOM when saving a file using UTF-8
encoding). Figure 3 illustrates the seven Unicode encoding schemes.
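A quick Python sketch of how the BOM shows up in serialized data (the plain
utf-16 codec writes a BOM in the platform's native byte order; the output
below assumes a little-endian machine):

    text = "A"
    print(text.encode("utf-16").hex())      # fffe4100  (BOM + little-endian data)
    print(text.encode("utf-16-be").hex())   # 0041      (no BOM, big-endian)
    print(text.encode("utf-8").hex())       # 41        (no BOM)
    print(text.encode("utf-8-sig").hex())   # efbbbf41  (UTF-8 signature added)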
To learn more about character encoding, see the Unicode Technical Report #17
at http://www.unicode.org/reports/tr17/.

Choosing an Encoding Solution
In developing for the Web, most of your choices for Unicode encoding schemes
will have already been made for you when you select a protocol or
technology. Yet, you may find instances in which you will have the freedom
to select which scheme to use for your application (when developing
customized application, API's, etc.).
When transferring Unicode data to a Web client, such as a browser, generally
you will want to use UTF-8. This is because ASCII compatibility carries a
high value in the multi-platform world of the Web. As such, HTML and current
versions of Internet Explorer running on Windows 2000 or later use the UTF-8
encoding form. If you try to force UTF-16 encoding on IE, you will encounter
an error or it will default to UTF-8 anyway.
Windows NT and later as well as SQL Server 7 and 2000, XML, Java, COM, ODBC,
OLEDB and the .NET framework are all built on UTF-16 Unicode encoding. For
most applications, UTF-16 is the ideal solution. It is more efficient than
UTF-32 while generally providing the same character support scope.
There are cases where UTF-32 is the preferred choice. If you are developing
an application that must perform intense processing or complex manipulation
of byte-level data, the fixed-width characteristic of UTF-32 can be a
valuable asset. The extra code and processor bandwidth required to
accommodate variable-width code unit sequences can outweigh the cost of using
32 bits to represent each character.
In such cases, the internal processing can be done using UTF-32 encoding and
the results can be transmitted or stored in UTF-16 (since Unicode provides
lossless transformation between these encoding forms - although there are
technical considerations that need to be understood before doing so, which
are beyond the scope of this article).
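As a minimal illustration of that lossless round trip (a hedged Python
sketch; the text is arbitrary and the caveats mentioned above still apply):

    stored = "Internal \U0001D11E processing".encode("utf-16-le")

    # Decode to code points, work on a fixed-width UTF-32 copy internally...
    internal = stored.decode("utf-16-le").encode("utf-32-le")

    # ...then re-encode to UTF-16 for storage; the bytes come back unchanged.
    restored = internal.decode("utf-32-le").encode("utf-16-le")
    assert restored == stored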
For a more detailed explanation of Unicode, see the Unicode Consortium's
article The Unicode® Standard: A Technical Introduction
(http://www.unicode.org/standard/principles.html) as well as Chapter 2 of
the Unicode Consortium's The Unicode Standard, Version 4.0
(http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G11178).
If you have specific questions about Unicode, I recommend joining the
Unicode Public email distribution list at
http://www.unicode.org/consortium/distlist.html.
