[unicode] Re: UCS-2 Files

Carl W. Brown Fri, 23 Mar 2001 09:22:04 -0800
Marco,

I find that people often understand it better when you get away from bytes,
octets, etc. and describe Unicode strings as arrays of unsigned short (16-bit
unsigned integers), in the same way that single-byte character strings are
arrays of 8-bit integers.  This way the only time you have to deal with
endian issues is when you deal with the memory or transmission layout of the
data.  This also helps when you get into null-terminated strings.  You
cannot terminate a Unicode string with a single null byte; the terminator
has to be a full 16-bit null character.
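
A minimal sketch of that point, in Python rather than C for brevity. The
names `units`, `raw`, and `find_terminator` are made up for illustration, and
a little-endian memory layout is assumed:

```python
import struct

def find_terminator(units):
    """Return the index of the first 16-bit unit equal to 0, or -1 if none."""
    for i, unit in enumerate(units):
        if unit == 0:
            return i
    return -1

# "Hi" followed by a 16-bit null terminator, seen as 16-bit code units.
units = [0x0048, 0x0069, 0x0000]

# The same string as it would sit in memory on a little-endian machine
# (assumption: "<" selects little-endian, "H" is an unsigned 16-bit integer).
raw = struct.pack("<3H", *units)

# Scanning for a single zero OCTET stops inside the first character,
# because the high half of U+0048 is 0x00.
byte_null_at = raw.index(0)

# Scanning 16-bit units finds the real terminator after both characters.
unit_null_at = find_terminator(units)
```

Here the octet scan stops at offset 1, in the middle of "H", while the
16-bit scan stops at unit 2, after the whole string.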

Carl

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Marco Cimarosti
Sent: Thursday, March 22, 2001 7:03 AM
To: 'Tomas McGuinness'; [EMAIL PROTECTED]
Subject: [unicode] Re: UCS-2 Files



Tomas McGuinness wrote:
> I have a question relating to UCS-2. I am currently developing a product
> that will support UCS-2 and I have been sent several documents encoded in
> UCS-2. I have no reader or writer for UCS-2 but I have performed Hexdumps
> in UNIX. At the beginning of the UCS-2 characters there are two rogue
> characters 0xFF and 0xFE. Have these characters any importance?

They are quite important, yes. See
http://www.unicode.org/unicode/faq/utf_bom.html#24 for details.

But, beware that they are NOT characters: they are OCTETS (also known as
"bytes")!

The first thing I'd suggest you do when starting to work with Unicode and
other character sets is to carefully separate the terms "byte" and
"character". Better still, also keep the distinction between "octet" (a
series of 8 bits) and "byte" (a series of n bits, where n is often but NOT
always 8).

In brief, those two octets tell you that:

1.      It is a Unicode text file.

2.      It is in UCS-2, UTF-16, or UTF-32 format (to determine whether it is
UTF-32 you need to read the next two octets: if they are 0x00 0x00, then it
is UTF-32. Otherwise it is either UCS-2 or UTF-16, which you basically don't
need to distinguish).

3.      The 16-bit units are little endian, so you have to interpret these
two octets as (0xFF + 0xFE * 256), which yields 0xFEFF, the code of the
"BOM".

4.      All subsequent pairs of octets a,b are interpreted the same way: (a
+ b * 256).
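
The four steps above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the function name `decode_units`
is invented here, only the little-endian case (0xFF 0xFE) is handled, and
the UTF-32 check follows the rule in step 2 (strictly speaking, 0xFF 0xFE
0x00 0x00 could also be a 16-bit BOM followed by a null character).

```python
def decode_units(data):
    """Return the 16-bit units that follow a little-endian BOM in `data`.

    Hypothetical helper: `data` is a bytes object holding the raw octets
    of the file. The BOM itself is not included in the result.
    """
    # Step 2: 0xFF 0xFE followed by 0x00 0x00 is taken to mean UTF-32.
    if data[:4] == b"\xff\xfe\x00\x00":
        raise ValueError("looks like little-endian UTF-32, not handled here")
    if data[:2] != b"\xff\xfe":
        raise ValueError("no little-endian BOM (0xFF 0xFE) at start of data")

    # Step 3: the first two octets, read little endian, give the BOM code.
    bom = data[0] + data[1] * 256
    if bom != 0xFEFF:
        raise ValueError("unexpected BOM value")

    body = data[2:]
    if len(body) % 2 != 0:
        raise ValueError("odd number of octets after the BOM")

    # Step 4: every following pair of octets a, b becomes a + b * 256.
    return [body[i] + body[i + 1] * 256 for i in range(0, len(body), 2)]


# Octets as a hexdump might show them: BOM, then "H", "i", and U+03B1.
sample = b"\xff\xfe" + b"H\x00" + b"i\x00" + b"\xb1\x03"
units = decode_units(sample)
```

For the sample octets, `units` holds 0x0048, 0x0069, and 0x03B1. A file
starting with 0xFE 0xFF would be big endian and would need the pairs read
the other way round (a * 256 + b); that case is left out of the sketch.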

Regards.
_ Marco