More than you wanted to know about GB2312

Tom Emerson Thu, 02 Nov 2000 18:56:43 -0800
This message provides a brief description of how the GB2312 encoding
(really EUC-CN, GB2312 is properly a character set, not an encoding)
works, including how to convert between row-cell and hex notation, and
what an octet stream looks like when it contains GB2312 code points.

By way of exposition, I'll use the Simplified Characters for
Zhong1guo2 (China), U+4E2D U+56FD.

The GB2312 hex values for these characters are 0x5650 0x397A. To
convert these to row-cell, subtract 0x2020 from each (that is, 0x20
from each byte) and convert each resulting byte to decimal:

GB2312
Hex Value    0x5650      0x397A
           - 0x2020    - 0x2020
           --------    --------
             0x3630      0x195A
Row-Cell      54-48       25-90

So the row-cell values for these characters are 54-48 and 25-90.
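
The subtraction above can be sketched in a few lines of Python. This
is only an illustration: the function names gb_hex_to_row_cell and
row_cell_to_gb_hex are made up for this example and do not come from
any library.

```python
def gb_hex_to_row_cell(code):
    """Convert a 16-bit GB2312 hex value to a (row, cell) pair.

    Hypothetical helper: subtracts 0x20 from each byte, as in the
    table above.
    """
    offset = code - 0x2020
    row = offset >> 8      # high byte is the row
    cell = offset & 0xFF   # low byte is the cell
    return row, cell


def row_cell_to_gb_hex(row, cell):
    """Inverse of the above: add 0x20 back to each byte."""
    return ((row + 0x20) << 8) | (cell + 0x20)


for code in (0x5650, 0x397A):
    row, cell = gb_hex_to_row_cell(code)
    print("0x%04X -> %d-%d" % (code, row, cell))
```

Neither helper checks that its input is in range (rows and cells run
from 1 to 94), so garbage in gives garbage out.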

In a text stream, GB2312 is encoded using an 8-bit encoding,
EUC-CN. Since each byte of a GB2312 code point is a 7-bit value, the
high bit of each byte is set to differentiate the Chinese characters
from ASCII, making them 8-bit. To accomplish this, you add 0x80 to
each byte of the hex value, or 0xA0 to each byte of the row-cell value
(which makes sense, since each row-cell byte is 0x20 less than the
corresponding hex byte, and adding 0x80 to the hex byte creates the
EUC-CN byte). So:

GB2312
Hex Value   0x5650      0x397A
          + 0x8080    + 0x8080
          --------    --------
EUC-CN      0xD6D0      0xB9FA
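
Both routes to the EUC-CN bytes (0x80 per byte from the hex value,
0xA0 per byte from the row-cell value) can be sketched like this. The
function names are again made up for illustration only.

```python
def gb_hex_to_euc_cn(code):
    """Set the high bit of each byte of a 16-bit GB2312 hex value."""
    euc = code + 0x8080
    return bytes([euc >> 8, euc & 0xFF])


def row_cell_to_euc_cn(row, cell):
    """Add 0xA0 (0x20 + 0x80) to the row and to the cell."""
    return bytes([row + 0xA0, cell + 0xA0])


# Both routes should give the same two octets per character.
print(gb_hex_to_euc_cn(0x5650).hex(), row_cell_to_euc_cn(54, 48).hex())
print(gb_hex_to_euc_cn(0x397A).hex(), row_cell_to_euc_cn(25, 90).hex())
```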

And indeed, if you create a GB2312-encoded file containing Zhong1guo2
and then look at the hex values, this is what you will see. RFC 1922
(which defines ISO-2022-CN) calls this encoding CN-GB.
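
One way to check this without a hex editor is to let Python do the
encoding. This sketch assumes Python 3, whose built-in "gb2312" codec
produces the EUC-CN byte form described above; the variable names are
arbitrary.

```python
# Zhong1guo2 written with Unicode escapes: U+4E2D U+56FD
text = "\u4e2d\u56fd"

# Python's "gb2312" codec emits EUC-CN octets (high bit set).
encoded = text.encode("gb2312")
print(encoded.hex())  # d6d0b9fa

# Round trip: the same octets decode back to the original text.
decoded = bytes.fromhex("d6d0b9fa").decode("gb2312")
print(decoded == text)  # True
```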

I know this is confusing, but hopefully this has helped a bit.

-- 
Tom Emerson                                          Basis Technology Corp.
Zenkaku Language Hacker                            http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"