I think I've figured out a way to find the beginning of a GB18030 character starting
anywhere in a document. The algorithm is similar to finding the beginning of a DBCS
character in that you scan backward until you find a byte that can only come at the
start of a character. The main difference is that you first check whether you are
inside a four-byte character (those of the form HdHd, where H is a byte in the range
0x81 - 0xFE and d is an ASCII digit, 0x30 - 0x39). If a four-byte character isn't
involved (ordinary GBxxxx two-byte characters don't use d as a trail byte), you revert
to the DBCS approach for handling the rest of GB18030.
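
A rough sketch of the idea in Python, for illustration only. It assumes well-formed
GB18030 input, and the names (is_lead, is_digit, char_len, char_start) are made up
for this example. Instead of the usual DBCS parity count, it backs up to a byte that
can only end a character (anything that is neither H nor d) and then walks forward,
testing for the HdHd form before falling back to the two-byte case.

```python
def is_lead(b: int) -> bool:
    # H: a byte that can begin a two- or four-byte character
    # (it can also appear later inside one).
    return 0x81 <= b <= 0xFE


def is_digit(b: int) -> bool:
    # d: ASCII digit, used as byte 2 and byte 4 of a four-byte character.
    return 0x30 <= b <= 0x39


def char_len(data: bytes, pos: int) -> int:
    """Length of the character starting at pos (pos must be a boundary)."""
    if not is_lead(data[pos]):
        return 1
    if pos + 1 < len(data) and is_digit(data[pos + 1]):
        return 4  # four-byte form HdHd
    return 2      # ordinary double-byte character


def char_start(data: bytes, i: int) -> int:
    """Index of the first byte of the character that contains data[i]."""
    # Scan backward over bytes that might sit in the middle of a
    # character (H or d). Any other byte always ends a character.
    j = i - 1
    while j >= 0 and (is_lead(data[j]) or is_digit(data[j])):
        j -= 1
    pos = j + 1  # known character boundary
    # Walk forward until we reach the character that covers index i.
    while pos + char_len(data, pos) <= i:
        pos += char_len(data, pos)
    return pos


# 'A', one double-byte character, one four-byte character, then '1'.
data = b"A\xd6\xd0\x81\x30\x81\x30" + b"1"
starts = [char_start(data, i) for i in range(len(data))]
```

Like the plain DBCS scan, the backward loop can run a long way in text that is one
unbroken run of H and d bytes.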
This algorithm is handy when you want to stream in a file in chunks and need to know
if a chunk ends in the middle of a character. One can also solve this particular
problem by keeping track of character boundaries from the start of stream, but
typically more processing is involved.
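
For comparison, here is a sketch of the other approach, tracking boundaries from the
start of the stream. It relies on Python's standard incremental decoder for the
"gb18030" codec; decode_chunks is a hypothetical helper name, and the chunks are
assumed to arrive in order from the beginning of the stream.

```python
import codecs


def decode_chunks(chunks):
    """Decode an iterable of GB18030 byte chunks split at arbitrary offsets."""
    decoder = codecs.getincrementaldecoder("gb18030")()
    for chunk in chunks:
        # Bytes of an incomplete trailing character are held back
        # inside the decoder until the next chunk arrives.
        yield decoder.decode(chunk)
    # final=True raises if the stream ends in the middle of a character.
    yield decoder.decode(b"", final=True)
```

The decoder carries its state from chunk to chunk, so it only works when you start
at the beginning of the stream, which is the limitation the backward scan avoids.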
Murray
-----Original Message-----
From: Carl W. Brown [mailto:[EMAIL PROTECTED]]
Sent: Fri 2001/09/21 04:56
To: Charlie Jolly; [EMAIL PROTECTED]
Cc:
Subject: RE: GB18030
Charlie,
GB18030 is designed to support all Unicode characters. It has the capacity
to also encode additional characters. I know of no plans to do so.
I don't think it will have much effect on Unicode. Most systems that handle
GB18030 will want to convert it to Unicode first to reduce processing
overhead. With most of the common MBCS code pages you can determine the
length of a character from its first byte. With GB18030 you sometimes
have to check the first two bytes. UTF-8, for example, is an MBCS
encoding, but I can still step backwards through a string. With GB18030
I must start over from the beginning of the string to find the start of
the previous character.
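
To illustrate the UTF-8 half of that comparison, a minimal sketch (utf8_prev_start is
a made-up name, and the input is assumed to be well-formed UTF-8). Continuation bytes
always look like 10xxxxxx, so stepping back one character only means skipping them.

```python
def utf8_prev_start(data: bytes, i: int) -> int:
    """Start index of the character that ends just before index i."""
    i -= 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# 'a' (1 byte), U+4E2D (3 bytes), U+00E9 (2 bytes)
sample = "a\u4e2d\u00e9".encode("utf-8")
```

No GB18030 byte value marks itself as a continuation in this way, which is why the
same one-line loop does not carry over.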
It is smaller than UTF-8 for Chinese and larger for anyone else.
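
A quick way to check sizes for a given text, sketched with Python's built-in codecs
(encoded_sizes and the sample strings are just for illustration). Pure ASCII comes
out the same size in both encodings; the difference shows up for scripts that
GB18030 places in its four-byte range.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in GB18030 and in UTF-8."""
    return {enc: len(text.encode(enc)) for enc in ("gb18030", "utf-8")}


chinese = encoded_sizes("\u4e2d\u6587")  # two common Chinese characters
arabic = encoded_sizes("\u0627")         # ARABIC LETTER ALEF
ascii_only = encoded_sizes("GB18030")
```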
Carl
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Charlie Jolly
> Sent: Friday, September 21, 2001 1:42 AM
> To: [EMAIL PROTECTED]
> Subject: GB18030
>
>
> GB18030
>
> In what ways will this affect Unicode?
>
> Does it contain anything that Unicode doesn't?
>
>
>
>