I think I've figured out a way to find the beginning of a GB18030 character starting 
anywhere in a document. The algorithm is similar to finding the beginning of a DBCS 
character in that you scan backward until you find a byte that can only come at the 
start of a character. The main difference is that you first check whether you are 
inside a four-byte character (one of the form HdHd, where H is a byte in the range 
0x81 - 0xFE and d is an ASCII digit, 0x30 - 0x39). If a four-byte character isn't 
involved (ordinary two-byte GBxxxx characters never use d as a trail byte), you 
revert to the DBCS approach for handling the rest of GB18030. 
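
Here is a minimal Python sketch of one way to realize this, assuming the buffer is 
valid GB18030 and begins on a character boundary. The names char_start, _is_h, _is_d 
and _char_len are made up for illustration. It is a variation on the idea rather than 
the exact steps: it backs up to just past a byte that is neither H nor d (such a byte 
can only end a character), then walks forward to the character containing the 
requested position.

```python
def _is_h(b: int) -> bool:
    # H: lead byte of a two- or four-byte character (also a possible trail byte).
    return 0x81 <= b <= 0xFE


def _is_d(b: int) -> bool:
    # d: ASCII digit, either a character by itself or byte 2/4 of HdHd.
    return 0x30 <= b <= 0x39


def _char_len(buf: bytes, q: int) -> int:
    """Length of the character that starts at boundary q.

    The result may run past len(buf) if the buffer is cut mid-character.
    """
    if not _is_h(buf[q]):
        return 1  # single-byte character
    if q + 1 < len(buf) and _is_d(buf[q + 1]):
        return 4  # HdHd: a digit is never a two-byte trail byte
    return 2      # ordinary two-byte character


def char_start(buf: bytes, i: int) -> int:
    """Index of the first byte of the character containing buf[i]."""
    # Scan backward over bytes that are ambiguous (H or d). Any other byte
    # is a single-byte character or a two-byte trail byte, so it ends a
    # character and the position after it is a boundary.
    p = i - 1
    while p >= 0 and (_is_h(buf[p]) or _is_d(buf[p])):
        p -= 1
    q = p + 1  # known boundary (or the start of the buffer)

    # Walk forward one character at a time until we reach the one holding i.
    while q + _char_len(buf, q) <= i:
        q += _char_len(buf, q)
    return q
```

For example, char_start("A中".encode("gb18030"), 2) returns 1, since byte 2 is the 
trail byte of the two-byte character that starts at byte 1. As with the DBCS method, 
the backward scan is as long as the run of H and d bytes before the position.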
 
This algorithm is handy when you want to stream in a file in chunks and need to know 
if a chunk ends in the middle of a character. One can also solve this particular 
problem by keeping track of character boundaries from the start of the stream, but 
that typically involves more processing.
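
As a rough illustration of the chunked case, here is a Python sketch that splits a 
chunk into its whole characters and a leftover tail to prepend to the next chunk. It 
assumes valid GB18030 and that each chunk passed in begins on a character boundary 
(true if the tail is always carried over). split_complete and decode_in_chunks are 
hypothetical names, and the backward scan stops at any byte that is neither H nor d, 
which is one way to apply the idea rather than a definitive implementation.

```python
from typing import BinaryIO, Iterator, Tuple


def split_complete(chunk: bytes) -> Tuple[bytes, bytes]:
    """Return (whole characters, incomplete tail) for a GB18030 chunk."""
    n = len(chunk)
    # Back up over bytes that could belong to an unfinished character:
    # H (0x81-0xFE) or d (ASCII digit). Any other byte ends a character.
    p = n - 1
    while p >= 0 and (0x81 <= chunk[p] <= 0xFE or 0x30 <= chunk[p] <= 0x39):
        p -= 1
    q = p + 1  # known boundary

    # Walk forward; stop at the first character that runs past the chunk.
    while q < n:
        if not 0x81 <= chunk[q] <= 0xFE:
            length = 1
        elif q + 1 < n and 0x30 <= chunk[q + 1] <= 0x39:
            length = 4
        else:
            length = 2  # also covers a lone lead byte at the very end
        if q + length > n:
            break
        q += length
    return chunk[:q], chunk[q:]


def decode_in_chunks(stream: BinaryIO, size: int = 4096) -> Iterator[str]:
    """Decode a GB18030 byte stream chunk by chunk."""
    tail = b""
    while True:
        block = stream.read(size)
        if not block:
            break
        complete, tail = split_complete(tail + block)
        yield complete.decode("gb18030")
    if tail:
        raise ValueError("stream ends inside a character")
```

The only state carried between chunks is the short tail (at most three bytes for 
valid input), so nothing has to be tracked from the start of the stream.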
 
Murray

        -----Original Message----- 
        From: Carl W. Brown [mailto:[EMAIL PROTECTED]] 
        Sent: Fri 2001/09/21 04:56 
        To: Charlie Jolly; [EMAIL PROTECTED] 
        Cc: 
        Subject: RE: GB18030
        
        

        Charlie,
        
        GB18030 is designed to support all Unicode characters.  It has the capacity
        to also encode additional characters.  I know of no plans to do so.
        
        I don't think it will have much effect on Unicode.  Most systems that handle
        GB18030 will want to convert it to Unicode first to reduce processing
        overhead.  With most of the common MBCS code pages you can determine the
        length of the character from the first byte.  With GB18030 you sometimes
        have to check the first two bytes.  UTF-8, for example, is an MBCS
        encoding, but I can still step backwards through a string.
        With GB18030 I must start over from the beginning of the string to find the
        start of the previous character.
        
        It is smaller than UTF-8 for Chinese and larger for anyone else.
        
        Carl
        
        > -----Original Message-----
        > From: [EMAIL PROTECTED]
        > [mailto:[EMAIL PROTECTED]]On Behalf Of Charlie Jolly
        > Sent: Friday, September 21, 2001 1:42 AM
        > To: [EMAIL PROTECTED]
        > Subject: GB18030
        >
        >
        > GB18030
        >
        > In what ways will this affect Unicode?
        >
        > Does it contain anything that Unicode doesn't?
        >
        >
        >
        >
        
        
        

