Thank you so much, everyone, for explaining it. I've got it now!

-James

On 12/11/12 11:50 AM, "[email protected]"
<[email protected]> wrote:

>From: James Lin <James_Lin_at_symantec.com>
>> Hi,
>> Does anyone know why ill-formed sequences occur in UTF-8? Besides the
>> fact that they don't follow the pattern of UTF-8 byte sequences, I'm
>> just wondering how or why.
>> If I have a code point, U+4E8C or "二":
>> in UTF-8 it's "E4 BA 8C", while in UTF-16 it's "4E8C". Where does this
>> "BA" come from?
>>
>> Thanks,
>> -James
>
>Each of the UTF encodings represents the binary data in a different way,
>so we need to break the scalar value, U+4E8C, into its binary
>representation before we proceed.
>
>4E8C -> 0100 1110 1000 1100
>
>Then, we need to look up the rules for UTF-8. They state that code points
>between U+0800 and U+FFFF are encoded with three bytes, in the form
>1110xxxx 10xxxxxx 10xxxxxx. So plugging in our data, we get
>
>        4      E    8     C
>      0100   1110 10-00 1100
>      ||||   ||||//   \||||
>+ 1110xxxx 10xxxxxx 10xxxxxx
>
>= 11100100 10111010 10001100
>or  E  4     B  A     8  C
>
>-Van Anderson
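
For anyone who wants to check the bit shuffling above by machine, here is a
minimal Python sketch of the three-byte case only. The function name
utf8_three_byte is made up for illustration and is not part of any library;
it assumes the input lies in U+0800..U+FFFF and is not a surrogate.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF using the 3-byte UTF-8 form.

    Hypothetical helper for illustration; real code should just call
    str.encode("utf-8").
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the three-byte form covers U+0800..U+FFFF only")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")

    # 1110xxxx: marker bits plus the top 4 bits of the code point
    b1 = 0b11100000 | (code_point >> 12)
    # 10xxxxxx: continuation marker plus the middle 6 bits
    b2 = 0b10000000 | ((code_point >> 6) & 0b111111)
    # 10xxxxxx: continuation marker plus the low 6 bits
    b3 = 0b10000000 | (code_point & 0b111111)
    return bytes([b1, b2, b3])


encoded = utf8_three_byte(0x4E8C)
print(" ".join(f"{b:02X}" for b in encoded))  # E4 BA 8C

# Cross-check against Python's built-in encoder.
assert encoded == "二".encode("utf-8")
```

The "BA" is 10 followed by the middle six bits, 111010: the last two bits of
the "E" nibble and all four bits of the "8" nibble.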


