Thank you so much, everyone, for explaining it. I get it now!
-James

On 12/11/12 11:50 AM, "[email protected]" <[email protected]> wrote:

>From: James Lin <James_Lin_at_symantec.com>
>> Hi
>> Does anyone know why ill-formed sequences occur in UTF-8? Besides not
>> following the pattern of UTF-8 byte sequences, I'm just wondering how or
>> why.
>> If I have a code point: U+4E8C or "二"
>> In UTF-8, it's "E4 BA 8C" while in UTF-16, it's "4E8C". Where does this
>> "BA" come from?
>>
>> thanks
>> -James
>
>Each of the UTF encodings represents the binary data in a different way.
>So we need to break the scalar value, U+4E8C, into its binary
>representation before we proceed.
>
>4E8C -> 0100 1110 1000 1100
>
>Then, we need to look up the rules for UTF-8. They state that code points
>between U+0800 and U+FFFF are encoded with three bytes, in the form
>1110xxxx 10xxxxxx 10xxxxxx. The 16 bits are regrouped as 4 + 6 + 6, so the
>"8" nibble (1000) is split as 10-00. Plugging in our data, we get
>
>    4    E    8     C
>    0100 1110 10-00 1100
>
>        0100   111010   001100
>        ||||   ||||||   ||||||
>+   1110xxxx 10xxxxxx 10xxxxxx
>
>=   11100100 10111010 10001100
>or     E4       BA       8C
>
>-Van Anderson
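The bit-packing in the quoted explanation can be sketched in Python as a quick check. This is only an illustration of the three-byte case (U+0800 to U+FFFF); the function name encode_utf8_3byte is made up for this example and is not from any library, and the built-in str.encode is used just to confirm the result.

```python
def encode_utf8_3byte(code_point: int) -> bytes:
    """Pack a code point into the 1110xxxx 10xxxxxx 10xxxxxx pattern.

    Hypothetical helper for illustration; handles only U+0800..U+FFFF.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the three-byte form covers U+0800..U+FFFF only")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")

    # Top 4 bits go after the 1110 lead-byte marker.
    b1 = 0b11100000 | (code_point >> 12)
    # Middle 6 bits go after the 10 continuation marker.
    b2 = 0b10000000 | ((code_point >> 6) & 0b111111)
    # Low 6 bits go after the 10 continuation marker.
    b3 = 0b10000000 | (code_point & 0b111111)
    return bytes([b1, b2, b3])


encoded = encode_utf8_3byte(0x4E8C)
print(" ".join(f"{b:02X}" for b in encoded))  # E4 BA 8C

# Cross-check against Python's own UTF-8 encoder.
assert encoded == "\u4e8c".encode("utf-8")
```

The "BA" byte is the 10 marker followed by the middle six bits 111010, which straddle the "E" and "8" hex digits of 4E8C.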

