On 10/11/09 19:44, Sean wrote:
>
> Hi,
>
> My input is from HTTP, 3 hard-coded bytes of UTF-8 hex value.
> What I want is 2 bytes unicode.
>
> For example:
> let input = "%E9%A6%AC"
> let output = "99AC"
>
> Based on the output, I can then get the real CJK: 馬.
>
> Is it possible to do it from within Vim?
>
> Thanks
>
> Sean

You can do it the hard way, with arithmetic computations which I shall 
explain below.

Or you can do it the easy way, by writing the bytes to disc as if they 
were Latin1 (see ":help ++opt") and reading them back as UTF-8.

Or you can use the iconv() function (q.v.).
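
For comparison, here is the "decode the bytes as UTF-8, then ask for the
codepoint" idea as a minimal sketch in Python rather than Vim script.
Python is used only for illustration, and the helper name
percent_utf8_to_hex is made up for this example:

```python
from urllib.parse import unquote_to_bytes

def percent_utf8_to_hex(escaped):
    """Return the codepoints of a percent-escaped UTF-8 string as hex strings."""
    raw = unquote_to_bytes(escaped)   # "%E9%A6%AC" -> the three raw bytes
    text = raw.decode("utf-8")        # raises UnicodeDecodeError on invalid UTF-8
    return [format(ord(ch), "04X") for ch in text]

print(percent_utf8_to_hex("%E9%A6%AC"))  # ['99AC']
```

The result is a list because the input may hold more than one character.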


UTF-8 bytes fall into mutually exclusive ("watertight") categories, as follows:

- Bytes 0x00 to 0x7F are "single" bytes, they each represent a single 
codepoint in the exact same format as in Latin-1 or 7-bit US-ASCII.

- Bytes 0xC0 to (currently) 0xF4 or (as originally foreseen and still 
supported by Vim) 0xFD are "header" bytes in a multibyte sequence. Such 
a byte MUST be the first byte of its sequence, and the number of "one" 
bits above the topmost "zero" bit gives the number of bytes 
(including this one) in the whole sequence. (0xC0 and 0xC1 could only 
start an overlong encoding of a 7-bit character, so they never occur in 
valid UTF-8.)

- Bytes 0x80 to 0xBF are "trailer" bytes in a multibyte sequence. They 
can be any byte in the sequence except the first.

- Bytes 0xFE and 0xFF are always invalid anywhere in UTF-8 text.

- In the bytes of a multibyte sequence, all bits after the topmost 
"zero" bit in each byte constitute the "payload": they are data bits, 
and in UTF-8 the most significant bits always come first.
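
The categories above can be sketched as two small functions. This is an
illustration in Python, not Vim script. The names classify and
sequence_length are invented here, and the ranges are taken exactly as
listed above, including the original 0xF5..0xFD headers:

```python
def classify(byte):
    """Classify one byte value (0..255) by its role in UTF-8 text."""
    if byte <= 0x7F:
        return "single"
    if byte <= 0xBF:
        return "trailer"
    if byte <= 0xFD:
        return "header"       # ranges as listed above; overlong forms not rejected
    return "invalid"          # 0xFE and 0xFF

def sequence_length(header):
    """Count the 'one' bits above the topmost 'zero' bit of a header byte."""
    length = 0
    mask = 0x80
    while header & mask:      # stops at the first zero bit (or when mask reaches 0)
        length += 1
        mask >>= 1
    return length

for b in (0x41, 0xE9, 0xA6, 0xFF):
    print(hex(b), classify(b))
print(sequence_length(0xE9))  # 3
```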


Your example translates as follows:

0xE9 = 1110.1001 binary
        header byte
        the sequence is of three bytes
        payload: 1001
0xA6 = 1010.0110 binary
        trailer byte
        payload: 100110
0xAC = 1010.1100 binary
        trailer byte
        payload: 101100
Result (concatenated payload bits) 1001.1001.1010.1100 binary, or U+99AC
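
The same arithmetic can be written as a short routine. What follows is
a sketch in Python rather than Vim script. The function name
decode_sequence is hypothetical, and it decodes exactly one sequence
without checking for overlong forms:

```python
def decode_sequence(seq):
    """Decode one complete UTF-8 sequence (a list of byte values) to a codepoint."""
    header = seq[0]
    if header <= 0x7F:
        return header                     # single byte: the value is the codepoint
    # The number of leading one bits in the header is the sequence length.
    length, mask = 0, 0x80
    while header & mask:
        length += 1
        mask >>= 1
    if not 2 <= length <= 6 or length != len(seq):
        raise ValueError("bad header byte or wrong number of bytes")
    codepoint = header & (mask - 1)       # header payload: bits below the zero bit
    for trailer in seq[1:]:
        if trailer & 0xC0 != 0x80:
            raise ValueError("expected a trailer byte (0x80..0xBF)")
        codepoint = (codepoint << 6) | (trailer & 0x3F)   # six payload bits each
    return codepoint

print(format(decode_sequence([0xE9, 0xA6, 0xAC]), "04X"))  # 99AC
```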

Note that some hanzi are above U+20000; the UTF-8 code for them consists 
of four bytes, not three: e.g. 𠄣 = U+20123 = UTF-8 0xF0 0xA0 0x84 0xA3 
= %F0%A0%84%A3 in "percent-escaped" HTTP coding.
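
Going the other way, from a codepoint to its percent-escaped bytes, can
be sketched like this (Python, for illustration only). The name
percent_encode_codepoint is invented, every byte is escaped, even plain
ASCII, and the function does not reject surrogates:

```python
def percent_encode_codepoint(cp):
    """Encode one codepoint (U+0000..U+10FFFF) as percent-escaped UTF-8."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("codepoint out of range")
    if cp <= 0x7F:
        seq = [cp]
    elif cp <= 0x7FF:
        seq = [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    elif cp <= 0xFFFF:
        seq = [0xE0 | (cp >> 12),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F)]
    else:
        seq = [0xF0 | (cp >> 18),
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F)]
    return "".join("%{:02X}".format(b) for b in seq)

print(percent_encode_codepoint(0x20123))  # %F0%A0%84%A3
```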


The Unicode code space had originally been foreseen as ranging from 
U+0000 to U+7FFFFFFF, but the current standards cap it at U+10FFFF. 
Codepoints whose hex representation is xxFFFE or xxFFFF (where xx is 
anything from 00 to 10) have been expressly designated as noncharacters, 
never to be assigned to a character, so the highest usable codepoint is 
U+10FFFD.
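
As a rough check of those two rules only (the upper limit and the
xxFFFE/xxFFFF exclusion), one could write something like the following
Python sketch. The name is_usable_codepoint is hypothetical, and other
restrictions in the standard are deliberately not covered:

```python
def is_usable_codepoint(cp):
    """Apply only the two rules stated above: range limit and xxFFFE/xxFFFF."""
    if not 0 <= cp <= 0x10FFFF:
        return False                    # outside the current code space
    return (cp & 0xFFFE) != 0xFFFE      # low 16 bits are neither FFFE nor FFFF

for cp in (0x99AC, 0x20123, 0xFFFE, 0x10FFFD, 0x10FFFF, 0x7FFFFFFF):
    print("U+{:04X}".format(cp), is_usable_codepoint(cp))
```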


Best regards,
Tony.
-- 
Putt's Law:
        Technology is dominated by two types of people:
                Those who understand what they do not manage.
                Those who manage what they do not understand.

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
