Hi Tony,

I thought I knew enough about Unicode and UTF-8, but after reading
your message I see that it was next to nothing.

Now, I get what I want:

let input = "\xE9\xA6\xAC"
let output = iconv(input, "utf-8", "utf8")

Bingo! The output is the real character ==> '馬'

Thanks again.

Sean

On Nov 10, 12:06 pm, Tony Mechelynck wrote:
> On 10/11/09 19:44, Sean wrote:
>
> > Hi,
>
> > My input comes from HTTP: the three bytes of a UTF-8 sequence,
> > hard-coded as percent-escaped hex values.
> > What I want is the 2-byte Unicode codepoint.
>
> > For example:
> > let input = "%E9%A6%AC"
> > let output = "99AC"
>
> > Based on the output, I can then get the real CJK: 馬.
>
> > Is it possible to do it from within Vim?
>
> > Thanks
>
> > Sean
>
> You can do it the hard way, with arithmetic computations which I shall
> explain below.
>
> Or you can do it the easy way, by writing the bytes to disc as if they
> were Latin1 (see ":help ++opt") and reading them back as UTF-8.
>
> Or you can use the iconv() function (q.v.).
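
For comparison outside Vim, here is a minimal Python sketch of the "easy way": let a library undo the percent-escapes and decode the UTF-8 bytes, then print the codepoint as hex. It is written in Python rather than Vim script, and the function name percent_utf8_to_hex is made up for illustration; it assumes the input holds exactly one character.

```python
from urllib.parse import unquote_to_bytes


def percent_utf8_to_hex(escaped: str) -> str:
    """Return the codepoint of one percent-escaped UTF-8 character as hex."""
    raw = unquote_to_bytes(escaped)   # "%E9%A6%AC" -> b'\xe9\xa6\xac'
    char = raw.decode("utf-8")        # raises UnicodeDecodeError if malformed
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return format(ord(char), "04X")   # at least four uppercase hex digits


print(percent_utf8_to_hex("%E9%A6%AC"))  # 99AC
```
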
>
> UTF-8 bytes are divided in "waterproof" categories as follows:
>
> - Bytes 0x00 to 0x7F are "single" bytes, they each represent a single
> codepoint in the exact same format as in Latin-1 or 7-bit US-ASCII.
>
> - Bytes 0xC0 to (currently) 0xF4 or (as originally foreseen and still
> supported by Vim) 0xFD are "header" bytes in a multibyte sequence. Such
> a byte MUST be the first byte of its sequence, and the number of "one"
> bits above the topmost "zero" bit indicates the number of bytes
> (including this one) in the whole sequence. Note that 0xC0 and 0xC1 have
> the header bit pattern but could only start overlong encodings of
> U+0000 to U+007F, so they never occur in valid UTF-8.
>
> - Bytes 0x80 to 0xBF are "trailer" bytes in a multibyte sequence. They
> can be any byte in the sequence except the first.
>
> - Bytes 0xFE and 0xFF are always invalid anywhere in UTF-8 text.
>
> - In the bytes of a multibyte sequence, all bits after the topmost
> "zero" bit in each byte constitute the "payload": they are data bits,
> and in UTF-8 the most significant bits always come first.
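
The categories above can be sketched as a small classifier. This is an illustrative Python sketch, not anything Vim does internally: the names classify_utf8_byte and sequence_length are made up, and the code looks only at the bit pattern, following the original scheme with headers up to 0xFD. A strict decoder for today's standard would reject more header values than this does.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a byte value (0..255) by its bit pattern alone."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"    # 0xxxxxxx
    if b <= 0xBF:
        return "trailer"   # 10xxxxxx
    if b <= 0xFD:
        return "header"    # 110xxxxx up to 1111110x
    return "invalid"       # 0xFE and 0xFF


def sequence_length(header: int) -> int:
    """Count the "one" bits above the topmost "zero" bit of a header byte."""
    count = 0
    mask = 0x80
    while header & mask:
        count += 1
        mask >>= 1
    return count


for b in (0x41, 0xE9, 0xA6, 0xFE):
    print(format(b, "02X"), classify_utf8_byte(b))
print(sequence_length(0xE9))  # 3
```
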
>
> Your example translates as follows:
>
> 0xE9 = 1110.1001 binary
>         header byte
>         the sequence is of three bytes
>         payload: 1001
> 0xA6 = 1010.0110 binary
>         trailer byte
>         payload: 100110
> 0xAC = 1010.1100 binary
>         trailer byte
>         payload: 101100
> Result (concatenated payload bits) 1001.1001.1010.1100 binary, or U+99AC
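
The same walk-through, done with shifts and masks, looks roughly like this in Python. It is only a sketch of the arithmetic: decode_utf8_sequence is a made-up name, the input is assumed to be exactly one sequence, and there are no checks for overlong forms or surrogates.

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Concatenate the payload bits of one UTF-8 sequence into a codepoint."""
    if not seq:
        raise ValueError("empty input")
    first = seq[0]
    if first < 0x80:                   # single byte: the value is the codepoint
        if len(seq) != 1:
            raise ValueError("unexpected extra bytes")
        return first
    length = 0                         # number of leading one bits
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if not 2 <= length <= 6 or len(seq) != length:
        raise ValueError("not one well-formed sequence")
    codepoint = first & (mask - 1)     # header payload: bits below the top zero bit
    for b in seq[1:]:
        if b & 0xC0 != 0x80:           # every trailer must look like 10xxxxxx
            raise ValueError("expected a trailer byte")
        codepoint = (codepoint << 6) | (b & 0x3F)   # append six payload bits
    return codepoint


print(format(decode_utf8_sequence(b"\xE9\xA6\xAC"), "04X"))  # 99AC
```
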
>
> Note that some hanzi are above U+20000; the UTF-8 code for them consists
> of four bytes, not three: e.g. 𠄣 = U+20123 = UTF-8 0xF0 0xA0 0x84 0xA3
> = %F0%A0%84%A3 in the "percent-escaped" coding used in URLs.
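
The four-byte example can be checked in the other direction, from codepoint to bytes to percent-escapes. This is a small Python sketch using the standard library; the helper name utf8_percent_escape is made up for illustration.

```python
from urllib.parse import quote


def utf8_percent_escape(codepoint: int) -> str:
    """Percent-escape the UTF-8 bytes of a single codepoint."""
    return quote(chr(codepoint))       # quote() encodes as UTF-8 by default


raw = chr(0x20123).encode("utf-8")
print(len(raw))                        # 4
print(raw.hex(" ").upper())            # F0 A0 84 A3
print(utf8_percent_escape(0x20123))    # %F0%A0%84%A3
```
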
>
> The Unicode code space had originally been foreseen as ranging from
> U+0000 to U+7FFFFFFF, but the current standards end it at U+10FFFF.
> Codepoints whose hex representation is xxFFFE or xxFFFF (where xx is
> anything from 00 to 10) have been expressly designated as
> noncharacters, never to be used for interchange, so the highest
> codepoint that can ever be assigned is U+10FFFD.
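
The two rules in the paragraph above can be sketched as a range test plus a mask test. This Python sketch takes U+10FFFF as the end of the code space; since U+10FFFE and U+10FFFF fall under the FFFE/FFFF rule, the highest codepoint that passes is U+10FFFD. The name is_usable_codepoint is made up, and the sketch covers only these two rules, not every restriction in the standard.

```python
MAX_CODEPOINT = 0x10FFFF   # end of the current Unicode code space


def is_usable_codepoint(cp: int) -> bool:
    """Apply only the two rules above: the range limit and xxFFFE/xxFFFF."""
    if not 0 <= cp <= MAX_CODEPOINT:
        return False
    if (cp & 0xFFFE) == 0xFFFE:        # low 16 bits are FFFE or FFFF
        return False
    return True


for cp in (0x99AC, 0x20123, 0xFFFE, 0x10FFFD, 0x10FFFF, 0x110000):
    print(format(cp, "04X"), is_usable_codepoint(cp))
```
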
>
> Best regards,
> Tony.
> --
> Putt's Law:
>         Technology is dominated by two types of people:
>                 Those who understand what they do not manage.
>                 Those who manage what they do not understand.
--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---