Hi Tony,

I thought I knew enough about Unicode and UTF-8, but after reading your message I realize I knew next to nothing.
Now I get what I want:

    let input = "\xE9\xA6\xAC"
    let output = iconv(input, "utf-8", "utf8")

Bingo! The output is the real character ==> '馬'

Thanks again.

Sean

On Nov 10, 12:06 pm, Tony Mechelynck wrote:
> On 10/11/09 19:44, Sean wrote:
> > Hi,
> >
> > My input is from HTTP, 3 hard-coded bytes of UTF-8 hex value.
> > What I want is 2 bytes unicode.
> >
> > For example:
> > let input = "%E9%A6%AC"
> > let output = "99AC"
> >
> > Based on the output, I can then get the real CJK: 馬.
> >
> > Is it possible to do it from within Vim?
> >
> > Thanks
> >
> > Sean
>
> You can do it the hard way, with arithmetic computations which I shall
> explain below.
>
> Or you can do it the easy way, by writing the bytes to disc as if they
> were Latin1 (see ":help ++opt") and reading them back as UTF-8.
>
> Or you can use the iconv() function (q.v.).
>
> UTF-8 bytes are divided in "waterproof" categories as follows:
>
> - Bytes 0x00 to 0x7F are "single" bytes, they each represent a single
> codepoint in the exact same format as in Latin-1 or 7-bit US-ASCII.
>
> - Bytes 0xC0 to (currently) 0xF4 or (as originally foreseen and still
> supported by Vim) 0xFD are "header" bytes in a multibyte sequence. Such
> a byte MUST be the first byte of its sequence and the number of "one"
> bits above the topmost "zero" bit indicates the number of bytes
> (including this one) in the whole sequence.
>
> - Bytes 0x80 to 0xBF are "trailer" bytes in a multibyte sequence. They
> can be any byte in the sequence except the first.
>
> - Bytes 0xFE and 0xFF are always invalid anywhere in UTF-8 text.
>
> - In the bytes of a multibyte sequence, all bits after the topmost
> "zero" bit in each byte constitute the "payload": they are data bits,
> and in UTF-8 the most significant bits always come first.
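The byte categories in the list above can be checked mechanically. What follows is only a minimal sketch, written in Python rather than Vim script for illustration. The function names `classify_utf8_byte` and `sequence_length` are hypothetical, and the ranges are copied from the list as given, including the historical header bytes up to 0xFD that Vim still accepts.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value (0..255) using the ranges from the list above."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"    # 0x00..0x7F: same as 7-bit US-ASCII
    if b <= 0xBF:
        return "trailer"   # 0x80..0xBF: any position except the first
    if b <= 0xFD:
        return "header"    # 0xC0..0xFD: first byte of a multibyte sequence
    return "invalid"       # 0xFE and 0xFF never occur in UTF-8 text


def sequence_length(header: int) -> int:
    """Count the 'one' bits above the topmost 'zero' bit of a header byte."""
    if classify_utf8_byte(header) != "header":
        raise ValueError("not a header byte")
    length = 0
    mask = 0x80
    while header & mask:
        length += 1
        mask >>= 1
    return length


# The three bytes from the example, plus one ASCII and one invalid byte.
for b in (0x41, 0xE9, 0xA6, 0xAC, 0xFE):
    print(f"0x{b:02X}: {classify_utf8_byte(b)}")
print(sequence_length(0xE9))  # 0xE9 = 1110.1001, so a three-byte sequence
```

The sketch follows the list literally. A stricter validator would also reject header bytes that can only start overlong or out-of-range sequences.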
>
> Your example translates as follows:
>
> 0xE9 = 1110.1001 binary
>     header byte
>     the sequence is of three bytes
>     payload: 1001
> 0xA6 = 1010.0110 binary
>     trailer byte
>     payload: 100110
> 0xAC = 1010.1100 binary
>     trailer byte
>     payload: 101100
> Result (concatenated payload bits) 1001.1001.1010.1100 binary, or U+99AC
>
> Note that some hanzi are above U+20000; the UTF-8 code for them consists
> of four bytes, not three: e.g. 𠄣 = U+20123 = UTF-8 0xF0 0xA0 0x84 0xA3
> = %F0%A0%84%A3 in "percent-escaped" HTTP coding.
>
> The Unicode code space had originally been foreseen as ranging from
> U+0000 to U+7FFFFFFF but the current standards say that no codepoints
> above U+10FFFD will ever be valid; also, codepoints whose hex
> representation is xxFFFE or xxFFFF (where xx is anything) have been
> expressly designated as invalid, never to be used.
>
> Best regards,
> Tony.
> --
> Putt's Law:
> Technology is dominated by two types of people:
> Those who understand what they do not manage.
> Those who manage what they do not understand.

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
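The "hard way" arithmetic from Tony's message (drop the marker bits of each byte, then concatenate the payload bits with the most significant first) can be sketched as follows. This is an illustrative Python version, not Vim script. The helper name `decode_utf8_sequence` is hypothetical, and the sketch assumes it is handed exactly one complete sequence. It does not reject overlong encodings or codepoints beyond the current Unicode range.

```python
from urllib.parse import unquote_to_bytes


def decode_utf8_sequence(seq: bytes) -> int:
    """Return the codepoint encoded by one complete UTF-8 sequence."""
    if not seq:
        raise ValueError("empty sequence")
    header = seq[0]
    if header < 0x80:
        # "Single" byte: the byte value is the codepoint.
        if len(seq) != 1:
            raise ValueError("single byte followed by extra bytes")
        return header
    if header >= 0xFE:
        raise ValueError("0xFE and 0xFF are never valid in UTF-8")

    # Count the one bits above the topmost zero bit: the sequence length.
    length = 0
    mask = 0x80
    while header & mask:
        length += 1
        mask >>= 1
    if length < 2 or length != len(seq):
        raise ValueError("malformed sequence")

    # Header payload: every bit below the topmost zero bit.
    codepoint = header & (mask - 1)
    for trailer in seq[1:]:
        if trailer & 0xC0 != 0x80:
            raise ValueError("expected a trailer byte (0x80..0xBF)")
        # Each trailer byte contributes six payload bits.
        codepoint = (codepoint << 6) | (trailer & 0x3F)
    return codepoint


raw = unquote_to_bytes("%E9%A6%AC")   # the percent-escaped input from the thread
cp = decode_utf8_sequence(raw)
print(f"U+{cp:04X}", chr(cp))         # prints: U+99AC 馬
```

Running the same function on the bytes of `%F0%A0%84%A3` gives 0x20123, matching the four-byte example in the message. In practice a built-in decoder (`iconv()` in Vim, `bytes.decode("utf-8")` in Python) does this work and also validates the input.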
