When exchanging data with Windows, for example through the clipboard, gvim 
converts the text to UCS-2. Unlike UTF-16, however, UCS-2 cannot encode 
non-BMP characters. 

For example, pasting the non-BMP character U+248BB from the Windows clipboard 
inserts two separate characters, <d852> and <dcbb>. This is caused by the 
function ucs2_to_utf8() in src/os_mswin.c, which treats the two halves of a 
surrogate pair as separate Unicode characters and converts them into the 
invalid UTF-8 sequence 0xED 0xA1 0x92 0xED 0xB2 0xBB -- the correct UTF-8 
sequence is 0xF0 0xA4 0xA2 0xBB.
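
To illustrate the paste direction, here is a minimal sketch of the surrogate 
pair handling described above. It is not the code from the patch or from 
src/os_mswin.c; the names utf16_decode() and utf8_encode() are hypothetical 
and chosen only for this example.

```c
#include <stddef.h>

/* Hypothetical helper, not Vim's actual code: decode one code point from
 * UTF-16 input, combining a surrogate pair when one is present.
 * Returns the number of 16-bit units consumed (1 or 2). */
size_t utf16_decode(const unsigned short *in, size_t len, unsigned long *cp)
{
    unsigned long hi = in[0];

    if (hi >= 0xD800 && hi <= 0xDBFF && len >= 2
            && in[1] >= 0xDC00 && in[1] <= 0xDFFF) {
        /* High surrogate followed by low surrogate: one non-BMP character. */
        *cp = 0x10000 + ((hi - 0xD800) << 10) + (unsigned long)(in[1] - 0xDC00);
        return 2;
    }
    *cp = hi;   /* BMP character (an unpaired surrogate is passed through) */
    return 1;
}

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written to out (at most 4). */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With the input <d852> <dcbb>, utf16_decode() consumes both units and yields 
0x248BB, which utf8_encode() writes as 0xF0 0xA4 0xA2 0xBB.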

Similarly, copying the non-BMP character U+248BB to the Windows clipboard 
leaves U+48BB in the clipboard, because the function utf8_to_ucs2() in 
src/os_mswin.c truncates the integer 0x248BB to the 16-bit short integer 
0x48BB.
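
For the copy direction, the following sketch shows how a code point above 
U+FFFF can be split into a surrogate pair instead of being truncated. Again 
this is only an illustration, not the patch itself; utf16_encode() is a 
hypothetical name, and the code assumes a 16-bit unsigned short, as on 
Windows.

```c
#include <stddef.h>

/* Hypothetical helper, not Vim's actual code: encode one code point as
 * UTF-16. Returns the number of 16-bit units written to out (1 or 2). */
size_t utf16_encode(unsigned long cp, unsigned short *out)
{
    if (cp >= 0x10000) {
        /* Non-BMP: split the 20-bit offset into two 10-bit halves. */
        cp -= 0x10000;
        out[0] = (unsigned short)(0xD800 + (cp >> 10));    /* high surrogate */
        out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    /* BMP: the value fits in a single unit, so the cast loses nothing. */
    out[0] = (unsigned short)cp;
    return 1;
}
```

For 0x248BB this writes <d852> <dcbb>, whereas a plain cast to a 16-bit 
unsigned short keeps only the low 16 bits, 0x48BB.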

The attachment is a patch that adds surrogate pair handling to the two 
functions mentioned above. With it, non-BMP characters are exchanged 
correctly with the Windows clipboard, as my tests show:

        Pasting/copying a non-BMP character from/to the Windows clipboard
        +----------+--------------------------------+------------------------+
        |          | WindowsXP with GB18030 support |  Windows 98            |
        +----------+--------------------------------+------------------------+
        | editing  | before patch: broken           | before patch: broken   |
        | UTF-* or | after patch: works OK          | after patch: works OK  |
        | UCS-4*   |                                |                        |
        | text     |                                |                        |
        +----------+--------------------------------+------------------------+
        | editing  | before patch: broken           | (cannot edit           |
        | GB18030  | after patch: works OK          |  GB18030 text)         |
        | text     |                                |                        |
        +----------+--------------------------------+------------------------+

By the way, I think it would be better to rename the two functions mentioned 
above to "utf16_to_utf8" and "utf8_to_utf16".

Best regards,
Yanwei.
--

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Attachment: for72025.tgz
Description: Binary data
