When exchanging data with Windows, for example through clipboard operations, gvim converts the text to UCS-2. Unlike UTF-16, however, UCS-2 cannot encode non-BMP characters.
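To make the difference concrete, here is a minimal sketch in C of how UTF-16 represents a non-BMP character as a surrogate pair, and how such a code point is encoded in UTF-8. The function names split_surrogates(), join_surrogates() and encode_utf8() are hypothetical and chosen only for illustration; they are not the functions in src/os_mswin.c and not the code in the patch. U+248BB is used as a sample non-BMP character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a non-BMP code point (U+10000..U+10FFFF) into a UTF-16
 * surrogate pair: high surrogate in *hi, low surrogate in *lo. */
void split_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                           /* 20 bits remain */
    *hi = (uint16_t)(0xD800 + (cp >> 10));   /* upper 10 bits */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* lower 10 bits */
}

/* Combine a high and a low surrogate back into one code point. */
uint32_t join_surrogates(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000
        + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Encode one code point (up to U+10FFFF) as UTF-8.
 * out must have room for 4 bytes; returns the number of bytes written. */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With these helpers, split_surrogates(0x248BB, &hi, &lo) gives hi = 0xD852 and lo = 0xDCBB, and encode_utf8(0x248BB, buf) writes the four bytes 0xF0 0xA4 0xA2 0xBB. A converter that skips the join step and encodes each surrogate on its own produces two 3-byte sequences instead, and one that casts the code point to a 16-bit integer keeps only 0x48BB.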
For example, when pasting the non-BMP character U+248BB from the Windows clipboard, gvim inserts two separate characters, <d852> <dcbb>. This is caused by the function ucs2_to_utf8() in src/os_mswin.c, which treats the two halves of a surrogate pair as separate Unicode characters and converts them into the invalid UTF-8 sequence 0xED 0xA1 0x92 0xED 0xB2 0xBB. The correct UTF-8 sequence is 0xF0 0xA4 0xA2 0xBB.

Similarly, when copying the non-BMP character U+248BB to the Windows clipboard, the clipboard ends up containing U+48BB, because the function utf8_to_ucs2() in src/os_mswin.c casts the integer 0x248BB to a short integer, which leaves 0x48BB.

The attachment is a patch that adds surrogate pair handling to the two functions mentioned above. With it, non-BMP characters are exchanged correctly with the Windows clipboard. My test results:

Non-BMP character paste from/copy into Windows clipboard
+----------+---------------------------------+------------------------+
|          | Windows XP with GB18030 support | Windows 98             |
+----------+---------------------------------+------------------------+
| editing  | before patch: broken            | before patch: broken   |
| UTF-* or | after patch: OK                 | after patch: OK        |
| UCS-4*   |                                 |                        |
| text     |                                 |                        |
+----------+---------------------------------+------------------------+
| editing  | before patch: broken            | (cannot edit           |
| GB18030  | after patch: OK                 | GB18030 text)          |
| text     |                                 |                        |
+----------+---------------------------------+------------------------+

By the way, I think it would be better to rename the two functions mentioned above to "utf16_to_utf8" and "utf8_to_utf16".

Best regards,
Yanwei.
for72025.tgz
Description: Binary data