Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset
Sorry, my mistake: I had set 'fileencoding' in my '.vimrc'...

P.S. Sorry for replying so late. I could not access Google Groups for the last few days.

On 2010-8-28, 03:37, Ben Fritz fritzophre...@gmail.com wrote:
> On Aug 25, 11:11 pm, JiaYanwei jia...@126.com wrote:
>> e.g. If the system/vim encoding is 'UTF-8', but a text file encoding is 'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we should change the HTML charset to 'iso-8859-1', or save the generated HTML file by ':w ++enc=utf-8'.
>
> Hmm... if I understand correctly, the sequence is:
>
> 1. Load text file. File encoding is latin-1, Vim encoding is utf-8.
> 2. Do :TOhtml to create a new html buffer. File encoding defaults to empty, Vim encoding is still utf-8.
> 3. :TOhtml sees 'encoding' and sets the charset in the generated markup to UTF-8.
> 4. :w the new html buffer. Vim sees the empty file encoding, so it uses utf-8 as the new file's encoding.
>
> Thus the file encoding matches the html charset. You claim that the new html buffer has latin-1 encoding. Am I missing something here?
>
> I still think using 'fileencoding' might be the correct way to do it, but doing so would require 2html.vim to explicitly set the file encoding of the new html buffer to that of the source file.
>
> This also brings up another shortcoming of 2html: using g:html_use_encoding may change the html charset meta tag, but it does NOT change the actual character encoding of the file. It looks like I will need to set the 'fileencoding' of the new html buffer to whatever corresponds to the supplied user option, as a separate fix.
>
> Any thoughts?

--
You received this message from the vim_dev maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
Re: Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset
Oh, sorry, I forgot that 'fileencoding' may be empty. This should be handled. I ran into the opposite case: 'fileencoding' is often different from 'encoding' when editing existing files.

Ben Fritz wrote:
> On Aug 26, 9:40 am, Ben Fritz fritzophre...@gmail.com wrote:
> From my understanding, 'fileencoding' is the encoding Vim is supposed to use to read/write the file. So it does make sense to use it, instead of just 'encoding', for the charset of the generated html. Does anyone know why TOhtml has used 'encoding' instead?
>
> One problem with the supplied patch is that Vim will use 'encoding' for a file's encoding if 'fileencoding' is empty. In my setup, it looks like 'fileencoding' is nearly always empty. So the script will need to fall back to 'encoding' if 'fileencoding' is empty. Probably it should also re-detect the charset using 'encoding' when 'fileencoding' is not blank but does not resolve to a valid charset.
>
> Any thoughts? Like I said, I've never needed to mess with 'encoding' or 'fileencoding' in my daily use of Vim.
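The fallback rule discussed in this message can be sketched in C (the real 2html.vim is Vim script, so this only illustrates the decision logic). The function names and the two-entry mapping table are hypothetical, not taken from the patch: prefer 'fileencoding', fall back to 'encoding' when it is empty, and fall back again when it does not map to a known HTML charset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, deliberately tiny mapping from Vim encoding names to HTML
 * charset names. Returns NULL when the name is not in the table. */
const char *html_charset_for(const char *vim_enc)
{
    static const struct { const char *vim; const char *html; } table[] = {
        { "utf-8",  "UTF-8" },
        { "latin1", "ISO-8859-1" },
    };
    size_t i;

    if (vim_enc == NULL)
        return NULL;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(vim_enc, table[i].vim) == 0)
            return table[i].html;
    return NULL;
}

/* fenc and enc stand for the values of 'fileencoding' and 'encoding'.
 * Returns NULL only if neither value maps to a known charset. */
const char *choose_html_charset(const char *fenc, const char *enc)
{
    const char *charset = NULL;

    /* An empty 'fileencoding' means the file is written using 'encoding'. */
    if (fenc != NULL && fenc[0] != '\0')
        charset = html_charset_for(fenc);

    /* Empty or unresolvable 'fileencoding': re-detect using 'encoding'. */
    if (charset == NULL)
        charset = html_charset_for(enc);
    return charset;
}
```

With these assumptions, choose_html_charset("latin1", "utf-8") picks ISO-8859-1, while an empty or unknown first argument falls back to the charset for "utf-8".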
Suggest ':TOhtml' to use 'fileencoding' rather than 'encoding' as default html charset
I think this would be more reasonable than the current behaviour. If the encoding of the edited text file differs from the system/Vim encoding, it is inconvenient for the default HTML charset to be taken from 'encoding': after ':TOhtml', we have to modify the generated HTML file so that the file encoding and the HTML charset match.

e.g. If the system/Vim encoding is 'UTF-8', but a text file encoding is 'latin-1'. If the default HTML charset is 'encoding', after ':TOhtml', we have to change the HTML charset to 'iso-8859-1', or save the generated HTML file with ':w ++enc=utf-8'. But if the default HTML charset is 'fileencoding', nothing needs to be done after ':TOhtml'.

The changes are in the attachment.

Best regards, Yanwei.

Attachment: tohtml.diff (Description: Binary data)
Re:Re: [Win32] common dialogs of gVim cannot input some Unicode characters from IME
At 2010-08-07 21:57:23, Tony Mechelynck antoine.mechely...@gmail.com wrote:
> On 04/08/10 19:16, JiaYanwei wrote:
>> At 2010-08-04 23:46:23, Bram Moolenaar b...@moolenaar.net wrote:
>>> JiaYanwei wrote:
>>>> For example, I work with Windows XP Simplified Chinese Edition. The character 'CIRCLED NUMBER TWENTY' (U+2473) is beyond the character set of the ACP (system active codepage), CP936. It can be copied and pasted into the textbox of the Find and Replace dialog, but it cannot be entered directly from the Windows IME (the entered character becomes a question mark '?'). This puzzled me for a long time. I finally found the reason: ANSI-version functions such as DispatchMessageA and IsDialogMessageA ignore the WM_WCHAR message.
>>>>
>>>> The attachment 2274_uime.patch.gz is the patch for vim 7.2.446, 2477_uime.patch.gz is for 7.3d revision 2...@mercurial.
>>>
>>> Thanks. Can a few people verify this works OK with different compilers?
>>
>> I have just compiled it with msvc2005 express and mingw, and have also tested it. It works OK.
>>
>> P.S. I got the same warning many times while compiling with vc2005:
>>
>>   warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
>>
>> This warning is useful in the IDE, since the source may be modified by it. But we don't compile Vim with the IDE, so... could we add /wd4819 to CFLAGS to disable it?
>
> OTOH, instead of having the Unicode codepoint in UTF-8, maybe it should be represented in some sort of escape format? I'm not sure whether \u2473 or \xE2\x91\xB3 or something else is the right representation in this case though.
>
> Of course, you can input any codepoint into Vim (with 'encoding' set to UTF-8) by bypassing the IME, in this case by using Ctrl-V u 2 4 7 3 without the spaces. Or if you use it often, you can assign it to a mapping or make up a keymap (about the latter, see http://vim.wikia.com/wiki/How_to_make_a_keymap ).

Thanks. Maybe I haven't explained it clearly. I just wish I could enter Unicode characters beyond the ACP through the IME (e.g. some Pinyin input method, not by typing the Unicode hex sequence directly) into the Find and Replace dialog of gVim. Maybe the following table helps to explain this more clearly:

                                      gVim   gVim-RP   notepad   notepad-RP
  Copy/paste characters inside ACP     +        +         +          +
  Input characters inside ACP by IME   +        +         +          +
  Copy/paste characters beyond ACP     +        +         +          +
  Input characters beyond ACP by IME   +        -         +          +

  gVim:       main edit window of gVim-win32
  gVim-RP:    the textbox of the Find and Replace dialog of gVim-win32
  notepad:    main edit window of the notepad.exe of Windows
  notepad-RP: the textbox of the Find and Replace dialog of notepad.exe

>>> --
>>> hundred-and-one symptoms of being an internet addict:
>>> 2. You kiss your girlfriend's home page.
>>>
>>> /// Bram Moolenaar -- b...@moolenaar.net -- http://www.Moolenaar.net \\\
>>> /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
>>> \\\ download, build and distribute -- http://www.A-A-P.org ///
>>> \\\ help me help AIDS victims -- http://ICCF-Holland.org ///
>
> Best regards, Tony.
> --
> Violators can be fined, arrested or jailed for making ugly faces at a dog. [real standing law in Oklahoma, United States of America]

Best regards, Yanwei.
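On the escape-format question raised in the quoted text: a minimal C sketch, under the assumption that the goal is a source file containing only ASCII. The byte-escaped form "\xE2\x91\xB3" is exactly the UTF-8 encoding of U+2473 and does not depend on the compiler's code page, whereas a \u2473 escape inside a narrow string is, as far as I know, converted to the compiler's execution character set, which would not help under CP936. The names below are illustrative only.

```c
#include <stddef.h>

/* U+2473 CIRCLED NUMBER TWENTY written as explicit UTF-8 bytes, so the
 * source file itself stays pure ASCII. */
const char circled_twenty_utf8[] = "\xE2\x91\xB3";

/* Encode a code point below U+10000 as UTF-8 and return the number of
 * bytes written (1 to 3). Sketch only: surrogate values are not rejected. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Encoding 0x2473 with utf8_encode_bmp() yields the same three bytes as the string literal.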
[Win32] common dialogs of gVim cannot input some Unicode characters from IME
For example, I work with Windows XP Simplified Chinese Edition. The character 'CIRCLED NUMBER TWENTY' (U+2473) is beyond the character set of the ACP (system active codepage), CP936. It can be copied and pasted into the textbox of the Find and Replace dialog, but it cannot be entered directly from the Windows IME (the entered character becomes a question mark '?'). This puzzled me for a long time. I finally found the reason: ANSI-version functions such as DispatchMessageA and IsDialogMessageA ignore the WM_WCHAR message.

The attachment 2274_uime.patch.gz is the patch for vim 7.2.446, 2477_uime.patch.gz is for 7.3d revision 2...@mercurial.

Best regards, Yanwei.

Attachments:
  2274_uime.patch.gz (Description: GNU Zip compressed data)
  2477_uime.patch.gz (Description: GNU Zip compressed data)
Re: [Win32] common dialogs of gVim cannot input some Unicode characters from IME
At 2010-08-04 23:46:23, Bram Moolenaar b...@moolenaar.net wrote:
> JiaYanwei wrote:
>> For example, I work with Windows XP Simplified Chinese Edition. The character 'CIRCLED NUMBER TWENTY' (U+2473) is beyond the character set of the ACP (system active codepage), CP936. It can be copied and pasted into the textbox of the Find and Replace dialog, but it cannot be entered directly from the Windows IME (the entered character becomes a question mark '?'). This puzzled me for a long time. I finally found the reason: ANSI-version functions such as DispatchMessageA and IsDialogMessageA ignore the WM_WCHAR message.
>>
>> The attachment 2274_uime.patch.gz is the patch for vim 7.2.446, 2477_uime.patch.gz is for 7.3d revision 2...@mercurial.
>
> Thanks. Can a few people verify this works OK with different compilers?

I have just compiled it with msvc2005 express and mingw, and have also tested it. It works OK.

P.S. I got the same warning many times while compiling with vc2005:

  warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss

This warning is useful in the IDE, since the source may be modified by it. But we don't compile Vim with the IDE, so... could we add /wd4819 to CFLAGS to disable it?
[Win32] some dialog boxes of gVim don't support Unicode
The dialogs in question are popped up by the function inputdialog() and the commands promptfind and promptrepl. gVim cannot get the correct input from these dialogs if the input contains any Unicode character beyond the character set of the ACP (system active codepage), even if gVim runs under Windows NT with 'enc=utf-8' set. In fact, if 'encoding' is set to UTF-8 or any other encoding that differs from the ACP, there may be more problems getting input from these dialogs, since no encoding conversion is done.

Here's a patch. It detects the Windows version when these dialogs are used. On Windows NT, the wide versions of the Windows API are used instead of the non-wide versions to get the input, and the wide string is then converted to the encoding used inside gVim.

Best regards, Yanwei.

Attachment: for72069.gz (Description: application/gzip-compressed)
Gvim for Windows doesn't handle non-BMP characters when interchanging data with Windows OS
When interchanging data with Windows, for example in clipboard operations, gvim converts the text to UCS-2. Unlike UTF-16, UCS-2 cannot encode non-BMP characters.

For example, when pasting the non-BMP character U+248BB from the Windows clipboard, two separate characters, d852 and dcbb, are inserted. This is caused by the function ucs2_to_utf8() in src/os_mswin.c, which treats the two halves of a surrogate pair as separate Unicode characters and converts them into the invalid UTF-8 sequence 0xED 0xA1 0x92 0xED 0xB2 0xBB; the correct UTF-8 sequence is 0xF0 0xA4 0xA2 0xBB.

Similarly, when copying the non-BMP character U+248BB to the Windows clipboard, the clipboard content will be U+48BB, because the function utf8_to_ucs2() in src/os_mswin.c casts the integer 0x248BB to the short integer 0x48BB.

The attachment is a patch. Surrogate pair handling has been added to the two functions mentioned above. With it, non-BMP characters are interchanged correctly with the Windows clipboard, as I have tested:

Non-BMP character paste from / copy into Windows clipboard

+------------------+---------------------------------+----------------------------+
|                  | Windows XP with GB18030 support | Windows 98                 |
+------------------+---------------------------------+----------------------------+
| editing UTF-* or | before patch: works badly       | before patch: works badly  |
| UCS-4* text      | after patch: works OK           | after patch: works OK      |
+------------------+---------------------------------+----------------------------+
| editing GB18030  | before patch: works badly       | (cannot edit GB18030 text) |
| text             | after patch: works OK           |                            |
+------------------+---------------------------------+----------------------------+

B.T.W.: I think it would be better to rename the functions mentioned above to utf16_to_utf8 and utf8_to_utf16.

Best regards, Yanwei.

Attachment: for72025.tgz (Description: Binary data)
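The surrogate handling described in this report can be sketched in standalone C as follows. This is not the attached patch: the function names and signatures are made up for illustration. It shows how a surrogate pair combines into one code point, how a non-BMP code point splits back into a pair, and how the result encodes as UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from UTF-16 code units (len must be at least 1).
 * Returns the number of units consumed (1 or 2). An unpaired surrogate is
 * passed through unchanged; a real implementation may want to reject it. */
size_t utf16_decode(const uint16_t *in, size_t len, uint32_t *cp)
{
    uint16_t hi = in[0];

    if (hi >= 0xD800 && hi <= 0xDBFF && len >= 2
            && in[1] >= 0xDC00 && in[1] <= 0xDFFF)
    {
        *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                                        | (uint32_t)(in[1] - 0xDC00));
        return 2;
    }
    *cp = hi;
    return 1;
}

/* Encode one code point (at most U+10FFFF) as UTF-16.
 * Returns the number of units written (1 or 2). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0x10000)
    {
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    out[0] = (uint16_t)cp;  /* BMP: a single unit, no truncation involved */
    return 1;
}

/* Encode one code point as UTF-8. Returns the number of bytes (1 to 4). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For the character in the report, the units d852 dcbb decode to U+248BB, which encodes as F0 A4 A2 BB in UTF-8 and splits back into d852 dcbb on the copy path.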
Re: Gvim for Windows doesn't handle non-BMP characters when interchanging data with Windows OS
Hello Tony,

It's really to be the similar problem, but this one only arises under the Windows operating system, while the UTF-16le BOM problem is platform-independent. I was uncertain whether a combined patch would be convenient.

On 2008-10-22 23:21:11, Tony Mechelynck wrote:
> I expect this is related to the UTF-16le BOM problem you noticed this past Saturday. Maybe a combined patch would be OK, since in both cases the problem involves using UCS-2 (where surrogates are undefined) instead of UTF-16 (where surrogate pairs encode codepoints above the BMP)?

Best regards, Yanwei.
Re:Re: Gvim for Windows doesn't handle non-BMP characters when interchanging data with Windows OS
Oh, I made a mistake: in the first sentence I meant to say "They're really similar problems".

On 2008-10-23 00:16:20, JiaYanwei wrote:
> Hello Tony,
>
> It's really to be the similar problem, but this one only arises under the Windows operating system, while the UTF-16le BOM problem is platform-independent. I was uncertain whether a combined patch would be convenient.
Encoding recognizing problem with 2 byte BOM FF FE
For a 2 byte BOM FF FE, ucs-2le is used, which doesn't work for little-endian UTF-16 text. This is like patch 7.1.261; the only difference is the byte order. I have also written a patch for Vim-7.2.025:

*** ../vim-7.2.025/src/fileio.c Wed Oct 15 15:09:56 2008
--- src/fileio.c        Sat Oct 18 11:42:25 2008
***************
*** 5550,5559 ****
            name = "ucs-4le";   /* FF FE 00 00 */
            len = 4;
        }
!       else if (flags == FIO_ALL || flags == (FIO_UCS2 | FIO_ENDIAN_L))
!           name = "ucs-2le";   /* FF FE */
!       else if (flags == (FIO_UTF16 | FIO_ENDIAN_L))
            name = "utf-16le";  /* FF FE */
      }
      else if (p[0] == 0xfe && p[1] == 0xff
            && (flags == FIO_ALL || flags == FIO_UCS2 || flags == FIO_UTF16))
--- 5550,5561 ----
            name = "ucs-4le";   /* FF FE 00 00 */
            len = 4;
        }
!       /* For little endian: default to utf-16, it works also for ucs-2 text. */
!       else if (flags == FIO_ALL || flags == (FIO_UTF16 | FIO_ENDIAN_L))
            name = "utf-16le";  /* FF FE */
+       else if (flags == (FIO_UCS2 | FIO_ENDIAN_L))
+           name = "ucs-2le";   /* FF FE */
+
      }
      else if (p[0] == 0xfe && p[1] == 0xff
            && (flags == FIO_ALL || flags == FIO_UCS2 || flags == FIO_UTF16))

--
Best regards, Yanwei
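The detection order that the patch aims for can be illustrated with a simplified standalone function. This is a sketch, not the code in fileio.c: detect_bom() and its signature are hypothetical, the flags filtering of the real code is left out, and the name returned for the big-endian FE FF case is an assumption.

```c
#include <stddef.h>

/* Return an encoding name for the BOM at the start of p, or NULL if there
 * is none. *bom_len receives the length of the BOM in bytes. */
const char *detect_bom(const unsigned char *p, size_t size, size_t *bom_len)
{
    *bom_len = 0;
    if (size >= 2 && p[0] == 0xFF && p[1] == 0xFE)
    {
        /* FF FE 00 00 is checked first, as in the diff context. */
        if (size >= 4 && p[2] == 0x00 && p[3] == 0x00)
        {
            *bom_len = 4;
            return "ucs-4le";
        }
        /* Plain FF FE: default to utf-16le, which also reads ucs-2le text. */
        *bom_len = 2;
        return "utf-16le";
    }
    if (size >= 2 && p[0] == 0xFE && p[1] == 0xFF)
    {
        *bom_len = 2;
        return "utf-16";  /* big-endian; name assumed for this sketch */
    }
    return NULL;
}
```

Choosing utf-16le as the default is safe because every valid UCS-2 text is also valid UTF-16, while the reverse does not hold once surrogate pairs appear.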
Re:Re: Encoding recognizing problem with 2 byte BOM FF FE
Hello Tony,

Thanks for your helpful suggestion. By the way, I wish Bram a wonderful holiday.

On 2008-10-18 18:18:45, Tony Mechelynck wrote:
> I confirm that Vim 7.2.25 with 'fencs' starting in ucs-bom identifies UTF-16le files with BOM as if they were UCS-2le, even if codepoints above U+ are present, which is an error. For instance U+20025 is read back as (two surrogates shown as distinct characters) instead of as one double-wide character.
>
> Bram, there's work for you when you're back from holiday :-). I'm not competent to check the proposed patch by eyeball but I hope it does what is needed.
>
> Yanwei, in the meantime I suggest the following autocommand (untested) as a workaround which doesn't need recompilation:
>
>   au BufReadPost * if (&fenc == 'ucs-2le') && &bomb
>       \ | e ++enc=utf-16le | endif

Best regards, Yanwei.