On 11/06/09 15:14, Matt Wozniski wrote:
>
> Bram Moolenaar wrote:
>>
>> Matt Wozniski wrote:
>>
>>> Well, keeping in mind that vim will use utf-8 internally even if you
>>> explicitly :set enc=utf-16, maybe the best fix would be to always
>>> change &encoding to 'utf-8' whenever doing a :set
>>> encoding=SomethingUnicode?  It seems like it would fix this bug.  This
>>> bug, as far as I can tell from a quick glance, is because vim tries to
>>> convert from UTF-16 (&enc) to UTF-8 (&fenc) when writing the file, and
>>> since the buffer is being internally stored as UTF-8 this is the wrong
>>> thing to do.
>>
>> The main reason one would set 'encoding' to utf-16 is when this should
>> be the default file format.  On MS-Windows some files are utf-16, if you
>> are editing a whole bunch of them this could be useful (even though
>> using utf-8 should work).
>
> Well, that's another thing that has never worked, then.  When 'enc' is
> 'utf-16' and 'fenc' is unset, files are written out in utf-8, not
> utf-16.
>
> Simple testcase:
>
>     vim -u NONE -N --cmd 'set enc=utf-16 fenc= | exe "normal! i\<C-k>`e" | w !iconv -f utf-16' -c 'q!'
>     iconv: incomplete character or shift sequence at end of buffer
>     shell returned 1
>
> Change the '-f utf-16' to '-f utf-8' and iconv confirms that it's being
> passed valid utf-8.
>
> Is the desired behavior even well defined?  The docs seem to contradict;
> :help 'encoding' says:
>
>     When "unicode", "ucs-2" or "ucs-4" is used, Vim internally uses utf-8.
>
> but :help 'fileencoding' says:
>
>     When 'fileencoding' is empty, the same value as 'encoding' will be
>     used (no conversion when reading or writing a file).
>
> In this case, 'fileencoding' is empty, but conversion *is* supposed to
> occur when writing the file (from the internal utf-8 buffer to the
> 'encoding' utf-16).
>
>> I don't think finding one bug is a good reason to drop support for this.
>> It's probably easy to fix.
>
> ~Matt

I'm not Bram, so take my opinions below with a grain of salt; however,
after attentively reading the Vim multibyte docs for years, I believe
that the "desired" (or at least the "least surprising") behaviour would
be:

- If 'encoding' is one of ucs-2, ucs-2le, utf-16, utf-16le, ucs-4,
  ucs-4le (or utf-32, utf-32le, which are aliases for ucs-4, ucs-4le; or
  the *be aliases for ucs-?, utf-??), use utf-8 internally, but convert
  between utf-8 and 'encoding' when reading and writing if
  'fileencoding' is empty. Vim ought to be able to do these conversions
  without calling iconv; they are trivial. (The "least trivial" of them,
  I think, is converting between UTF-16 surrogate pairs and the UTF-8
  representation for codepoints in the range U+10000 - U+10FFFF, but
  even that is systematic, and documented with no ambiguity somewhere on
  the Unicode site and, IIRC, even on Wikipedia.)

- With the same values of 'encoding', when 'fileencoding' is nonempty,
  always pass UTF-8 as the "internal encoding" when invoking iconv for
  reading or writing. The same of course applies when "bypassing" iconv,
  e.g. when 'fileencoding' is latin1.

- With other values of 'encoding' (including utf-8), 'encoding'
  represents the actual memory representation. This is the "general
  case" and is what is documented wherever the Vim help doesn't
  explicitly say otherwise.


Best regards,
Tony.
-- 
Line Printer paper is strongest at the perforations.
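
For illustration only, the surrogate-pair conversion mentioned in the
first point above can be sketched in a few lines of Python. This is not
how Vim implements it; the function names (encode_utf8, to_surrogates,
utf16_units_to_utf8) are made up for this example, and the input is
assumed to be a list of 16-bit code units whose byte order has already
been resolved.

```python
def encode_utf8(cp):
    """Encode one Unicode code point (0..0x10FFFF) as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                      # 20 significant bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf16_units_to_utf8(units):
    """Convert a list of 16-bit code units to UTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i >= len(units) or not 0xDC00 <= units[i] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u                        # BMP character, one unit
        out += encode_utf8(cp)
    return bytes(out)
```

The result can be cross-checked against Python's own codecs, e.g. by
comparing utf16_units_to_utf8([0xD834, 0xDD1E]) with
"\U0001D11E".encode("utf-8").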
