Ben Schmidt wrote:
>> This is not true. In fact, if the file contains "señor" instead of
>> "ññ", Vim does resort to Latin1. This said, Vim's failure here does
>> sound like a bug. But I would like to hear from Bram first.
>
> Well spotted, Yongwei. So there is something more subtle about this bug,
> and I believe it is this:
>
> Vim doesn't recognise a file as invalid utf8 if, when you get to the first
> invalid sequence, there are less bytes in the file than would be required
> to read a valid sequence beginning with the unicode leader character read.
> I.e. if the last byte in the file is C2-DF, or one of the last two bytes is
> E0-EF or one of the last three bytes is F0-F4. As these sequences would
> take 2, 3 and 4 bytes respectively to read a valid character, and there are
> not that many bytes in the file, Vim finishes its analysis thinking 'valid'
> as it hasn't read a 'whole invalid character'. :-)
>
> This is a very specific scenario, though. Question for Dervish: was it just
> with this small test case that you noticed the problem, or does it occur
> elsewhere?!
>
>> As I stated in another message, it looks to me when Vim reads from
>> stdin, the content is already interpreted in termencoding. I have not
>> yet found other results.
>
> This isn't true. I can set termencoding to e.g. big5 but Vim will read the
> input as latin1 or utf8 and thus display question marks as the ñ cannot be
> represented. On the other hand, with tenc=utf8 I can set fencs to big5 on
> the commandline (vim --cmd 'set fencs=big5' -) and have the <f1> interpreted
> and displayed as Chinese.
>
> So I don't know about your Vim, but mine behaves exactly the same way
> whether something is pumped into stdin or opened as a regular file from
> disk, using fencs.
>
> I wonder if this behaviour could be platform-specific or depend on which
> libraries are available/compiled in. Because we both seem to have
> solutions, but neither of them works for the other person.
>
> Hmmmm.
>
> Ben.
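
To make the hypothesis quoted above concrete, here is a minimal Python sketch. It is not Vim's code and is not taken from Vim's source; the names lenient_is_utf8 and strict_is_utf8 are made up for illustration. The lenient checker models the suspected behaviour (a sequence cut off by end-of-file is never judged, so the check ends with "valid"), and the strict one simply asks Python's own decoder.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead (0x80-0xC1, 0xF5-0xFF)


def lenient_is_utf8(data: bytes) -> bool:
    """Hypothetical model of the suspected bug, NOT Vim's implementation:
    a sequence truncated by end-of-data is skipped instead of rejected."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            return False
        if i + n > len(data):
            # Too few bytes left to read a whole character: the analysis
            # stops here still thinking 'valid'.
            return True
        try:
            data[i:i + n].decode("utf-8")  # check one complete sequence
        except UnicodeDecodeError:
            return False
        i += n
    return True


def strict_is_utf8(data: bytes) -> bool:
    """Reference check: let Python's decoder judge the whole buffer."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

With the Latin1 bytes for "ññ" plus a linefeed (0xF1 0xF1 0x0A), the lenient model reports valid and the strict one reports invalid. With the Latin1 bytes for "señor" plus a linefeed, both report invalid, because four bytes are available after the 0xF1 and the sequence can be read in full and rejected. That matches the two observations in the quoted text, though it says nothing about what Vim actually does internally.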

Correction to my previous posts:

With a file consisting only of 0xF1 0xF1 0x0A, "vim file" and "vim - <file"
both display <f1><f1> even on my Linux system. The first byte (0xF1) would be
the head byte of a 4-byte sequence (for a codepoint in the range U+40000 -
U+7FFFF) if it were valid UTF-8. But there are only 3 bytes in the file,
including the ending linefeed.

Best regards,
Tony.
--
"Consequences, Schmonsequences, as long as I'm rich."
    -- "Ali Baba Bunny" [1957, Chuck Jones]

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---