Re: Summarizing encoding issues

Tony Mechelynck Wed, 17 Oct 2007 11:25:16 -0700

Ben Schmidt wrote:
>> This is not true.  In fact, if the file contains "señor" instead of
>> "ññ", Vim does resort to Latin1.  This said, Vim's failure here does
>> sound like a bug.  But I would like to hear from Bram first.
> 
> Well spotted, Yongwei. So there is something more subtle about this bug,
> and I believe it is this:
> 
> Vim doesn't recognise a file as invalid UTF-8 if, when it reaches the first
> invalid sequence, there are fewer bytes left in the file than would be
> required to read a valid sequence beginning with the lead byte just read,
> i.e. if the last byte in the file is C2-DF, or one of the last two bytes is
> E0-EF, or one of the last three bytes is F0-F4. As these sequences would
> take 2, 3 and 4 bytes respectively to read a valid character, and there are
> not that many bytes in the file, Vim finishes its analysis thinking 'valid'
> as it hasn't read a 'whole invalid character'. :-)
> 
> This is a very specific scenario, though. Question for Dervish: was it just
> with this small test case that you noticed the problem, or does it occur
> elsewhere?!
> 
>> As I stated in another message, it looks to me when Vim reads from
>> stdin, the content is already interpreted in termencoding.  I have not
>> yet found other results.
> 
> This isn't true. I can set termencoding to e.g. big5 but Vim will read the
> input as latin1 or utf8 and thus display question marks as the ñ cannot be
> represented. On the other hand, with tenc=utf8 I can set fencs to big5 on
> the commandline (vim --cmd 'set fencs=big5' -) and have the <f1>
> interpreted and displayed as Chinese.
> 
> So I don't know about your Vim, but mine behaves exactly the same way whether 
> something is pumped into stdin or opened as a regular file from disk, using 
> fencs.
> 
> I wonder if this behaviour could be platform-specific or depend on which
> libraries are available/compiled in. Because we both seem to have
> solutions, but neither of them works for the other person.
> 
> Hmmmm.
> 
> Ben.


Correction to my previous posts:

With a file consisting only of 0xF1 0xF1 0x0A, "vim file" and "vim - <file" 
both display <f1><f1> even on my Linux system. The first byte (0xF1) would be 
the head byte of a 4-byte sequence (for a codepoint in the range U+40000 - 
U+7FFFF) if it were valid UTF-8. But there are only 3 bytes in the file, 
including the ending linefeed.
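
For anyone who wants to check the byte arithmetic outside Vim, here is a
minimal Python sketch. It is not Vim's detection code; the helper names
(expected_length, is_valid_utf8, first_truncated_lead) are made up for this
illustration, and the lead-byte ranges are the ones quoted above. It only
shows that the three-byte file is invalid UTF-8 and that the first lead byte
asks for more bytes than the file contains.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, or 0 if the byte
    cannot start a sequence (continuation byte or invalid value)."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def is_valid_utf8(data: bytes) -> bool:
    """Strict check using Python's own decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def first_truncated_lead(data: bytes) -> int:
    """Offset of the first lead byte whose sequence would run past the end
    of the data, or -1 if there is none. This is a naive scan: it does not
    validate continuation bytes, it only counts."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            i += 1          # not a lead byte, skip it
            continue
        if i + n > len(data):
            return i        # sequence would need bytes past end of file
        i += n
    return -1


data = bytes([0xF1, 0xF1, 0x0A])   # the test file: F1 F1 LF

print(expected_length(data[0]))     # length the first byte asks for
print(len(data))                    # bytes actually in the file
print(is_valid_utf8(data))          # strict decoder's verdict
print(first_truncated_lead(data))   # where the short sequence starts
print(repr(data.decode("latin-1"))) # what a Latin1 fallback would show
```

A strict decoder rejects the file, so the Latin1 fallback ("ññ") is what one
would expect to see; <f1><f1> suggests the file was accepted as UTF-8 with
the bytes flagged as illegal.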


Best regards,
Tony.
-- 
"Consequences, Schmonsequences, as long as I'm rich."
                -- "Ali Baba Bunny" [1957, Chuck Jones]
