Re: Summarizing encoding issues
DervishD wrote:
> Hi Yongwei :)
>
> * Yongwei Wu [EMAIL PROTECTED] dixit:
>> On 17/10/2007, Ben Schmidt [EMAIL PROTECTED] wrote:
>>> Note that because of this buggy behaviour, Vim's default value for
>>> fencs is non-sensical: it will always succeed when it gets to utf-8
>>> when enc=utf-8 without trying default or latin1, even if the file is
>>> invalid as utf-8.
>>
>> This is not true. In fact, if the file contains señor instead of ññ,
>> Vim does resort to Latin1. This said, Vim's failure here does sound
>> like a bug. But I would like to hear from Bram first.
>
> Exactly! I was just testing with some kind of corner case. ññ was the
> first thing I wrote fast and it stayed for my tests! If I use ññ all
> works OK. Looks like the file must be longer than two bytes or vim gets
> confused. I have to make again all my tests. First quick'n'dirty test
> is correct: doing cat file | vim - shows the characters correctly if
> the file is longer than two bytes (not taking into account line
> endings).
>
> Thanks a lot for pointing!
>
> Raúl Núñez de Arenas Coronado

Tip: vim - < file is equivalent to cat file | vim - and executes one less program.

Best regards,
Tony.
--
Nondeterminism means never having to say you are wrong.

You received this message from the vim_dev maillist.
For more information, visit http://www.vim.org/maillist.php
Re: Summarizing encoding issues
> First scenario:
>
>     set enc=default
>     set fenc=latin1
>     set fencs=ucs-bom,utf-8,latin1
>     set tenc=latin1
>
>     vim file            -- Correct (fenc=latin1)
>     vim file8           -- Correct (fenc=utf8)
>     cat file8 | view -  -- Correct (fenc=)
>
> Second scenario:
>
>     set enc=utf8
>     set fenc=latin1
>     set fencs=ucs-bom,utf-8,latin1
>     set tenc=latin1
>
>     vim file            -- INCORRECT (fenc=latin1)
>     vim file8           -- Correct (fenc=utf8)
>     cat file | view -   -- INCORRECT (fenc=)
>     cat file8 | view -  -- Correct (fenc=)

Can you double check the value of fenc for the 'vim file' case? I get 'fenc=utf-8' (and display is incorrect, understandably).

Anyway, I think you have found a Vim bug here. CCing this mail to vim_dev.

The bug is as follows: When Vim gets to the fencs entry that matches enc, as it doesn't need to convert the file, it simply reads it into the buffer. The bug is that Vim does this whether the file is valid for that encoding or not.

Expected behaviour: Vim only loads the file without conversion if the file is valid for the encoding; if not, it should move to the next entry in fencs.

Note that because of this buggy behaviour, Vim's default value for fencs is non-sensical: it will always succeed when it gets to utf-8 when enc=utf-8 without trying default or latin1, even if the file is invalid as utf-8.

Further note that fixing this may cause difficulties when reading from a stdin which can't be rewound, so once fixed, setting fencs prior to reading stdin may become more important to avoid read failures. Previous posts from me and others have explained how to do that if you are unsure, though it looks like you're pretty savvy. I thought this was worth mentioning as this is how the whole thread started!

Cheers,
Ben.
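The expected behaviour Ben describes (accept an entry only if the file is valid in it, otherwise move on to the next entry) can be sketched in a few lines of Python. This is only an illustration of the idea, not Vim's code: `pick_encoding` is a hypothetical helper, and the candidate names are Python codec names rather than Vim's fencs values.

```python
from typing import List, Optional


def pick_encoding(data: bytes, candidates: List[str]) -> Optional[str]:
    """Return the first candidate encoding in which `data` is valid.

    Hypothetical helper: mirrors the idea of walking a fencs-style list
    and falling through to the next entry when validation fails.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid input
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None


latin1_file = "ññ\n".encode("latin-1")  # bytes F1 F1 0A
utf8_file = "ññ\n".encode("utf-8")      # bytes C3 B1 C3 B1 0A

print(pick_encoding(latin1_file, ["utf-8", "latin-1"]))  # latin-1
print(pick_encoding(utf8_file, ["utf-8", "latin-1"]))    # utf-8
```

Because latin-1 accepts every byte sequence, it only makes sense as the last entry in such a list.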
Re: Summarizing encoding issues
> This is not true. In fact, if the file contains señor instead of ññ,
> Vim does resort to Latin1. This said, Vim's failure here does sound
> like a bug. But I would like to hear from Bram first.

Well spotted, Yongwei. So there is something more subtle about this bug, and I believe it is this: Vim doesn't recognise a file as invalid utf8 if, when you get to the first invalid sequence, there are fewer bytes left in the file than would be required to read a valid sequence beginning with the unicode leader character read. I.e. if the last byte in the file is C2-DF, or one of the last two bytes is E0-EF, or one of the last three bytes is F0-F4. As these sequences would take 2, 3 and 4 bytes respectively to read a valid character, and there are not that many bytes in the file, Vim finishes its analysis thinking 'valid' as it hasn't read a 'whole invalid character'. :-)

This is a very specific scenario, though. Question for Dervish: was it just with this small test case that you noticed the problem, or does it occur elsewhere?!

> As I stated in another message, it looks to me when Vim reads from
> stdin, the content is already interpreted in termencoding. I have not
> yet found other results.

This isn't true. I can set termencoding to e.g. big5 but Vim will read the input as latin1 or utf8 and thus display question marks as the ñ cannot be represented. On the other hand, with tenc=utf8 I can set fencs to big5 on the commandline (vim --cmd 'set fencs=big5' -) and have the f1 interpreted and displayed as Chinese. So I don't know about your Vim, but mine behaves exactly the same way whether something is pumped into stdin or opened as a regular file from disk, using fencs.

I wonder if this behaviour could be platform-specific or depend on which libraries are available/compiled in. Because we both seem to have solutions, but neither of them works for the other person. H.

Ben.
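The truncated-sequence explanation in Ben's message can be modelled with a small Python sketch. It assumes his diagnosis is right and is not taken from Vim's source: `lenient_is_utf8` is a hypothetical checker that gives up and reports 'valid' when a lead byte asks for more bytes than remain, while `strict_is_utf8` relies on Python's own decoder. The lenient model checks only lead and continuation bytes, not overlong forms or surrogates.

```python
def sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte, 0 if invalid."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, or a byte that never starts a sequence


def lenient_is_utf8(data: bytes) -> bool:
    """Hypothetical model of the reported behaviour (not Vim's code)."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0:
            return False
        if i + n > len(data):
            # Not enough bytes left to read a whole character: the model
            # stops here and wrongly concludes the data is valid.
            return True
        if any((b & 0xC0) != 0x80 for b in data[i + 1:i + n]):
            return False  # a byte that should be a continuation is not
        i += n
    return True


def strict_is_utf8(data: bytes) -> bool:
    """Reference check using Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


short_file = "ññ\n".encode("latin-1")     # F1 F1 0A
long_file = "señor\n".encode("latin-1")   # 73 65 F1 6F 72 0A

print(lenient_is_utf8(short_file), strict_is_utf8(short_file))  # True False
print(lenient_is_utf8(long_file), strict_is_utf8(long_file))    # False False
```

The two-character file slips through the lenient check because F1 announces a 4-byte sequence and only 3 bytes exist; in the longer file the bytes after F1 are all present, so the bad continuation byte is seen.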
Re: Summarizing encoding issues
Correction to my previous posts: With a file consisting only of 0xF1 0xF1 0x0A, vim file and vim - < file both display <f1><f1> even on my Linux system. The first byte (0xF1) would be the head byte of a 4-byte sequence (for a codepoint in the range U+40000 - U+7FFFF) if it were valid UTF-8. But there are only 3 bytes in the file, including the ending linefeed.

Best regards,
Tony.
--
Consequences, Schmonsequences, as long as I'm rich.
		-- Ali Baba Bunny [1957, Chuck Jones]
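Tony's arithmetic can be checked quickly in Python. The sketch below only uses Python's standard UTF-8 codec; the variable names are illustrative. It decodes the smallest and largest 4-byte sequences that start with 0xF1 to show the code point range that lead byte covers, then compares the sequence length with the size of the test file.

```python
LEAD = 0xF1  # head byte of a 4-byte UTF-8 sequence

# Smallest and largest well-formed sequences starting with 0xF1.
lowest = bytes([LEAD, 0x80, 0x80, 0x80]).decode("utf-8")
highest = bytes([LEAD, 0xBF, 0xBF, 0xBF]).decode("utf-8")
print(f"U+{ord(lowest):X} .. U+{ord(highest):X}")  # U+40000 .. U+7FFFF

# The test file from the thread: two Latin-1 'ñ' bytes plus a linefeed.
test_file = bytes([0xF1, 0xF1, 0x0A])
needed = 4  # bytes required by a sequence that starts with 0xF1
print(len(test_file), "bytes in file,", needed, "needed")  # 3 bytes in file, 4 needed
```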
Re: Summarizing encoding issues
On 18/10/2007, Ben Schmidt [EMAIL PROTECTED] wrote:
>> As I stated in another message, it looks to me when Vim reads from
>> stdin, the content is already interpreted in termencoding. I have not
>> yet found other results.
>
> This isn't true. I can set termencoding to e.g. big5 but Vim will read
> the input as latin1 or utf8 and thus display question marks as the ñ
> cannot be represented. On the other hand, with tenc=utf8 I can set
> fencs to big5 on the commandline (vim --cmd 'set fencs=big5' -) and
> have the f1 interpreted and displayed as Chinese.

Sorry, it seems my previous tests were faulty, probably because the default value of fencs makes sense. Now I see the behaviour is good as you described.

With my test file (normal Latin1 text), this works well:

    cat test.txt|vim -u NONE - --cmd 'set enc=utf-8 tenc=latin1' -c 'set fenc=latin1'

With Dervish's original test file, this does not work. I have to use:

    cat test.txt|vim -u NONE - --cmd 'set enc=utf-8 tenc=latin1 fencs=latin1' -c 'set fenc=latin1'

So all makes sense, and no bugs are seen. The problems are because of a very strange test case.

Best regards,
Yongwei
--
Wu Yongwei
URL: http://wyw.dcweb.cn/