Re: Summarizing encoding issues

2007-10-18 Thread Tony Mechelynck

DervishD wrote:
 Hi Yongwei :)
 
  * Yongwei Wu [EMAIL PROTECTED] dixit:
 On 17/10/2007, Ben Schmidt [EMAIL PROTECTED] wrote:
 Note that because of this buggy behaviour, Vim's default value for
 fencs is non-sensical: it will always succeed when it gets to utf-8
 when enc=utf-8 without trying default or latin1, even if the file is
 invalid as utf-8.
 This is not true.  In fact, if the file contains señor instead of
 ññ, Vim does resort to Latin1.  This said, Vim's failure here does
 sound like a bug.  But I would like to hear from Bram first.
 
 Exactly! I was just testing with some kind of corner case. ññ was the
 first thing I wrote fast and it stayed for my tests!. If I use ññ  all
 works OK. Looks like the file must be longer than two bytes or vim gets
 confused.
 
 I have to make again all my tests. First quick'n'dirty test is correct:
 doing cat file | vim - shows the characters correctly if the file is
 longer than two bytes (not taking into account line endings).
 
 Thanks a lot for pointing!
 
 Raúl Núñez de Arenas Coronado
 

Tip:

vim - < file

is equivalent to

cat file | vim -

and executes one less program.


Best regards,
Tony.
-- 
Nondeterminism means never having to say you are wrong.

--~--~-~--~~~---~--~~
You received this message from the vim_dev maillist.
For more information, visit http://www.vim.org/maillist.php
-~--~~~~--~~--~--~---



Re: Summarizing encoding issues

2007-10-17 Thread Ben Schmidt

 First scenario:
 
 set enc=default
 set fenc=latin1
 set fencs=ucs-bom,utf-8,latin1
 set tenc=latin1
 
 vim file            -- Correct (fenc=latin1)
 vim file8           -- Correct (fenc=utf8)
 cat file8 | view -  -- Correct (fenc=)
 
 
 Second scenario:
 
 set enc=utf8
 set fenc=latin1
 set fencs=ucs-bom,utf-8,latin1
 set tenc=latin1
 
 vim file            -- INCORRECT (fenc=latin1)
 vim file8           -- Correct   (fenc=utf8)
 cat file  | view -  -- INCORRECT (fenc=)
 cat file8 | view -  -- Correct   (fenc=)

Can you double check the value of fenc for the 'vim file' case? I get
'fenc=utf-8' (and display is incorrect, understandably).

Anyway, I think you have found a Vim bug here.

CCing this mail to vim_dev.

The bug is as follows: When Vim gets to the fencs entry that matches enc, as it
doesn't need to convert the file, it simply reads it into the buffer. The bug
is that Vim does this whether the file is valid for that encoding or not.
Expected behaviour: Vim only loads the file without conversion if the file is
valid for the encoding; if not, it should move to the next entry in fencs.
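
The expected behaviour described above can be sketched in a few lines. This is only an illustration in Python, not Vim's implementation: detect_encoding is a hypothetical helper, Python codec names stand in for the fencs entries, and the ucs-bom entry is left out.

```python
def detect_encoding(data, candidates):
    """Return the first candidate that decodes `data` strictly, else None.

    Hypothetical helper mirroring the expected fencs behaviour: an entry
    is accepted only if the whole file is valid in that encoding.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid input
        except UnicodeDecodeError:
            continue  # invalid for this entry: move on to the next one
        return name
    return None


latin1_file = "\u00f1\u00f1\n".encode("latin-1")  # bytes F1 F1 0A
utf8_file = "\u00f1\u00f1\n".encode("utf-8")      # bytes C3 B1 C3 B1 0A

print(detect_encoding(latin1_file, ["utf-8", "latin-1"]))  # latin-1
print(detect_encoding(utf8_file, ["utf-8", "latin-1"]))    # utf-8
```

With a strict check like this, the Latin1 file falls through from utf-8 to latin1 instead of being accepted at the first entry.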

Note that because of this buggy behaviour, Vim's default value for fencs is
non-sensical: it will always succeed when it gets to utf-8 when enc=utf-8
without trying default or latin1, even if the file is invalid as utf-8.

Further note that fixing this may cause difficulties when reading from a stdin
which can't be rewound, so once fixed, setting fencs prior to reading stdin may
become more important to avoid read failures. Previous posts from me and others
have explained how to do that if you are unsure, though it looks like you're
pretty savvy. I thought this was worth mentioning as this is how the whole
thread started!

Cheers,

Ben.







Re: Summarizing encoding issues

2007-10-17 Thread Ben Schmidt

 This is not true.  In fact, if the file contains señor instead of
 ññ, Vim does resort to Latin1.  This said, Vim's failure here does
 sound like a bug.  But I would like to hear from Bram first.

Well spotted, Yongwei. So there is something more subtle about this bug, and I
believe it is this:

Vim doesn't recognise a file as invalid utf8 if, when you get to the first
invalid sequence, there are less bytes in the file than would be required to
read a valid sequence beginning with the unicode leader character read. I.e. if
the last byte in the file is C2-DF, or one of the last two bytes is E0-EF or
one of the last three bytes is F0-F4. As these sequences would take 2, 3 and 4
bytes respectively to read a valid character, and there are not that many bytes
in the file, Vim finishes its analysis thinking 'valid' as it hasn't read a
'whole invalid character'. :-)
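
A toy model of the behaviour hypothesised above, in Python. It is not taken from Vim's source: looks_valid and its lenient_at_eof switch are made up for illustration, and the check is simplified (it ignores overlong forms and surrogates). With the switch on, a sequence cut short by end of file is never judged, which is the suspected flaw.

```python
def expected_length(lead):
    """Bytes a UTF-8 sequence starting with `lead` should occupy (0 = bad lead)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that can never start a sequence


def looks_valid(data, lenient_at_eof):
    """Simplified UTF-8 check; `lenient_at_eof` models the suspected bug."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            return False
        if i + n > len(data):
            # Fewer bytes left than the lead byte announces.
            return lenient_at_eof
        if any((b & 0xC0) != 0x80 for b in data[i + 1:i + n]):
            return False  # a whole sequence was read and it is invalid
        i += n
    return True


print(looks_valid(b"\xf1\xf1\n", lenient_at_eof=True))   # True: 3 bytes, F1 wants 4
print(looks_valid(b"se\xf1or\n", lenient_at_eof=True))   # False: F1 6F 72 0A is read
print(looks_valid(b"\xf1\xf1\n", lenient_at_eof=False))  # False: truncation rejected
```

The first two calls correspond to the two Latin1 test files discussed in this thread: the short one slips through, the longer one is caught.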

This is a very specific scenario, though. Question for Dervish: was it just
with this small test case that you noticed the problem, or does it occur
elsewhere?!

 As I stated in another message, it looks to me when Vim reads from
 stdin, the content is already interpreted in termencoding.  I have not
 yet found other results.

This isn't true. I can set termencoding to e.g. big5 but Vim will read the
input as latin1 or utf8 and thus display question marks as the ñ cannot be
represented. On the other hand, with tenc=utf8 I can set fencs to big5 on the
commandline (vim --cmd 'set fencs=big5' -) and have the f1 interpreted and
displayed as Chinese.

So I don't know about your Vim, but mine behaves exactly the same way whether
something is pumped into stdin or opened as a regular file from disk, using
fencs.

I wonder if this behaviour could be platform-specific or depend on which
libraries are available/compiled in. Because we both seem to have solutions,
but neither of them works for the other person.

H.

Ben.







Re: Summarizing encoding issues

2007-10-17 Thread Tony Mechelynck


Correction to my previous posts:

With a file consisting only of 0xF1 0xF1 0x0A, vim file and vim - < file
both display <f1><f1> even on my Linux system. The first byte (0xF1) would be
the head byte of a 4-byte sequence (for a codepoint in the range U+40000 -
U+7FFFF) if it were valid UTF-8. But there are only 3 bytes in the file,
including the ending linefeed.
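
The code point range for a 4-byte head byte follows from the UTF-8 bit layout (3 payload bits in the head byte, 6 in each of the three continuation bytes). A small Python sketch of that arithmetic; four_byte_lead_range is a hypothetical name used only for illustration.

```python
def four_byte_lead_range(lead):
    """Code point range reachable from a 4-byte UTF-8 head byte (F0-F4)."""
    if not 0xF0 <= lead <= 0xF4:
        raise ValueError("not a 4-byte UTF-8 head byte")
    low = (lead & 0x07) << 18   # 3 payload bits from the head byte
    high = low | 0x3FFFF        # 18 payload bits from 3 continuation bytes
    # Clamp to what UTF-8 allows: no overlong forms, nothing above U+10FFFF.
    return max(low, 0x10000), min(high, 0x10FFFF)


low, high = four_byte_lead_range(0xF1)
print("U+%X - U+%X" % (low, high))  # U+40000 - U+7FFFF
```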


Best regards,
Tony.
-- 
Consequences, Schmonsequences, as long as I'm rich.
-- Ali Baba Bunny [1957, Chuck Jones]




Re: Summarizing encoding issues

2007-10-17 Thread Yongwei Wu

On 18/10/2007, Ben Schmidt [EMAIL PROTECTED] wrote:

  This is not true.  In fact, if the file contains señor instead of
  ññ, Vim does resort to Latin1.  This said, Vim's failure here does
  sound like a bug.  But I would like to hear from Bram first.

 Well spotted, Yongwei. So there is something more subtle about this
 bug, and I believe it is this:

 Vim doesn't recognise a file as invalid utf8 if, when you get to the
 first invalid sequence, there are less bytes in the file than would
 be required to read a valid sequence beginning with the unicode
 leader character read. I.e. if the last byte in the file is C2-DF,
 or one of the last two bytes is E0-EF or one of the last three bytes
 is F0-F4. As these sequences would take 2, 3 and 4 bytes
 respectively to read a valid character, and there are not that many
 bytes in the file, Vim finishes its analysis thinking 'valid' as it
 hasn't read a 'whole invalid character'. :-)

 This is a very specific scenario, though. Question for Dervish: was
 it just with this small test case that you noticed the problem, or
 does it occur elsewhere?!

  As I stated in another message, it looks to me when Vim reads from
  stdin, the content is already interpreted in termencoding.  I have not
  yet found other results.

 This isn't true. I can set termencoding to e.g. big5 but Vim will
 read the input as latin1 or utf8 and thus display question marks as
 the ñ cannot be represented. On the other hand, with tenc=utf8 I can
 set fencs to big5 on the commandline (vim --cmd 'set fencs=big5' -)
 and have the f1 interpreted and displayed as Chinese.

Sorry, it seems my previous tests were faulty, probably because the
default value of fencs makes sense.  Now I see the behaviour is good
as you described.

With my test file (normal Latin1 text), this works well:

cat test.txt|vim -u NONE - --cmd 'set enc=utf-8 tenc=latin1' \
    -c 'set fenc=latin1'

With Dervish's original test file, this does not work.  I have to use:

cat test.txt|vim -u NONE - --cmd 'set enc=utf-8 tenc=latin1 fencs=latin1' \
    -c 'set fenc=latin1'

So all makes sense, and no bugs are seen.  The problems are because
of a very strange test case.

Best regards,

Yongwei

-- 
Wu Yongwei
URL: http://wyw.dcweb.cn/
