Re: editing a _corrupted_ CP1252 file

Kenneth Reid Beesley Tue, 26 Jan 2016 12:11:47 -0800

> 
> editing a _corrupted_ CP1252 file      
> <http://groups.google.com/group/vim_use/t/d6874651567bc841?utm_source=digest&utm_medium=email>
>       
> Kenneth Reid Beesley: Jan 25 11:18AM -0700
> 
> 
> I know that some of my files are _supposed_ to be CP1252.
> But beforehand I don’t know if or how they are corrupted. Usually the problem in a 
> corrupted file is the presence of \x81, \x8D, \x8F, \x90 and/or \x9D bytes, 
> which are illegal/undefined bytes in CP1252.
> The files are programs, so I need to zero in on each invalid byte (invalid 
> for CP1252), figure out what’s going on, and edit it appropriately.
> So it needs to be done by hand. (There are not a lot of such bad characters.)
>  
> Again, the problem is that if I (try to) edit a corrupted file as CP1252 with 
> :e ++enc=cp1252, the bad bytes get silently replaced in the buffer with 
> question marks, which hides the problem rather than helping me find the bad 
> bytes.
> If I use ‘tr’ to replace the illegal bytes with some kind of valid bytes, 
> then the problems are just hidden some other way.
> If I try to edit a file as CP1252, using :e ++enc=cp1252, and the file 
> contains invalid bytes, then I need alarm bells to go off somehow.
>  
> 
>  
> 
> 
> Erik Christiansen: Jan 26 02:58PM +1100
> 
> On 25.01.16 11:18, Kenneth Reid Beesley wrote:
> > (invalid for CP1252), figure out what’s going on, and edit it
> > appropriately. So it needs to be done by hand. (There are not a lot
> > of such bad characters.)
>  
> Ah, not simply remapping, then. For UTF-8, Vim has the "8g8" command, to
> hop to the next encoding violation. Unfortunately, there's no mention
> there of any ability to do that for CP1252.
>  
> What happens if you have fenc=utf-8, open the cp1252 file, and press 8g8 ?
>  
> Erik


Thanks again, Erik,

Here’s my usual .gvimrc setup for encodings:

"  encoding used internally in the edit buffer
set encoding=utf-8

"  default encoding for saving any new files created
setglobal fileencoding=utf-8

"  when editing an existing file, try to read it in these encodings, and use
"  the first that succeeds
set fileencodings=ucs-bom,utf-8,latin1

***************

Here’s a little test.  First I create a little file that has bytes for ‘a’, 
‘b’, ‘c’ and \x81 (which is undefined in both ISO-8859-1 (Latin 1) and 
CP1252, also known as Windows-1252):

$ printf "\x61\x62\x63\x81" > corrupted.txt

I confirm that the bytes are as expected using

$ od -t x1 corrupted.txt

If I try to iconv the file from cp1252 to UTF-8

$ iconv -f cp1252 -t UTF-8 corrupted.txt > utf8.txt

iconv chokes appropriately on the \x81 byte, outputting the error message

“iconv: corrupted.txt:1:2: cannot convert”

If I simply try to gvim the file, with the fileencodings as shown above, 

$ gvim corrupted.txt

the fileencoding gets set to latin1 (the last option in fileencodings), and the 
offending \x81 byte gets displayed as <81> (in blue).
The blue <81> represents a single byte, and the command 8g8 (which you 
suggested) moves the cursor to that byte.  That’s not bad.  
At least I can find the offending bytes in a corrupted file.
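
For locating the bad bytes outside Vim, a small POSIX shell sketch can list where each CP1252-undefined byte sits. The function name find_bad_cp1252 is hypothetical (it is not part of Vim or iconv), and the sketch assumes only that od and awk are available. It reports 1-based line and byte-column positions in the raw file, which need not match screen columns in the editor.

```shell
# Hypothetical helper, not part of Vim or iconv: print "line:column:hex" for
# every byte that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
# Columns are 1-based byte positions within each line of the raw file.
find_bad_cp1252() {
    # od dumps the file as one hex pair per byte; awk walks the pairs in order.
    od -An -v -t x1 "$1" | awk '
        BEGIN { line = 1; col = 0 }
        {
            for (i = 1; i <= NF; i++) {
                col++
                if ($i ~ /^(81|8d|8f|90|9d)$/)
                    printf "%d:%d:%s\n", line, col, $i
                # A newline byte (0x0a) starts the next line.
                if ($i == "0a") { line++; col = 0 }
            }
        }'
}
```

Run as "find_bad_cp1252 corrupted.txt"; for the four-byte test file above it should print 1:4:81 (line 1, byte 4, value \x81). A file with no such bytes produces no output.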

********* Tests: Fiddling with fileencodings

\x81 is undefined in both ISO-8859-1 and in CP1252
\x80 is _defined_ in CP1252 but not assigned in ISO-8859-1 (but see below; 
\x80 seems officially to be a legal but non-graphic “C1” control character in 
latin1)

I create another little test file:

$ printf "\x80\x61\x62\x63\x81" > corrupted2.txt

$ iconv -f latin1 -t utf-8 corrupted2.txt

This, unfortunately, works.  Even though \x80 and \x81 are non-graphic bytes in 
Latin1, they are somehow considered valid, though little used, “C1” control 
bytes.
I’ve never quite understood this.  It makes it very hard to distinguish between 
ISO-8859-1 and CP1252.

$ iconv -f cp1252 -t utf-8 corrupted2.txt

chokes (appropriately) on the \x81 byte, which is not defined in cp1252.

$ iconv -f latin1 -t utf-8 corrupted2.txt

works without complaint.  Sigh.

If I change fileencodings to

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

and

$ gvim corrupted2.txt

the fileencoding gets set to latin1, but with <80> and <81> displayed in blue 
(representing single bytes) in the edit buffer.  The blue <HH> notation
seems to indicate that the byte is non-graphic.  There is no font glyph 
assigned to it.  I can use 8g8 to find the bad bytes displayed in blue as <HH>.

If I change fileencodings to

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1

and

$ gvim corrupted2.txt

then the fileencoding is again detected as (or defaulted to) latin1, with the 
<80> and <81> displayed in blue.  Presumably the cp1252 attempt fails on \x81, 
so gvim falls through to iso-8859-1.

******* testing with a valid cp1252 file

$ printf "\x80\x61\x62\x63" > cp1252.txt

$ gvim cp1252.txt

brings up the file without any blue <HH> bytes.  With cp1252 still listed 
before iso-8859-1 in fileencodings, the file is detected with fileencoding 
cp1252.

If I then change fileencodings to

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

and open the cp1252.txt file

$ gvim cp1252.txt

the encoding is detected as latin1 (because iso-8859-1 gets tried before 
cp1252, and it succeeds because
the value \x80 is a legal “C1” but non-graphic control byte in Latin1).

********** 

Conclusions:

It’s rather hard to test if a file is Latin1 vs CP1252 because Latin1 does 
allow non-graphic “C1” control bytes
in the range \x80 through \x9F.
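
Because every byte value converts successfully as Latin1, conversion success alone cannot separate the two encodings, but counting what actually occurs in the \x80 through \x9F range gives a hint. The sketch below is a heuristic, not a detector. The function name count_high_range is hypothetical, and it assumes od and awk are available. A non-zero "undefined" count means the file cannot be valid CP1252; a non-zero "defined" count suggests CP1252 rather than Latin1, on the assumption that genuine Latin1 text rarely contains C1 control bytes.

```shell
# Hypothetical helper: count bytes in the range 0x80-0x9F, split by whether
# CP1252 assigns a character to them. This is a heuristic, not a detector.
count_high_range() {
    od -An -v -t x1 "$1" | awk '
        BEGIN { ok = 0; bad = 0 }
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(81|8d|8f|90|9d)$/)
                    bad++          # undefined in CP1252
                else if ($i ~ /^[89][0-9a-f]$/)
                    ok++           # defined in CP1252, a C1 control in Latin1
            }
        }
        END { printf "defined=%d undefined=%d\n", ok, bad }'
}
```

For corrupted2.txt (bytes \x80, a, b, c, \x81) this should print "defined=1 undefined=1", and for cp1252.txt it should print "defined=1 undefined=0".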

It seems worthless to set

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

because any valid cp1252 file (or even a file intended to be cp1252 but 
containing illegal, for cp1252, bytes like \x81)
will succeed as iso-8859-1, and the fileencoding will be assigned as iso-8859-1 
(latin1).

Setting

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1

will cause ‘gvim’ to edit a file with fileencoding cp1252 if the file contains 
bytes in the \x80 to \x9F range that are
legal for cp1252.

What’s dangerous (for me) is invoking

$ gvim -c "e ++enc=cp1252" filethatshouldbecp1252.txt

on a file that should be cp1252 but might contain illegal bytes \x81, \x8D, 
\x8F, \x90 and \x9D, because
if such undefined bytes do appear in the file, they get silently converted to 
question-mark characters.
Ideally, alarm bells would go off.  This should fail like iconv does when told 
to convert a file as cp1252
when it isn’t valid cp1252.  At least the illegal/undefined bytes should be 
displayed in blue as <81>
or whatever.  What’s dangerous for me is the silent conversion of invalid 
characters like \x81 to question marks.
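
One way to get the alarm bells is to check the file before Vim ever converts it. The wrapper below is a sketch under assumptions: the names cp1252_clean and edit_as_cp1252 are hypothetical, the byte scan uses od and awk rather than anything built into Vim, and the gvim invocation is the same one shown above.

```shell
# Hypothetical helper: exit status 0 only if the file contains none of the
# bytes that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
cp1252_clean() {
    od -An -v -t x1 "$1" | awk '
        BEGIN { bad = 0 }
        { for (i = 1; i <= NF; i++) if ($i ~ /^(81|8d|8f|90|9d)$/) bad = 1 }
        END { exit bad }'
}

# Hypothetical wrapper: open the file as cp1252 only when the scan is clean;
# otherwise complain on stderr and return non-zero instead of starting gvim.
edit_as_cp1252() {
    if cp1252_clean "$1"; then
        gvim -c "e ++enc=cp1252" "$1"
    else
        echo "$1: contains bytes undefined in CP1252; not opening as cp1252" >&2
        return 1
    fi
}
```

With this in place, "edit_as_cp1252 filethatshouldbecp1252.txt" refuses to open a file containing \x81 or the other undefined bytes, so they never reach the point where they would be turned into question marks.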

Thanks again,

Ken


********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA




