Re: editing a _corrupted_ CP1252 file

Kenneth Reid Beesley Mon, 25 Jan 2016 10:19:02 -0800

> On 23Jan2016, at 22:44, [email protected] wrote:
> 
> On 22.01.16 17:45, Kenneth Reid Beesley wrote:
> > contain byte values that are undefined for CP1252, e.g. \x81, \x8D, \x8F,
> > \x90 and \x9D.
> > I.e. these are potentially corrupted files that are mostly legal CP1252,
> > should be legal CP1252, and I have to make them legal CP1252.


Eric Christiansen replied:
>  
> Have you considered using e.g. tr to translate everything in one go?
> E.g.
>  
> $ tr '\201\215\217\220\235' 'ABCDE' < filename
>  
> In that line, \201 is octal for \x81, etc. The replacement characters
> could also be specified in octal, if they're sufficiently weird. It
> won't handle unicode, but that's not required here.
>  
> The job could also be done by sed or awk. Doing it by hand seems rather
> laborious.

Thanks for the message, but tr is not a very attractive solution in my case.

I know that the files are _supposed_ to be CP1252, but beforehand I don’t
know whether or how they are corrupted.  Usually the problem in a corrupted
file is the presence of \x81, \x8D, \x8F, \x90 and/or \x9D bytes, which are
illegal/undefined bytes in CP1252.

The files are programs, so I need to zero in on each invalid byte (invalid
for CP1252), figure out what’s going on, and edit it appropriately.  So it
needs to be done by hand.  (There are not a lot of such bad characters.)
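
To make the "zero in on each invalid byte" step concrete, here is a minimal
sketch, assuming Python 3 is available, of a scan that only reports where the
five undefined bytes occur and changes nothing.  The names
find_undefined_bytes and report are hypothetical helpers for illustration,
not part of Vim or of any existing tool.

```python
# The five byte values that CP1252 leaves undefined (the ones named above).
UNDEFINED_CP1252 = frozenset(b"\x81\x8d\x8f\x90\x9d")


def find_undefined_bytes(data: bytes):
    """Return a list of (line, column, byte_value) tuples, all 1-based.

    CP1252 is a single-byte encoding, so the byte column is also the
    character column within the line.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, value in enumerate(line, start=1):
            if value in UNDEFINED_CP1252:
                hits.append((line_no, col, value))
    return hits


def report(path: str) -> None:
    """Print one 'path:line:col: byte 0xNN' entry per undefined byte."""
    with open(path, "rb") as handle:
        data = handle.read()
    for line_no, col, value in find_undefined_bytes(data):
        print(f"{path}:{line_no}:{col}: byte 0x{value:02X}")
```

Calling report() on a suspect file would list positions that can then be
visited by hand in the editor; the file itself is only read, never rewritten.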

Again, the problem is that if I (try to) edit a corrupted file as CP1252
with :e ++enc=cp1252, the bad bytes get silently replaced in the buffer with
question marks, which hides the problem rather than helping me find the bad
bytes.

If I use ‘tr’ to replace the illegal bytes with some kind of valid bytes,
then the problems are just hidden some other way.

If I try to edit a file as CP1252, using :e ++enc=cp1252, and the file
contains invalid bytes, then I need alarm bells to go off somehow.
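
As an illustration of the kind of "alarm bell" meant here, outside of Vim: a
minimal sketch, assuming Python 3, that decodes strictly and reports the
first offending byte instead of substituting anything.  It relies on Python's
own cp1252 codec, which rejects \x81, \x8D, \x8F, \x90 and \x9D; the function
name first_bad_offset is hypothetical.

```python
def first_bad_offset(data: bytes):
    """Return the 0-based offset of the first byte that is undefined in
    CP1252, or None if the whole input decodes cleanly."""
    try:
        # Strict decoding is the default: an undefined byte raises
        # UnicodeDecodeError rather than being replaced with '?'.
        data.decode("cp1252")
    except UnicodeDecodeError as err:
        return err.start
    return None
```

A None result means the data is legal CP1252; any integer result is a
position to inspect by hand.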

Looking at my .gvimrc file, I have the line

set fileencodings=ucs-bom,utf-8,iso-8859-1

I note that if I simply edit such a corrupted file without specifying
:e ++enc=cp1252, then apparently gvim goes through the list of fileencodings,
failing with ucs-bom, failing with utf-8, and then falling back to editing
the file as iso-8859-1.  The resulting edit buffer _retains_ any bad bytes,
displaying them as <81>, <8d>, <8f>, <90> and <9d>, which is helpful.

Perhaps the best I can do right now is to specify 

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1
        
I’ll try that for now.
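
The idea behind that setting can be sketched as a first-success fallback
chain.  The following is only a rough Python 3 analogy, not Vim's actual
detection code: it assumes that each candidate is tried in order and rejected
on a conversion error, it leaves out the ucs-bom check, and it uses Python's
codecs, whose treatment of undefined bytes may differ from Vim's converter.
The name pick_encoding is hypothetical.

```python
def pick_encoding(data: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes data without error,
    or None if every candidate fails."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # conversion failed, try the next candidate
        return name
    return None
```

Under these assumptions a clean CP1252 file is reported as cp1252, while a
file containing one of the undefined bytes falls through to latin-1, where
every byte value is accepted and the bad bytes stay visible.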

Thanks again,

Ken


********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA




