problem: editing a _corrupted_ CP1252 file

Kenneth Reid Beesley Fri, 22 Jan 2016 16:45:31 -0800

I have a number of 8-bit text files that _should_ be in CP1252, but they may
contain byte values that are undefined in CP1252, e.g. \x81, \x8D, \x8F, \x90
and \x9D.
I.e. these are potentially corrupted files that are mostly legal CP1252 and
should be legal CP1252, and I have to make them legal CP1252.
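
To make the notion of "illegal" concrete, here is a minimal Python sketch (outside Vim) that reports where the five undefined byte values occur in raw file contents. The function name find_undefined_bytes and the sample data are hypothetical, chosen only for illustration.

```python
# The five byte values that have no character assigned in CP1252.
UNDEFINED_CP1252 = frozenset(b"\x81\x8d\x8f\x90\x9d")


def find_undefined_bytes(data: bytes):
    """Return (offset, byte value) pairs for bytes undefined in CP1252."""
    return [(i, b) for i, b in enumerate(data) if b in UNDEFINED_CP1252]


# Made-up sample: mostly legal CP1252, with two undefined bytes
# and one genuine question mark that must not be flagged.
sample = b"caf\xe9 \x81 ok? \x9d"

for offset, value in find_undefined_bytes(sample):
    print(f"offset {offset}: \\x{value:02X}")
```

On a real file, the bytes would come from something like open(path, "rb").read(); the scan works on raw bytes, so genuine question marks are never confused with bad bytes.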


The Problem:  if I edit them as CP1252, the illegal bytes get converted into 
question-mark characters in the buffer.

        Background

My buffer ‘encoding’ is always UTF-8.  (I have to edit files in a number of 
different encodings, and
this usually works well.)

I have a little alias  gvim1252  set to

        gvim -c "e ++enc=cp1252"

so that invoking

$ gvim1252  filename.txt

loads filename.txt (let’s assume that it _should_ be CP1252) and effectively 
invokes the command

        :e ++enc=cp1252

telling gvim that the ‘fileencoding’ is (or at least should be) cp1252.

Inside the edit buffer (where the ‘encoding’ is UTF-8), any illegal byte values
from the original input file (such as \x81 and the four others listed above)
cannot be converted from CP1252 to UTF-8, because they are undefined in CP1252.
They are silently replaced with plain question-mark characters.

Even worse, if I then just write the buffer back out to file, the question 
marks in the buffer are
written to file as question marks.  I lose the information about the original 
bad bytes, and in my case, 
that’s dangerous behavior.  I need to easily find, evaluate, and fix such 
illegal characters during my editing.

        Desired Behavior

1.  When I edit a file that should be CP1252 (but might be corrupted with byte 
values
like \x81), and when I specify ++enc=cp1252, I’d like the bad byte values to be 
retained in the buffer, 
perhaps shown as highlighted

        <81>

or something else that stands out more than a plain question-mark character.  
These files can also
contain original question-marks that are supposed to be question marks.

2.  If I write the buffer back to file, I’d like any illegal bytes like <81> 
that I haven’t found/fixed
to be written back to file as they were originally.  (I understand that this 
might be problematic.)

3.  And, when I invoke ++enc=cp1252 on a corrupted file, perhaps I’d like some 
kind of error message telling me 
that the file was not in the indicated cp1252 encoding.  Even refusing to 
accept the ++enc command, for a corrupted
file, would be better than the current silent replacement of illegal bytes with 
question marks.
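
Purely to illustrate behaviors 1 and 2 (this is not a Vim feature or setting), here is a rough Python sketch of the round trip I have in mind. It assumes Python's standard "surrogateescape" error handler as the way to carry undefined bytes through decoding and encoding; the helper names load_cp1252, save_cp1252 and show are hypothetical.

```python
def load_cp1252(data: bytes) -> str:
    # Undefined bytes become lone surrogates U+DC80..U+DCFF instead of "?".
    return data.decode("cp1252", errors="surrogateescape")


def save_cp1252(text: str) -> bytes:
    # Escaped surrogates are turned back into their original byte values.
    return text.encode("cp1252", errors="surrogateescape")


def show(text: str) -> str:
    # Render each preserved bad byte as <XX> so it stands out from a real "?".
    return "".join(
        f"<{ord(ch) - 0xDC00:02X}>" if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


original = b"Is it \x81 ok?"      # made-up sample with one undefined byte
text = load_cp1252(original)
print(show(text))                 # the bad byte is displayed as <81>
assert save_cp1252(text) == original
```

The point of the sketch is only that a lossless round trip is possible in principle: the bad byte stays distinguishable from a genuine question mark and is written back unchanged unless I fix it.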

****  Any help getting the desired behavior would be much appreciated.

Ken


********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA





-- 
-- 
You received this message from the "vim_use" maillist.