> editing a _corrupted_ CP1252 file
> <http://groups.google.com/group/vim_use/t/d6874651567bc841?utm_source=digest&utm_medium=email>
>
> Kenneth Reid Beesley <[email protected]>: Jan 25 11:18AM -0700
>
> I know that some of my files are _supposed_ to be CP1252.
> But beforehand I don’t know whether or how they are corrupted. Usually the
> problem in a corrupted file is the presence of \x81, \x8D, \x8F, \x90 and/or
> \x9D bytes, which are illegal/undefined bytes in CP1252.
> The files are programs, so I need to zero in on each invalid byte (invalid
> for CP1252), figure out what’s going on, and edit it appropriately.
> So it needs to be done by hand. (There are not a lot of such bad characters.)
>
> Again, the problem is that if I (try to) edit a corrupted file as CP1252 with
> :e ++enc=cp1252, the bad bytes get silently replaced in the buffer with
> question marks, which hides the problem rather than helping me find the bad
> bytes.
> If I use ‘tr’ to replace the illegal bytes with some kind of valid bytes,
> then the problems are just hidden some other way.
> If I try to edit a file as CP1252, using :e ++enc=cp1252, and the file
> contains invalid bytes, then I need alarm bells to go off somehow.
>
>
> Erik Christiansen <[email protected]>: Jan 26 02:58PM +1100
>
> On 25.01.16 11:18, Kenneth Reid Beesley wrote:
> > (invalid for CP1252), figure what’s going on, and edit it
> > appropriately. So it needs to be done by hand. (There are not a lot
> > of such bad characters.)
>
> Ah, not simply remapping, then. For UTF-8, Vim has the "8g8" command, to
> hop to the next encoding violation. Unfortunately, there's no mention
> there of any ability to do that for CP1252.
>
> What happens if you have fenc=utf-8, open the cp1252 file, and press 8g8 ?
>
> Erik
Thanks again, Erik,

Here’s my usual .gvimrc setup for encodings:

" encoding used internally in the edit buffer
set encoding=utf-8
" default encoding for saving any new files created
setglobal fileencoding=utf-8
" when editing an existing file, try to read it in these encodings, use the first that succeeds
set fileencodings=ucs-bom,utf-8,latin1

***************

Here’s a little test. First I create a little file that has bytes for ‘a’, ‘b’, ‘c’ and \x81 (which has no graphic character in ISO-8859-1 (Latin 1) and is undefined in CP1252 (also known as Windows-1252)):

$ printf "\x61\x62\x63\x81" > corrupted.txt

I confirm that the bytes are as expected using

$ od -t x1 corrupted.txt

If I try to iconv the file from cp1252 to UTF-8

$ iconv -f cp1252 -t UTF-8 corrupted.txt > utf8.txt

iconv chokes appropriately on the \x81 byte, outputting the error message "iconv: corrupted.txt:1:2: cannot convert"

If I simply try to gvim the file, with the fileencodings as shown above,

$ gvim corrupted.txt

the fileencoding gets set to latin1 (the last option in fileencodings), and the offending \x81 byte gets displayed as <81> (in blue). The blue <81> represents a single byte, and the command 8g8 (which you suggested) moves the cursor to that byte.

That’s not bad. At least I can find the offending bytes in a corrupted file.

*********

Tests: Fiddling with fileencodings

\x81 is undefined in CP1252 and has no graphic character in ISO-8859-1
\x80 is _defined_ in CP1252 but has no graphic character in ISO-8859-1 (but see below; \x80 seems officially to be a legal but non-graphic "C1" control character in latin1)

I create another little test file:

$ printf "\x80\x61\x62\x63\x81" > corrupted2.txt

$ iconv -f latin1 -t utf-8 corrupted2.txt

This, unfortunately, works. Even though \x80 and \x81 are non-graphic bytes in Latin1, they are somehow considered valid, though little used, "C1" control bytes. I’ve never quite understood this. It makes it very hard to distinguish between ISO-8859-1 and CP1252.
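The same asymmetry can be reproduced outside iconv. What follows is only a minimal sketch in Python, not something I ran as part of the tests above. It assumes Python's built-in "latin-1" and "cp1252" codecs, which, as far as I know, behave like iconv here: latin-1 accepts every byte value, while cp1252 rejects \x81, \x8D, \x8F, \x90 and \x9D. The helper name decodes_as is made up for illustration.

```python
# Sketch: check which of the test byte strings decode under each encoding.
# decodes_as() is a hypothetical helper name, not part of Vim or iconv.

def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if every byte in data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Same bytes as the two test files created with printf above.
samples = {
    "corrupted.txt": b"\x61\x62\x63\x81",
    "corrupted2.txt": b"\x80\x61\x62\x63\x81",
}

for name, data in samples.items():
    for encoding in ("latin-1", "cp1252"):
        print(name, encoding, decodes_as(data, encoding))
```

Because latin-1 never fails, a successful latin-1 decode says nothing about whether a file was meant to be CP1252; only the cp1252 decode can raise an error.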
$ iconv -f cp1252 -t utf-8 corrupted2.txt

chokes (appropriately) on the \x81 byte, which is not defined in cp1252.

$ iconv -f latin1 -t utf-8 corrupted2.txt

works without complaint. Sigh.

If I change fileencodings to

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

and

$ gvim corrupted2.txt

the fileencoding gets set to latin1, but with <80> and <81> displayed in blue (representing single bytes) in the edit buffer. The blue <HH> notation seems to indicate that the byte is non-graphic. There is no font glyph assigned to it. I can use 8g8 to find the bad bytes displayed in blue as <HH>.

If I change fileencodings to

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1

and

$ gvim corrupted2.txt

then the fileencoding is again detected (or defaulted to) latin1, with the <80> and <81> displayed in blue.

*******

testing with a valid cp1252 file

$ printf "\x80\x61\x62\x63" > cp1252.txt

$ gvim cp1252.txt

brings up the file without any blue <HH> bytes. The file is detected with fileencoding cp1252.

If I then change fileencodings to

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

and open the cp1252.txt file

$ gvim cp1252.txt

the encoding is detected as latin1 (because iso-8859-1 gets tried before cp1252, and it succeeds because the value \x80 is a legal "C1" but non-graphic control byte in Latin1).

**********

Conclusions:

It’s rather hard to test if a file is Latin1 vs CP1252 because Latin1 does allow non-graphic "C1" control bytes in the range \x80 through \x9F.

It seems worthless to set

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

because any valid cp1252 file (or even a file intended to be cp1252 but containing illegal, for cp1252, bytes like \x81) will succeed as iso-8859-1, and the fileencoding will be assigned as iso-8859-1 (latin1).

Setting

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1

will cause ‘gvim’ to edit a file with fileencoding cp1252 if the file contains bytes in the \x80 to \x9F range that are legal for cp1252.
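Since no ordering of fileencodings flags the undefined bytes, one possible workaround is to check a file outside Vim before opening it. The following is a rough Python sketch under my own assumptions, not a Vim or iconv feature: the names UNDEFINED_CP1252, find_undefined_cp1252 and report are hypothetical, and columns are counted in bytes, which matches character columns only while the file really is single-byte text.

```python
# Sketch: list every byte that CP1252 leaves undefined, with line and column,
# so each one can be inspected by hand before editing with ++enc=cp1252.
# All names below are hypothetical helpers, not part of Vim or iconv.

UNDEFINED_CP1252 = frozenset(b"\x81\x8d\x8f\x90\x9d")

def find_undefined_cp1252(data: bytes):
    """Yield (line, column, byte) for each CP1252-undefined byte, 1-based."""
    line, col = 1, 0
    for byte in data:
        col += 1
        if byte in UNDEFINED_CP1252:
            yield line, col, byte
        if byte == 0x0A:  # newline: the next byte starts a new line
            line, col = line + 1, 0

def report(path: str) -> int:
    """Print one line per undefined byte in the file; return how many."""
    with open(path, "rb") as handle:
        hits = list(find_undefined_cp1252(handle.read()))
    for line, col, byte in hits:
        print(f"{path}:{line}:{col}: undefined CP1252 byte \\x{byte:02X}")
    return len(hits)
```

A non-zero return from report("corrupted2.txt") would be the alarm bell: the file should not be trusted as CP1252 until each reported position has been looked at.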
What’s dangerous (for me) is invoking

$ gvim -c "e ++enc=cp1252" filethatshouldbecp1252.txt

on a file that should be cp1252 but might contain illegal bytes \x81, \x8D, \x8F, \x90 and \x9D, because if such undefined bytes do appear in the file, they get silently converted to question-mark characters.

Ideally, alarm bells would go off. This should fail like iconv does when told to convert a file as cp1252 when it isn’t valid cp1252. At least the illegal/undefined bytes should be displayed in blue as <81> or whatever.

What’s dangerous for me is the silent conversion of invalid characters like \x81 to question marks.

Thanks again,

Ken

********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA
