On 22/05/09 21:56, Raúl Núñez de Arenas Coronado wrote:
>
> Terve Tuomas :)
>
> On Fri 22 May 2009 21:35 +0200, Tuomas Pyyhtiä <[email protected]> dixit:
>> On Fri, 22 May 2009 21:07:15 +0300, Raúl Núñez de Arenas 
>> Coronado <[email protected]> wrote:
>>
>> Terve Raúl!
>
> Thanks! I didn't know how to say "hello" in Finnish (I had to Google
> "terve") :)))))

Buenos días, hyvää päivää. As you can see, my delay in reading email is 
not getting any better.

>
>> And for the time being, I'm going to add cp1250 to filencodings in my
>> .gvimrc and make a mental note it's there.
>
> Remember, as long as "cp1250" is *before* latin1 in fencs, it is like
> latin1 is not even there, since cp1250 will always succeed.

Exactly, so it's better to have at most one 8-bit encoding in 
'fileencodings' (plural), and only in the last position, because 
anything after the first 8-bit encoding will never even be tried.
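To illustrate the fall-through (a Python sketch, not Vim's actual code; the detect() helper is my own invention):

```python
# Why an encoding listed after an always-succeeding 8-bit encoding in
# 'fileencodings' is dead weight: latin1 can decode *any* byte sequence
# without error, so detection never falls through to the next candidate.
def detect(raw: bytes, candidates: list) -> str:
    """Mimic the fall-through: return the first encoding that decodes cleanly."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "binary"

# A UTF-8 encoded 'š' (bytes 0xC5 0xA1) is mis-detected as soon as an
# 8-bit encoding comes first, because latin1 happily decodes those
# two bytes too (as 'Å¡').
raw = "š".encode("utf-8")
print(detect(raw, ["latin1", "utf-8"]))   # latin1 wins -- utf-8 never tried
print(detect(raw, ["utf-8", "latin1"]))   # utf-8 tried first and succeeds
```

So the only safe place for an 8-bit encoding is the very end of the list, as a last resort.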

>
>>> The problem is that, in that Vim session, if you open a latin1 file
>>> it will be opened using cp1250, so I prefer the first method I told
>>> you, using ":e ++enc".
>>
>> And I thought all the time this *was* a latin-1 encoded file (as this
>> is the encoding my friend uses, and I think Vim win32 binaries are
>> compiled with that encoding enabled by default.)
>
> I've been there, too: I've got many files from friends with 0x92 in them
> (I think it is a single quote) but I insisted on opening them as
> "latin1", which they weren't. I almost never think of cp1250 :(

When you ask for Latin1 on Windows, what you get is usually not Latin1 
but Windows-1252. The difference is that bytes 0x80 to 0x9F are 
non-printable control characters in Latin1, while in Windows-1252 they 
are additional printable characters. Everything else (0x00 to 0x7F and 
0xA0 to 0xFF) is identical in the two encodings.
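You can see that 0x92 byte in action from any scripting language; here is a Python sketch:

```python
# The 0x92 byte Raúl mentioned: in Windows-1252 it is a right single
# quotation mark; Latin1 assigns that slot to an (invisible) C1 control.
b = b"\x92"
print(b.decode("cp1252"))          # ’  (U+2019 RIGHT SINGLE QUOTATION MARK)
print(repr(b.decode("latin-1")))   # '\x92' -- a control character, not printable

# Outside 0x80-0x9F the two encodings agree byte for byte.
same = all(bytes([i]).decode("cp1252") == bytes([i]).decode("latin-1")
           for i in list(range(0x80)) + list(range(0xA0, 0x100)))
print(same)                        # True
```

Which is exactly why a cp1252 file full of curly quotes "works" as latin1 right up until one of those quotes shows up as garbage.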

>
>> Few questions: How am I able in the future to define what encoding the
>> file uses? i.e. How did you see what encoding to enforce? Do I have to
>> make my best guess and blindly enforce different encodings and
>> proofread the file every time, or what's the best approach to solving
>> encoding horrors if I ever get into them again?
>
> If you mean forcing the encoding *before* reading the file, then ":e
> ++enc=encoding" within vim, and the trick using "--cmd" are the only
> solutions I know.
>
> If you mean *changing* the encoding, which in turn means *converting* the
> file, you can use ":set fenc=encoding", but please note that this will
> convert the file from the detected encoding (latin1 in your example) to
> encoding "encoding". For example, converting your example using ":set
> fenc=utf-8" won't work as expected, because it will convert from latin1
> to utf-8, not from cp1250 to utf-8.
>
> On the other hand, if you know the encoding in advance and don't want to
> use the "--cmd" trick or ":e ++enc" (I must confess I forget this one
> almost always), then you can use "iconv". That's what Vim uses
> internally (well, in library form I think) and works perfectly.
>
> I'm by no means a Vim encoding expert, and I had a hard time back when I
> started to use Vim with "encoding" and "termencoding", because I worked
> with utf-8 files but I have a latin1 Linux system. This said, feel free
> to ask whatever you need about this issue, and let's see if I know how
> to solve it O:) Fortunately, people in this list are very clever, so if
> I make a mistake someone will correct me.
>

Raúl is right, if you know the encoding of a file, and it is not what 
Vim would guess by means of your 'fileencodings', then you should open 
it with the ++enc=something modifier _between_ the :e (or :view etc.) 
command and the path/filename.
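Raúl's warning about ":set fenc" converting from the *detected* encoding rather than the real one is easy to reproduce outside Vim. A Python sketch (the byte values are just illustrative):

```python
# 'š' in cp1250 is the single byte 0x9A.
raw = "š".encode("cp1250")

# Wrong: the editor detected latin1, so "convert to utf-8" converts
# latin1 -> utf-8.  Byte 0x9A in latin1 is the C1 control U+009A,
# which becomes 0xC2 0x9A in UTF-8 -- the 'š' is gone.
wrong = raw.decode("latin-1").encode("utf-8")
print(wrong)        # b'\xc2\x9a'

# Right: tell the editor the true source encoding first (the job of
# ":e ++enc=cp1250"), then convert.  'š' is U+0161, i.e. 0xC5 0xA1 in UTF-8.
right = raw.decode("cp1250").encode("utf-8")
print(right)        # b'\xc5\xa1'
```

Same input bytes, two very different UTF-8 files, and only the second one still contains the character the sender typed.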

If you want to guess the encoding of some mysterious file which you got 
from some untrustworthy source, then AFAIK there are only the following 
rules:

1) If the file is in Unicode with BOM, your 'fileencodings' (plural) 
option starts with "ucs-bom", and you've set 'encoding' to utf-8 in your 
vimrc, then if you don't use the ++enc modifier, Vim will detect the 
'fileencoding' (singular) correctly, and set 'bomb' on (which, in this 
case, is the right thing to do).

2) If the whole file doesn't contain a single byte above 0x7F, then it 
is probably in US-ASCII, which is a common subset of UTF-8, Latin1, and 
quite a number of others; but I said "probably", not "certainly", so you 
should check this, and fall back on the next rule if this one doesn't work.

3) In all other cases, only trial and error will serve you: try (with 
++enc) what you think is most likely, see if it looks reasonable, and if 
not, try something else until it does.
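The three rules above can be sketched as a little guessing helper. This is my own toy, not Vim's actual detection algorithm (which is considerably more involved), but the order of the checks is the same:

```python
# Rough sketch of the three rules above -- not Vim's real algorithm.
BOMS = {
    b"\xef\xbb\xbf": "utf-8",       # UTF-8 BOM
    b"\xff\xfe": "utf-16le",
    b"\xfe\xff": "utf-16be",
}

def guess(raw: bytes, fallbacks=("utf-8", "cp1250", "latin-1")) -> str:
    # Rule 1: a Byte Order Mark identifies Unicode files outright.
    for bom, enc in BOMS.items():
        if raw.startswith(bom):
            return enc
    # Rule 2: no byte above 0x7F -> probably plain US-ASCII.
    if all(b < 0x80 for b in raw):
        return "us-ascii"
    # Rule 3: trial and error -- a clean decode is only a *candidate*;
    # the human still has to eyeball the result.
    for enc in fallbacks:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

print(guess(b"\xef\xbb\xbfhello"))  # utf-8    (rule 1)
print(guess(b"hello"))              # us-ascii (rule 2)
print(guess(b"hi \xc5\xa1"))        # utf-8    (rule 3)
print(guess(b"caf\xe9"))            # cp1250 -- but could just as well be latin1
```

The last line is the whole problem in miniature: a lone 0xE9 decodes fine as cp1250, cp1252 *and* latin1, so no amount of automation can tell you which one the author meant.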

Vim doesn't need iconv to convert between UTF-8 and Latin1 because that 
particular conversion is trivial: characters 0x00 to 0xFF of Latin1 
correspond respectively to Unicode codepoints U+0000 to U+00FF. For 
other conversions, IIUC Unix multibyte Vim is usually compiled with 
+multi_byte +iconv and uses an iconv library (static or shareable I'm 
not sure), while Windows multibyte Vim usually has +multi_byte_ime/dyn 
+iconv/dyn and uses a shareable iconv.dll or libiconv.dll library if it 
can find it; if it can't, some conversions will fail.
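That one-to-one Latin1/Unicode correspondence is easy to verify, and it also shows what an iconv-style latin1-to-UTF-8 conversion amounts to (Python sketch):

```python
# Every Latin1 byte value maps to the Unicode codepoint of the same
# number, which is why the Latin1 <-> UTF-8 conversion needs no table.
assert all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))

# The moral equivalent of `iconv -f latin1 -t utf-8` in two lines:
latin1_bytes = bytes(range(0xA0, 0x100))   # the printable top half of Latin1
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
print(len(latin1_bytes), len(utf8_bytes))  # 96 192 -- each char is now 2 bytes
```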


See also http://vim.wikia.com/Working_with_Unicode


Best regards,
Tony.
-- 
Anthony's Law of the Workshop:
        Any tool when dropped, will roll into the least accessible
        corner of the workshop.

Corollary:
        On the way to the corner, any dropped tool will first strike
        your toes.

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---