Christian Ebert wrote:
Hi Tony,
* A.J.Mechelynck on Saturday, September 23, 2006 at 09:57:40 +0200:
Christian Ebert wrote:
Is it possible to have eg. iso-8859-1 encoded words/passages in
an otherwise utf-8 encoded file? I mean, w/o automatic
conversion, and I don't need the iso passages displayed in a
readable way, but so I can still write the file in utf-8 w/o
changing the "invalid" iso-8859-1 chars?
Hm, hope I made myself clear.
Hm, I probably didn't.
<snip detailed explanation with bleeding heart ;)>
Corollary of the conclusion:
#1.
cat file1.utf8.txt file2.latin1.txt file3.utf8.txt > file99.utf8.txt
will produce invalid output unless the Latin1 input file is actually 7-bit
US-ASCII. This is not a limitation of the "cat" program (which inherently
never translates anything) but a false manoeuvre on the part of the user.
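To see why at the byte level, here is a minimal sketch in Python (not from the original mail; the data is hypothetical) of what the "cat" concatenation produces:

```python
# What "cat file1.utf8.txt file2.latin1.txt ..." does: raw byte concatenation,
# with no translation of either part.
utf8_part = "Vögel\n".encode("utf-8")      # b'V\xc3\xb6gel\n'
latin1_part = "Vögel\n".encode("latin-1")  # b'V\xf6gel\n'

combined = utf8_part + latin1_part

# The Latin1 byte 0xF6 is not a valid UTF-8 sequence, so the combined
# "file" is not well-formed UTF-8.
try:
    combined.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)  # False
```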
Hm, I want illegal stuff, hehe.
Then don't use UTF-8 files.
#2.
gvim
:if &tenc == "" | let &tenc = &enc | endif
:set enc=utf-8 fencs=ucs-bom,utf-8,latin1
:e ++enc=utf-8 file1.utf8.txt
:$r ++enc=latin1 file2.latin1.txt
:$r ++enc=utf-8 file3.utf8.txt
:saveas file99.utf8.txt
Then file99.utf8.txt is the same as the one produced with the
cat command. Which is actually what I want.
No. It is what the one produced with the cat command should have been, with
the Latin1 accented characters properly converted to UTF-8.
*But*:
Vim insists on converting the displayed text to latin1. What I
want is to have the contents displayed in utf-8 with a few
illegal characters in latin1.
With 'encoding' set to UTF-8, gvim displays all text in UTF-8. Take as example
a UTF-8 file with non-Latin1 characters, such as my homepage
http://users.skynet.be/antoine.mechelynck/index.htm and you'll see the
difference, if your 'guifont' has the necessary glyphs.
Now I get:
#v+
Vögel <- utf-8
Vögel <- latin1
#v-
because Vim automatically converts to latin1. Whereas I'd like to
have it the other way round: with "Vögel" displayed as garbage,
but I can continue editing the file in _utf-8_.
Is this possible in *G*vim? (I don't have the GUI installed)
Yes, it is possible in gvim, which is the GUI. In non-GUI Vim, what you see
depends on the locale or codepage used by your terminal or console emulator.
Console Vim has no control over this.
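A small Python sketch (hypothetical data, not from the original mail) of why the same bytes look different depending on the terminal's charset:

```python
# The same UTF-8 bytes, as a UTF-8 terminal and a Latin1 terminal
# would each interpret them.
raw = "Vögel".encode("utf-8")  # b'V\xc3\xb6gel'

print(raw.decode("utf-8"))    # Vögel   -- terminal in a UTF-8 locale
print(raw.decode("latin-1"))  # VÃ¶gel  -- Latin1 terminal shows mojibake
```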
I recommend that you install a GUI version of Vim for every serious work with
UTF-8; or else configure your OS (if possible) to use a UTF-8 locale in its
console terminal, even outside of Vim. Not all OS versions and flavours allow
the latter, however, so I recommend that you get a version of gvim.
Example snippet from a fictitious LaTeX-file to show the purpose
(or to increase confusion):
#v+
The main part of the file is in utf-8 encoding and contains
non-ascii characters.
Then, I want to typeset, say, one word in \emph{spaced} small
caps. The \LaTeX-package that does this is not capable of parsing
utf-8 input, so this single word has to be in latin1 in case it
contains non-ascii chars:
\begingroup\inputencoding{latin1}
\caps{V?gel}
\endgroup
to use the above example (with ``?'' for garbage).
\caps{V\"ogel} gives orthographically correct output but I lose
the kerning of the font.
The above example \emph{works,} but the main part of the file is
displayed in ``dissected'' utf-chars.
Is it possible to have this the other way round without automatic
conversion to latin1?
#v-
You can't mix, in a single file, UTF-8 represented as UTF-8 and upper-half
Latin1 represented as single bytes, and expect it to work. There are two
solutions, depending on the type of file and on what your programs accept:
a) UTF-8 solution: Convert the Latin parts to UTF-8. It might be useful to
have a BOM at the start of the file (by means of ":setlocal bomb"), it helps
many programs to understand that the file is in Unicode. Of course
non-Unicode-enabled programs won't understand the file; and gvim will always
be able to display it (if you can find a proper 'guifont'), but Console Vim
can only display whichever codepoints are available in whatever charset the
underlying terminal is using. And you have to use the whole snippet (quoted
above) that I gave you, not just parts of it. In particular, if you don't
"remember" the old value of 'encoding' in 'termencoding' before changing the
former, your keyboard (and, in Console Vim, your display) may start acting
weirdly.
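For illustration, a minimal Python sketch (hypothetical data, not from the original mail) of what ":setlocal bomb" adds to the file:

```python
import codecs

# With 'bomb' set, Vim prefixes the file with the UTF-8 byte order
# mark EF BB BF, which many programs use to recognize Unicode.
data = codecs.BOM_UTF8 + "Vögel\n".encode("utf-8")
print(data[:3].hex())  # efbbbf
```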
b) ASCII solution: In languages that accept it, represent everything above
U+00FF in the _whole_ file by code sequences which represent the Unicode
codepoints by means of ASCII. For instance, in HTML you can represent "Vögel"
by "V&#246;gel" and "Говорите ли вы по-русски?" by
"&#1043;&#1086;&#1074;&#1086;&#1088;&#1080;&#1090;&#1077; &#1083;&#1080;
&#1074;&#1099; &#1087;&#1086;-&#1088;&#1091;&#1089;&#1089;&#1082;&#1080;?".
Granted, the latter is not really human-readable until you get it through a
browser. I don't know TeX, but I guess this is what you call "dissected UTF-8
chars".
BTW, I have the opposite problem from yours: I can't write bash scripts in
Latin1 with âàéêèîôûù accents, Çç cedillas, äöü umlauts, ëï diaereses, «French
quotes» etc., because here, bash thinks every script is UTF-8. Damn POSIX for
deciding on their own high authority that all files on any computer MUST be in
the same charset! As if that lowered barriers to file exchange between
computers!
Best regards,
Tony.