Christian Ebert wrote:
Hi Tony,

* A.J.Mechelynck on Saturday, September 23, 2006 at 09:57:40 +0200:
Christian Ebert wrote:
Is it possible to have eg. iso-8859-1 encoded words/passages in
an otherwise utf-8 encoded file? I mean, without automatic
conversion, and I don't need the iso passages displayed in a
readable way, but so I can still write the file in utf-8 w/o
changing the "invalid" iso-8859-1 chars?

Hm, hope I made myself clear.

Hm, I probably didn't.

<snip detailed explanation with bleeding heart ;)>

Corollary of the conclusion:

#1.
cat file1.utf8.txt file2.latin1.txt file3.utf8.txt > file99.utf8.txt

will produce invalid output unless the Latin1 input file is actually 7-bit US-ASCII. This is not a limitation of the "cat" program (which inherently never translates anything) but a false manoeuvre on the part of the user.
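The corollary can be seen concretely with a few lines of Python (a sketch; the byte values are standard, only the sample word is mine):

```python
# Sketch: `cat` concatenates raw bytes without any translation.
# A Latin1-encoded "ö" is the single byte 0xF6, which is not valid UTF-8.
utf8_part = "Vögel\n".encode("utf-8")     # b'V\xc3\xb6gel\n'
latin1_part = "Vögel\n".encode("latin1")  # b'V\xf6gel\n'

combined = utf8_part + latin1_part  # byte-level concatenation, like cat

try:
    combined.decode("utf-8")
    print("valid UTF-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc)
```

The stray 0xF6 byte starts what UTF-8 expects to be a multi-byte sequence, so the decode fails exactly where the Latin1 part begins.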

Hm, I want illegal stuff, hehe.

Then don't use UTF-8 files.


#2.
gvim
        :if &tenc == "" | let &tenc = &enc | endif
        :set enc=utf-8 fencs=ucs-bom,utf-8,latin1
        :e ++enc=utf-8 file1.utf8.txt
        :$r ++enc=latin1 file2.latin1.txt
        :$r ++enc=utf-8 file3.utf8.txt
        :saveas file99.utf8.txt

Then file99.utf8.txt is the same as the one produced with the
cat command. Which is actually what I want.

No. It is what the one produced with the cat command should have been, with the Latin1 accented characters properly converted to UTF-8.


*But*:

Vim insists on converting the displayed text to latin1. What I
want is to have the contents displayed in utf-8 with a few
illegal characters in latin1.

With 'encoding' set to UTF-8, gvim displays all text in UTF-8. Take as an example a UTF-8 file with non-Latin1 characters, such as my homepage http://users.skynet.be/antoine.mechelynck/index.htm, and you'll see the difference, provided your 'guifont' has the necessary glyphs.


Now I get:

#v+
Vögel <- utf-8

Vögel  <- latin1
#v-

because Vim automatically converts to latin1. Whereas I'd like to
have it the other way round: with "Vögel" displayed as garbage,
but I can continue editing the file in _utf-8_.

Is this possible in *G*vim? (I don't have the GUI installed)

Yes, it is possible in gvim, which is the GUI. In non-GUI Vim, what you see depends on the locale or codepage used by your terminal or console emulator. Console Vim has no control over this.

I recommend that you install a GUI version of Vim for any serious work with UTF-8; or else configure your OS (if possible) to use a UTF-8 locale in its console terminal, even outside of Vim. Not all OS versions and flavours allow the latter, however, which is why I recommend gvim.


Example snippet from a fictitious LaTeX-file to show the purpose
(or to increase confusion):

#v+
The main part of the file is in utf-8 encoding and contains
non-ascii characters.

Then, I want to typeset, say, one word in \emph{spaced} small
caps. The \LaTeX-package that does this is not capable of parsing
utf-8 input, so this single word has to be in latin1 if it
contains non-ascii chars:

\begingroup\inputencoding{latin1}
\caps{V?gel}
\endgroup

to use the above example (with ``?'' for garbage).

\caps{V\"ogel} gives orthographically correct output but I lose
the kerning of the font.

The above example \emph{works,} but the main part of the file is
displayed in ``dissected'' utf-chars.

Is it possible to have this the other way round without automatic
conversion to latin1?
#v-


You can't mix, in a single file, UTF-8 represented as UTF-8 and upper-half Latin1 represented as single bytes, and expect it to work. There are two solutions, depending on the type of file and on what your programs accept:

a) UTF-8 solution: Convert the Latin1 parts to UTF-8. It might be useful to have a BOM at the start of the file (by means of ":setlocal bomb"); it helps many programs understand that the file is in Unicode. Of course, non-Unicode-enabled programs won't understand the file. Gvim will always be able to display it (if you can find a proper 'guifont'), but console Vim can only display whichever codepoints are available in whatever charset the underlying terminal is using. And you have to use the whole snippet quoted above, not just parts of it: in particular, if you don't "remember" the old value of 'encoding' in 'termencoding' before changing the former, your keyboard (and, in console Vim, your display) may start acting weirdly.
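What ":setlocal bomb" prepends can be sketched with Python's standard "utf-8-sig" codec (the sample word is mine; the three BOM bytes are fixed by the Unicode standard):

```python
# Sketch: 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF) on encoding,
# the same three bytes that Vim's ':setlocal bomb' puts at the start.
data = "Vögel\n".encode("utf-8-sig")
print(data[:3] == b"\xef\xbb\xbf")  # the BOM marks the file as Unicode
print(data[3:].decode("utf-8"))     # the rest is plain UTF-8
```

Programs that sniff the first bytes of a file use exactly this signature to decide the file is UTF-8 rather than Latin1.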

b) ASCII solution: In languages that accept it, represent everything above U+007F in the _whole_ file by code sequences which express the Unicode codepoints by means of ASCII. For instance, in HTML you can represent "Vögel" by "V&ouml;gel" and "Говорите ли вы по-русски?" by "&#1043;&#1086;&#1074;&#1086;&#1088;&#1080;&#1090;&#1077; &#1083;&#1080; &#1074;&#1099; &#1087;&#1086;-&#1088;&#1091;&#1089;&#1089;&#1082;&#1080;?". Granted, the latter is not really human-readable until you get it through a browser. I don't know TeX, but I guess this is what you call "dissected" UTF-8 chars.
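The numeric-reference form need not be typed by hand; a sketch in Python using the standard 'xmlcharrefreplace' error handler (the helper name is mine):

```python
# Sketch: escape every non-ASCII character as an HTML numeric
# character reference, leaving pure ASCII untouched.
def to_ascii_refs(text):
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii_refs("Vögel"))  # → V&#246;gel  (ö is U+00F6, decimal 246)
```

This produces numeric references rather than named entities like &ouml;, but browsers treat the two identically.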

BTW, I have the opposite problem from yours: I can't write bash scripts in Latin1 with âàéêèîôûù accents, Çç cedillas, äöü umlauts, ëï diaereses, «French quotes» etc., because here, bash thinks every script is UTF-8. Damn POSIX for deciding on its own high authority that all files on any computer MUST be in the same charset! So much for lowering barriers to file exchange between computers.


Best regards,
Tony.
