Christian Ebert wrote:
Hi Tony,
* A.J.Mechelynck on Saturday, September 23, 2006 at 09:57:40 +0200:
Christian Ebert wrote:
Is it possible to have eg. iso-8859-1 encoded words/passages in
an otherwise utf-8 encoded file? I mean, w/o automatic
conversion, and I don't need the iso passages displayed in a
readable way, but so I can still write the file in utf-8 w/o
changing the "invalid" iso-8859-1 chars?
Hm, hope I made myself clear.
Hm, I probably didn't.
<snip detailed explanation with bleeding heart ;)>
Corollary of the conclusion:
#1.
cat file1.utf8.txt file2.latin1.txt file3.utf8.txt > file99.utf8.txt
will produce invalid output unless the Latin1 input file is actually 7-bit
US-ASCII. This is not a limitation of the "cat" program (which inherently
never translates anything) but a false manoeuvre on the part of the user.
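To see why at the byte level, here is a minimal sketch in Python (not from the original mail; the data is hypothetical) of what the "cat" concatenation produces:

```python
# What "cat file1.utf8.txt file2.latin1.txt ..." does: raw byte concatenation,
# with no translation of either part.
utf8_part = "Vögel\n".encode("utf-8")      # b'V\xc3\xb6gel\n'
latin1_part = "Vögel\n".encode("latin-1")  # b'V\xf6gel\n'

combined = utf8_part + latin1_part

# The Latin1 byte 0xF6 is not a valid UTF-8 sequence, so the combined
# "file" is not well-formed UTF-8.
try:
    combined.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)  # False
```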
Hm, I want illegal stuff, hehe.
Then don't use UTF-8 files.
#2.
gvim
:if &tenc == "" | let &tenc = &enc | endif
:set enc=utf-8 fencs=ucs-bom,utf-8,latin1
:e ++enc=utf-8 file1.utf8.txt
:$r ++enc=latin1 file2.latin1.txt
:$r ++enc=utf-8 file3.utf8.txt
:saveas file99.utf8.txt
Then file99.utf8.txt is the same as the one produced with the
cat command. Which is actually what I want.
No. It is what the one produced with the cat command should have been, with
the Latin1 accented characters properly converted to UTF-8.
*But*:
Vim insists on converting the displayed text to latin1. What I
want is to have the contents displayed in utf-8 with a few
illegal characters in latin1.
With 'encoding' set to UTF-8, gvim displays all text in UTF-8. Take as example
a UTF-8 file with non-Latin1 characters, such as my homepage
http://users.skynet.be/antoine.mechelynck/index.htm and you'll see the
difference, if your 'guifont' has the necessary glyphs.
Now I get:
#v+
Vögel <- utf-8
Vögel <- latin1
#v-
because Vim automatically converts to latin1. Whereas I'd like to
have it the other way round: with "Vögel" displayed as garbage,
but I can continue editing the file in _utf-8_.
Is this possible in *G*vim? (I don't have the GUI installed)
Yes, it is possible in gvim, which is the GUI. In non-GUI Vim, what you see
depends on the locale or codepage used by your terminal or console emulator.
Console Vim has no control over this.
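A small Python sketch (hypothetical data, not from the original mail) of why the same bytes look different depending on the terminal's charset:

```python
# The same UTF-8 bytes, as a UTF-8 terminal and a Latin1 terminal
# would each interpret them.
raw = "Vögel".encode("utf-8")  # b'V\xc3\xb6gel'

print(raw.decode("utf-8"))    # Vögel   -- terminal in a UTF-8 locale
print(raw.decode("latin-1"))  # VÃ¶gel  -- Latin1 terminal shows mojibake
```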
I recommend that you install a GUI version of Vim for every serious work with
UTF-8; or else configure your OS (if possible) to use a UTF-8 locale in its
console terminal, even outside of Vim. Not all OS versions and flavours allow
the latter, however, so I recommend that you get a version of gvim.
Example snippet from a fictitious LaTeX-file to show the purpose
(or to increase confusion):
#v+
The main part of the file is in utf-8 encoding and contains
non-ascii characters.
Then, I want to typeset, say, one word in \emph{spaced} small
caps. The \LaTeX-package that does this is not capable of parsing
utf-8 input, so this single word has to be in latin1 in case it
contains non-ascii chars:
\begingroup\inputencoding{latin1}
\caps{V?gel}
\endgroup
to use the above example (with ``?'' for garbage).
\caps{V\"ogel} gives orthographically correct output but I lose
the kerning of the font.
The above example \emph{works,} but the main part of the file is
displayed in ``dissected'' utf-chars.
Is it possible to have this the other way round without automatic
conversion to latin1?
#v-
You can't mix, in a single file, UTF-8 represented as UTF-8 and upper-half
Latin1 represented as single bytes, and expect it to work. There are two
solutions, depending on the type of file and on what your programs accept:
a) UTF-8 solution: Convert the Latin parts to UTF-8. It might be useful to
have a BOM at the start of the file (by means of ":setlocal bomb"), it helps
many programs to understand that the file is in Unicode. Of course
non-Unicode-enabled programs won't understand the file; and gvim will always
be able to display it (if you can find a proper 'guifont'), but Console Vim
can only display whichever codepoints are available in whatever charset the
underlying terminal is using. And you have to use the whole snippet (quoted
above) that I gave you, not just parts of it. In particular, if you don't
"remember" the old value of 'encoding' in 'termencoding' before changing the
former, your keyboard (and, in Console Vim, your display) may start acting
weirdly.
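For illustration, a minimal Python sketch (hypothetical data, not from the original mail) of what ":setlocal bomb" adds to the file:

```python
import codecs

# With 'bomb' set, Vim prefixes the file with the UTF-8 byte order
# mark EF BB BF, which many programs use to recognize Unicode.
data = codecs.BOM_UTF8 + "Vögel\n".encode("utf-8")
print(data[:3].hex())  # efbbbf
```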
b) ASCII solution: In languages that accept it, represent everything above
U+00FF in the _whole_ file by code sequences which represent the Unicode
codepoints by means of ASCII. For instance, in HTML you can represent "Vögel"
by "V&#246;gel" and "Говорите ли вы по-русски?" by
"&#1043;&#1086;&#1074;&#1086;&#1088;&#1080;&#1090;&#1077; &#1083;&#1080;
&#1074;&#1099; &#1087;&#1086;-&#1088;&#1091;&#1089;&#1089;&#1082;&#1080;?".
Granted, the latter is not really human-readable until you get it through a
browser. I don't know TeX, but I guess this is what you call "dissected UTF-8
chars".
BTW, I have the opposite problem from yours: I can't write bash scripts in
Latin1 with âàéêèîôûù accents, Çç cedillas, äöü umlauts, ëï diaereses, «French
quotes» etc., because here, bash thinks every script is UTF-8. Damn POSIX for
deciding on their own high authority that all files on any computer MUST be in
the same charset! As if that lowered barriers to file exchange between
computers!
Best regards,
Tony.