Re: More than 'fileencodings': MultiEnc.vim and TellEnc

A.J.Mechelynck Sat, 24 Feb 2007 08:48:16 -0800

Yongwei Wu wrote:

The Vim option 'fileencodings' has some limitations: e.g., it cannot
autodetect GBK and Big5 files at the same time. That was my first
motivation to develop a solution for it. It has two parts: a generic
C++ program to decide the encoding of a file, and a Vim plugin to use
this program.


The program tellenc tells the encoding of a file according to the following:

- Presence of any BOM character: The Unicode encoding of the BOM

Don't forget to test UTF-32 before UTF-16 because of the ambiguity between
FF FE 00 00 (UTF-32le) and FF FE (UTF-16le).
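
A minimal sketch of that ordering, in Python rather than tellenc's C++, and
not taken from tellenc itself. The function name detect_bom and the returned
encoding labels are assumptions made for illustration; the BOM constants come
from Python's standard codecs module.

```python
import codecs

# Longer BOMs come first: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

With the UTF-16 entries listed first, every UTF-32le file would be reported
as UTF-16le, because FF FE would match before the two NUL bytes were looked at.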

- Absence of non-ASCII characters: ascii
- UTF-8 decodable: utf-8
- Uneven distribution of NULs in odd and even positions of the file: utf-16(le)
- Strange characters and not a Unicode encoding decided above: binary

Hm, yes, maybe a "sufficiently high" proportion of bytes in the range 00-1F
other than carriage-return and line-feed.
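
Something along these lines, as a rough Python sketch. The name looks_binary
and the 5% threshold are assumptions; what counts as "sufficiently high" would
have to be tuned against real files.

```python
def looks_binary(data: bytes, threshold: float = 0.05) -> bool:
    """Guess 'binary' when too many bytes are control bytes in 00-1F,
    not counting carriage return (0D) and line feed (0A)."""
    if not data:
        return False
    allowed = {0x0A, 0x0D}
    controls = sum(1 for b in data if b < 0x20 and b not in allowed)
    # threshold is a hypothetical cut-off, not a value used by tellenc
    return controls / len(data) > threshold
```

Tab (09) and form feed (0C) also turn up in ordinary text files, so they may
deserve a place in the allowed set as well.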

- Most high characters followed by a low character: latin1

This may depend on the language: IIUC, the sequences ää öö are very common in
Finnish, çà is a valid French word (as in: çà et là), "paragraphs" is commonly
abbreviated to §§ etc. Also, some "high" characters may be repeated for
line-drawing or underlining purposes (I underline the main title with ÷÷÷÷÷÷÷
in the files where I want to enforce Latin1 'fileencoding'). But I suppose
that in general it is true. If I were you I would try to find some Finnish
text in Latin1 to check the validity of this part of the algorithm. (Maybe get
some pages of fi.wikipedia.org and make sure to store them locally in Latin1,
not in UTF-8.) -- Or maybe disregard repeated characters, which would take
care of Finnish, of §§, and of underlining; just leave some "margin of error"
for sequences like French çà etc.
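
To make the "disregard repeated characters" idea concrete, here is a rough
Python sketch. It is not how tellenc does it; the function name
high_then_low_ratio is made up, and the cut-off applied to the resulting
ratio (the "margin of error") is left open.

```python
def high_then_low_ratio(data: bytes) -> float:
    """Fraction of high bytes (>= 0x80) that are followed by a low byte.

    A high byte followed by the very same byte (as in ää, §§ or a run of ÷)
    is skipped, so only the last byte of such a run is counted.
    """
    counted = followed_by_low = 0
    for cur, nxt in zip(data, data[1:]):
        if cur < 0x80 or nxt == cur:
            continue  # low byte, or a repeated high byte: not counted
        counted += 1
        if nxt < 0x80:
            followed_by_low += 1
    return followed_by_low / counted if counted else 0.0
```

A ratio close to 1.0 would then suggest Latin1, while a sequence like French
çà lowers it a little without necessarily changing the verdict.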

Also, maybe refine it according to: latin1 if there are no bytes in the range
80-9F, otherwise Windows-1252.
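
That refinement is a one-line test. A sketch in Python, with the function name
and the returned labels chosen only for illustration:

```python
def latin1_or_cp1252(data: bytes) -> str:
    """Return 'cp1252' if any byte falls in the range 80-9F, else 'latin1'.

    In Latin1 that range holds only C1 control codes, which are rare in
    text; Windows-1252 puts printable characters there (curly quotes,
    the euro sign, dashes).
    """
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "latin1"
```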

- Frequency analysis of DBCS characters: gbk (gb2312) and big5
- Otherwise: unknown
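
The ASCII, UTF-8 and NUL-distribution steps of the list can be sketched in
Python as below. This is an illustration under stated assumptions, not
tellenc's code: the function names and the ratio of 4 are invented here, and
the post does not say how lopsided the NUL counts must be.

```python
def is_ascii(data: bytes) -> bool:
    """True if no byte has the high bit set."""
    return all(b < 0x80 for b in data)


def is_utf8(data: bytes) -> bool:
    """True if the whole buffer decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def guess_utf16(data: bytes, ratio: float = 4.0):
    """Guess the UTF-16 byte order from where the NUL bytes fall."""
    even = data[0::2].count(0)  # NULs at offsets 0, 2, 4, ...
    odd = data[1::2].count(0)   # NULs at offsets 1, 3, 5, ...
    # ratio is a hypothetical measure of "uneven"; tune it on real files
    if odd > ratio * max(even, 1):
        return "utf-16le"  # low byte first, so NULs sit at odd offsets
    if even > ratio * max(odd, 1):
        return "utf-16be"
    return None
```

Note that NUL is itself below 0x80, so a BOM-less UTF-16 file of plain English
text would also pass is_ascii as written; the order of the checks and the
treatment of NUL matter here.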

I believe the frequency analysis can be applied at least to Japanese
and Korean, but I do not know the languages and have no data. If you
are Japanese or Korean, you may want to use "tellenc -v" on your text
files and come up with some useful data to put into the program.
Patches are welcome, though I admit it is not well commented or
documented now: given enough interest, I will refactor and enhance the
program as need be.

I suppose Japanese and Korean text can be got from the web, either from the
respective Wikipedias or from newspaper sites. gvim can, I suppose, convert
the text from the encoding mentioned in the web page's HTTP headers to UTF-8
and to the other encodings common for that language. The "Han characters" and
national phonograms used in both languages should be easily distinguishable
from gibberish (when looked at with a proper font, of course), even to someone
who doesn't know the language, so I expect that a "wrong encoding" would give
the page an "obviously wrong" look.
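
The same conversion can be scripted. Below is a rough Python sketch for
producing test files in several Japanese encodings and counting double-byte
sequences in them. The candidate list, the function names and the counting
rule are assumptions for illustration; what "tellenc -v" actually reports is
not described in this thread. The codec names are those of Python's standard
library.

```python
from collections import Counter

# Encodings commonly used for Japanese text (an assumed, non-exhaustive list)
CANDIDATES = ["utf-8", "euc_jp", "shift_jis", "iso2022_jp"]


def encode_all(text: str) -> dict:
    """Encode the same text in each candidate encoding."""
    return {name: text.encode(name) for name in CANDIDATES}


def roundtrips(data: bytes, encoding: str, expected: str) -> bool:
    """True only if decoding with `encoding` gives back the original text."""
    try:
        return data.decode(encoding) == expected
    except UnicodeDecodeError:
        return False


def dbcs_pair_counts(data: bytes) -> Counter:
    """Count two-byte sequences whose lead byte is >= 0x80."""
    counts = Counter()
    i = 0
    while i + 1 < len(data):
        if data[i] >= 0x80:
            counts[data[i:i + 2]] += 1
            i += 2  # skip the trail byte
        else:
            i += 1
    return counts
```

Feeding a few pages of real text through dbcs_pair_counts for each encoding
would give the kind of frequency table the analysis needs, and roundtrips
shows quickly that bytes written in one encoding do not survive being read
in another.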

This script MultiEnc.vim does these things to decide the encoding of a
file:


- If a file has a modeline fileencoding=..., it will be used as the
encoding to open the file.
- If a file is an HTML file, and it has the encoding specified with a
HTTP-EQUIV meta tag, it will be used as the encoding to open the file.
The file pattern of HTML files can be customized by the global
variable multienc_html_patterns.
- If a file cannot be decided by the steps above, tellenc may be used
to decide its encoding. This includes HTML files without a suitable
HTTP-EQUIV meta tag, and additional files can be detected with the
global variable multienc_auto_patterns.
- Autodetection can be run manually with the command
EditAutoEncoding (without a file name for the current buffer, or with
a file name to edit a new file).
- The autodetection may be overridden with the command
EditManualEncoding ("e ++enc=" may not work in some cases now).
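
The order of the first three steps can be illustrated with a short Python
sketch. It is not the plugin's Vim script: the regular expressions, the
function decide_encoding, and the fallback callable standing in for a run of
tellenc are all assumptions, and only the first and last five lines are
searched for a modeline (Vim's default 'modelines' value).

```python
import re

# Hypothetical patterns; the plugin's real matching rules may differ.
MODELINE_RE = re.compile(rb"\bvim:.*\b(?:fileencoding|fenc)=([\w-]+)")
META_RE = re.compile(
    rb"<meta[^>]*http-equiv[^>]*charset=[\"']?([\w-]+)", re.IGNORECASE)


def decide_encoding(data: bytes, is_html: bool, fallback):
    """Modeline first, then the HTML meta tag, then the external detector."""
    lines = data.splitlines()
    for line in lines[:5] + lines[-5:]:
        m = MODELINE_RE.search(line)
        if m:
            return m.group(1).decode("ascii")
    if is_html:
        m = META_RE.search(data)
        if m:
            return m.group(1).decode("ascii")
    # fallback stands in for calling tellenc on the file
    return fallback(data)
```

An explicit declaration inside the file wins over any guess, which is why the
detector is consulted only last.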

The program used to tell the encoding of a file is "tellenc" by
default. It can also be changed with the environment variable
MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:

[...]


Best regards,
Tony.
--
Love means having to say you're sorry every five minutes.
