Yongwei Wu wrote:
The Vim option 'fileencodings' has some limitations: e.g., it cannot
autodetect GBK and Big5 files at the same time. That was my first
motivation to develop a solution for it. It has two parts: a generic
C++ program to decide the encoding of a file, and a Vim plugin to use
this program.

The program tellenc tells the encoding of a file according to the following:

- Presence of any BOM character: The Unicode encoding of the BOM

Don't forget to test for UTF-32 before UTF-16, because the UTF-32le BOM (FF FE 00 00) begins with the UTF-16le BOM (FF FE).
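
To make the ordering concrete, here is a minimal sketch in Python. It is not taken from tellenc; the function name and the returned labels are made up for illustration. The only point is that the four-byte BOMs are tried before the two-byte ones.

```python
import codecs

# Longest BOMs first: the UTF-32le BOM (FF FE 00 00) starts with the
# UTF-16le BOM (FF FE), so UTF-32 has to be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]

def encoding_from_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A UTF-16le file whose first character after the BOM is U+0000 would still be reported as UTF-32le by this sketch; the BOM alone cannot tell the two apart.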

- Absence of non-ASCII characters: ascii
- UTF-8 decodable: utf-8
- Uneven distribution of NULs in odd and even positions of the file: utf-16(le)
- Strange characters and not a Unicode encoding decided above: binary

Hm, yes, maybe a "sufficiently high" proportion of bytes in the range 00-1F other than carriage-return and line-feed.
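
As a rough illustration of that idea, the proportion could be computed like this in Python. The names and the 5% threshold are assumptions chosen for the example, not values used by tellenc. The sketch follows the wording above literally, so TAB counts as suspicious too; a real check would probably exempt it.

```python
def control_ratio(data: bytes) -> float:
    """Fraction of bytes in the range 00-1F other than CR and LF."""
    if not data:
        return 0.0
    suspicious = sum(1 for b in data if b < 0x20 and b not in (0x0A, 0x0D))
    return suspicious / len(data)

def looks_binary(data: bytes, threshold: float = 0.05) -> bool:
    """Guess 'binary' when the control-byte ratio exceeds the threshold.

    The threshold is an arbitrary placeholder; it would need tuning
    against real files.
    """
    return control_ratio(data) > threshold
```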

- Most high characters followed by a low character: latin1

This may depend on the language: IIUC, the sequences ää öö are very common in Finnish, çà is a valid French word (as in: çà et là), "paragraphs" is commonly abbreviated to §§ etc. Also, some "high" characters may be repeated for line-drawing or underlining purposes (I underline the main title with ÷÷÷÷÷÷÷ in the files where I want to enforce Latin1 'fileencoding'). But I suppose that in general it is true. If I were you I would try to find some Finnish text in Latin1 to check the validity of this part of the algorithm. (Maybe get some pages of fi.wikipedia.org and make sure to store them locally in Latin1, not in UTF-8.) -- Or maybe disregard repeated characters, which would take care of Finnish, of §§, and of underlining; just leave some "margin of error" for sequences like French çà etc.
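
A small Python sketch of the "disregard repeated characters" variant follows. It is only an illustration under my own assumptions: "high" means a byte of 0x80 or above, "low" means below 0x80, and the function name is hypothetical. It reports the ratio and leaves the choice of a "margin of error" to the caller.

```python
def high_low_ratio(data: bytes, ignore_repeats: bool = True) -> float:
    """Among high bytes (>= 0x80) that have a successor, return the
    fraction that are followed by a low byte (< 0x80).

    With ignore_repeats, a high byte followed by the same byte is not
    counted at all (doubled vowels, a doubled section sign, or a run of
    one character used as an underline).
    """
    followed_by_low = 0
    counted = 0
    for cur, nxt in zip(data, data[1:]):
        if cur < 0x80:
            continue
        if ignore_repeats and nxt == cur:
            continue
        counted += 1
        if nxt < 0x80:
            followed_by_low += 1
    return followed_by_low / counted if counted else 0.0
```

Running it on Latin1 samples with and without `ignore_repeats` would show how much the repeated-character cases actually affect the ratio for a given language.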

Also, maybe refine it according to: latin1 if there are no bytes in the range 80-9F, otherwise Windows-1252.
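
That refinement is a one-line test. A sketch in Python, assuming the data has already been judged to be single-byte Western text by the earlier steps (the function name and the returned labels are illustrative):

```python
def latin1_or_cp1252(data: bytes) -> str:
    """Pick Windows-1252 if any byte falls in 80-9F, else Latin1.

    In Latin1 the range 80-9F holds C1 control codes, which are rare in
    text; Windows-1252 uses most of it for printable characters such as
    curly quotes and the euro sign.
    """
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "latin1"
```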

- Frequency analysis of DBCS characters: gbk (gb2312) and big5
- Otherwise: unknown

I believe the frequency analysis can be applied at least to Japanese
and Korean, but I do not know the languages and have no data. If you
are Japanese or Korean, you may want to use "tellenc -v" on your text
files and come up with some useful data to put into the program.
Patches are welcome, though I admit it is not well commented or
documented now: given enough interest, I will refactor and enhance the
program as need be.

I suppose Japanese and Korean text can be got from the web, either from the respective Wikipedias or from newspaper sites. gvim can, I suppose, convert the text from the encoding mentioned in the web page's HTTP headers to UTF-8 and to the other encodings common for that language. The "Han characters" and national phonograms used in both languages should be easily distinguishable from gibberish (when looked at with a proper font, of course), even to someone who doesn't know the language, so I expect that a "wrong encoding" would give the page an "obviously wrong" look.
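
If gvim is not at hand, the same conversion can be scripted. Below is a hedged Python sketch that takes text already decoded to Unicode and re-encodes it into several legacy encodings to produce test samples. The encoding lists are my own guess at the "common" ones for each language (all are codec names shipped with Python), and the helper name is hypothetical.

```python
# Assumed candidate encodings; adjust to whatever is actually in use.
JAPANESE = ["euc_jp", "shift_jis", "iso2022_jp"]
KOREAN = ["euc_kr", "cp949"]

def encode_samples(text: str, encodings):
    """Return {encoding name: encoded bytes} for each encoding that can
    represent the whole text; encodings that cannot are skipped."""
    samples = {}
    for enc in encodings:
        try:
            samples[enc] = text.encode(enc)
        except UnicodeEncodeError:
            pass  # the text contains characters outside this encoding
    return samples
```

Each resulting byte string can be written to its own file and fed to "tellenc -v" to collect data.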


This script MultiEnc.vim does these things to decide the encoding of a file:

- If a file has a modeline fileencoding=..., it will be used as the
encoding to open the file.
- If a file is an HTML file, and it has the encoding specified with a
HTTP-EQUIV meta tag, it will be used as the encoding to open the file.
The file pattern of HTML files can be customized by the global
variable multienc_html_patterns.
- If a file cannot be decided by the steps above, tellenc may be used
to decide its encoding. This includes HTML files without a suitable
HTTP-EQUIV meta tag, and additional files can be detected with the
global variable multienc_auto_patterns.
- A file can be manually autodetected with the command
EditAutoEncoding (without a file name for the current buffer, or with
a file name to edit a new file).
- The autodetection may be overridden with the command
EditManualEncoding ("e ++enc=" may not work in some cases now).
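
The order of the automatic steps above can be sketched as follows. This is Python rather than Vim script and is not the plugin's code: the regular expressions are simplified guesses, the `detect` callable stands in for a call to tellenc, the file-pattern variables are reduced to a tuple of suffixes, and the whole buffer head is searched instead of only the lines Vim would scan for modelines.

```python
import re

# Simplified patterns (assumptions, not the plugin's own).
MODELINE_RE = re.compile(
    rb"\b(?:vi|vim|ex):.*?\b(?:fenc|fileencoding)=([\w-]+)")
META_RE = re.compile(
    rb"<meta[^>]*http-equiv[^>]*charset=([\w-]+)", re.IGNORECASE)

def choose_encoding(name, head, detect=None,
                    html_suffixes=(".html", ".htm")):
    """Pick an encoding for file `name` given its first bytes `head`.

    Order: modeline, then HTTP-EQUIV meta tag (HTML files only), then
    the external detector if one is supplied. Returns None if nothing
    decides.
    """
    m = MODELINE_RE.search(head)
    if m:
        return m.group(1).decode("ascii")
    if name.lower().endswith(html_suffixes):
        m = META_RE.search(head)
        if m:
            return m.group(1).decode("ascii")
    if detect is not None:
        return detect(head)  # stand-in for running tellenc
    return None
```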

The program used to tell the encoding of a file is "tellenc" by
default. It can also be changed with the environment variable
MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:
[...]


Best regards,
Tony.
--
Love means having to say you're sorry every five minutes.
