Hi Tony,

On 2/25/07, A.J.Mechelynck <[EMAIL PROTECTED]> wrote:
Yongwei Wu wrote:
> The Vim option 'fileencodings' has some limitations: e.g., it cannot
> autodetect GBK and Big5 files at the same time. That was my first
> motivation to develop a solution for it. It has two parts: a generic
> C++ program to decide the encoding of a file, and a Vim plugin to use
> this program.
>
> The program tellenc tells the encoding of a file according to the following:
>
> - Presence of any BOM character: The Unicode encoding of the BOM

Don't forget to test UTF-32 before UTF-16 because of the ambiguity between FF
FE 00 00 (UTF-32le) vs. FF FE (UTF-16le).

Yes, it is done that way.
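
In sketch form, the BOM check orders the longer signatures first, something
like this (a simplified illustration, not the actual tellenc source):

    #include <cstddef>

    // Check the four-byte UTF-32 BOMs before the two-byte UTF-16 ones,
    // because FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
    static const char* check_bom(const unsigned char* p, size_t len)
    {
        if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE &&
                p[2] == 0x00 && p[3] == 0x00)
            return "utf-32le";
        if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 &&
                p[2] == 0xFE && p[3] == 0xFF)
            return "utf-32be";
        if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
            return "utf-8";
        if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
            return "utf-16le";
        if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
            return "utf-16be";
        return 0;  // no BOM
    }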

> - Absence of non-ASCII characters: ascii
> - UTF-8 decodable: utf-8
> - Uneven distribution of NULs in odd and even positions of the file:
> utf-16(le)
> - Strange characters and not a Unicode encoding decided above: binary

Hm, yes, maybe a "sufficiently high" proportion of bytes in the range 00-1F
other than carriage-return and line-feed.

Currently I test for 0x00 (NUL), 0x1A (DOS/Windows EOF), 0x7F, and 0xFF.
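
In rough, illustrative code (the byte list is what I test now; the UTF-16
counting and thresholds below are simplified for the sketch):

    #include <cstddef>

    // The "strange character" test, roughly:
    static bool is_binary_char(unsigned char c)
    {
        return c == 0x00 || c == 0x1A || c == 0x7F || c == 0xFF;
    }

    // Count NULs at even and odd offsets; a strong imbalance suggests
    // UTF-16, and which side dominates suggests the byte order.
    static const char* guess_utf16(const unsigned char* p, size_t len)
    {
        size_t nul_even = 0, nul_odd = 0;
        for (size_t i = 0; i < len; ++i)
            if (p[i] == 0x00)
                (i % 2 == 0 ? nul_even : nul_odd)++;
        size_t total = nul_even + nul_odd;
        if (total == 0 || total * 20 < len)
            return 0;               // too few NULs: probably not UTF-16
        // For mostly-ASCII text the high byte of each UTF-16LE code unit
        // is the NUL, and it sits at the odd offset.
        if (nul_odd > nul_even * 4)
            return "utf-16le";
        if (nul_even > nul_odd * 4)
            return "utf-16be";
        return 0;
    }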

> - Most high characters are followed by a low character: latin1

This may depend on the language: IIUC, the sequences ää öö are very common in
Finnish, çà is a valid French word (as in: çà et là), "paragraphs" is commonly
abbreviated to §§ etc. Also, some "high" characters may be repeated for
line-drawing or underlining purposes (I underline the main title with ÷÷÷÷÷÷÷
in the files where I want to enforce Latin1 'fileencoding'). But I suppose
that in general it is true. If I were you I would try to find some Finnish
text in Latin1 to check the validity of this part of the algorithm. (Maybe get
some pages of fi.wikipedia.org and make sure to store them locally in Latin1,
not in UTF-8.) -- Or maybe disregard repeated characters, which would take
care of Finnish, of §§, and of underlining; just leave some "margin of error"
for sequences like French çà etc.

Random French text passed the test, but random Finnish text failed
(got "unknown"). It seems "ää" occurs really often in Finnish text.

Also, maybe refine it according to: latin1 if there are no bytes in the range
80-9F, otherwise Windows-1252.

Good point.
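
Something along these lines, I suppose (a minimal sketch with an invented
helper name):

    #include <cstddef>

    // latin1 vs. cp1252: bytes in 0x80-0x9F are C1 control codes in
    // Latin-1 but printable characters (curly quotes, the euro sign,
    // etc.) in Windows-1252.
    static const char* refine_latin1(const unsigned char* p, size_t len)
    {
        for (size_t i = 0; i < len; ++i)
            if (p[i] >= 0x80 && p[i] <= 0x9F)
                return "cp1252";
        return "latin1";
    }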

> - Frequency analysis of DBCS characters: gbk (gb2312) and big5
> - Otherwise: unknown
>
> I believe the frequency analysis can be applied at least to Japanese
> and Korean, but I do not know the languages and have no data. If you
> are Japanese or Korean, you may want to use "tellenc -v" on your text
> files and come up with some useful data to put into the program.
> Patches are welcome, though I admit it is not well commented or
> documented now: given enough interest, I will refactor and enhance the
> program as needed.

I suppose Japanese and Korean text can be obtained from the web, either from the
respective Wikipedias or from newspaper sites. gvim can, I suppose, convert
the text from the encoding mentioned in the web page's HTTP headers to UTF-8
and to the other encodings common for that language. The "Han characters" and
national phonograms used in both languages should be easily distinguishable
from gibberish (when looked at with a proper font, of course), even to someone
who doesn't know the language, so I expect that a "wrong encoding" would give
the page an "obviously wrong" look.

I can certainly do this, but I believe a native speaker could do it better.
It is certainly an area I can work on if no one else volunteers.
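
For reference, the frequency analysis boils down to roughly the following;
the frequent-character tables and the thresholds here are placeholders, not
the real tellenc data:

    #include <cstddef>
    #include <set>
    #include <utility>

    typedef std::pair<unsigned char, unsigned char> dbcs_char;

    // Count how many of the double-byte pairs in the text hit a table of
    // characters known to be frequent in each encoding, then pick the
    // encoding with the better hit rate.
    static const char* guess_dbcs(const unsigned char* p, size_t len,
                                  const std::set<dbcs_char>& freq_gbk,
                                  const std::set<dbcs_char>& freq_big5)
    {
        size_t pairs = 0, hits_gbk = 0, hits_big5 = 0;
        for (size_t i = 0; i + 1 < len; ++i) {
            if (p[i] < 0x80)
                continue;                   // not a DBCS lead byte
            dbcs_char c(p[i], p[i + 1]);
            ++pairs;
            if (freq_gbk.count(c))
                ++hits_gbk;
            if (freq_big5.count(c))
                ++hits_big5;
            ++i;                            // skip the trail byte
        }
        if (pairs == 0)
            return "unknown";
        // invented thresholds: require a decent hit rate and a clear winner
        if (hits_gbk * 2 >= pairs && hits_gbk > hits_big5)
            return "gbk";
        if (hits_big5 * 2 >= pairs && hits_big5 > hits_gbk)
            return "big5";
        return "unknown";
    }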

> This script MultiEnc.vim does these things to decide the encoding of a
> file:
>
> - If a file has a modeline fileencoding=..., it will be used as the
> encoding to open the file.
> - If a file is an HTML file, and it has the encoding specified with an
> HTTP-EQUIV meta tag, it will be used as the encoding to open the file.
> The file pattern of HTML files can be customized by the global
> variable multienc_html_patterns.
> - If a file's encoding cannot be decided by the steps above, tellenc may
> be used to decide it. This includes HTML files without a suitable
> HTTP-EQUIV meta tag, and additional files can be detected with the
> global variable multienc_auto_patterns.
> - A file can be manually autodetected with the command
> EditAutoEncoding (without a file name for the current buffer, or with
> a file name to edit a new file).
> - The autodetection may be overridden with the command
> EditManualEncoding ("e ++enc=" may not work in some cases now).
>
> The program used to tell the encoding of a file is "tellenc" by
> default. It can be changed with the environment variable
> MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may look like:
[...]
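
Regarding the HTTP-EQUIV step above: the charset extraction amounts to
something like this rough sketch of the idea (shown in C++ here; the plugin
itself does the matching in Vim script):

    #include <cctype>
    #include <string>

    // Look for a "charset=..." declaration (as in a Content-Type META
    // tag) near the top of an HTML file and return its value.
    static std::string html_charset(const std::string& head)
    {
        std::string s(head);
        for (std::string::size_type i = 0; i < s.size(); ++i)
            s[i] = (char)std::tolower((unsigned char)s[i]);
        std::string::size_type pos = s.find("charset=");
        if (pos == std::string::npos)
            return "";                          // nothing declared
        pos += 8;                               // skip "charset="
        std::string::size_type end = pos;
        while (end < s.size() && (std::isalnum((unsigned char)s[end]) ||
                                  s[end] == '-' || s[end] == '_'))
            ++end;
        return s.substr(pos, end - pos);        // e.g. "gb2312", "big5"
    }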

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/
