Re: More than 'fileencodings': MultiEnc.vim and TellEnc

2007-02-25 Thread Yongwei Wu

A new version of tellenc has been uploaded at
http://wyw.dcweb.cn/download.asp?path=file=tellenc.zip.

On 2/25/07, Yongwei Wu [EMAIL PROTECTED] wrote:

  - Most high characters followed by a low character: latin1

 This may depend on the language: IIUC, the sequences ää and öö are very common
 in Finnish, çà is a valid French word (as in: çà et là), "paragraphs" is
 commonly abbreviated to §§, etc. Also, some high characters may be repeated
 for line-drawing or underlining purposes (I underline the main title with ÷÷÷
 in the files where I want to enforce the Latin1 'fileencoding'). But I suppose
 that in general it is true. If I were you, I would try to find some Finnish
 text in Latin1 to check the validity of this part of the algorithm. (Maybe get
 some pages of fi.wikipedia.org and make sure to store them locally in Latin1,
 not in UTF-8.) Or maybe disregard repeated characters, which would take care
 of Finnish, of §§, and of underlining; just leave some margin of error for
 sequences like French çà, etc.

Random French text passed the test, but random Finnish text failed
(got "unknown"). It seems "ää" occurs really often in Finnish text.


Finnish text now passes the test.


 Also, maybe refine it according to: latin1 if there are no bytes in the range
 80-9F, otherwise Windows-1252.

Good point.


Windows-1252 is now differentiated from Latin-1.

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/


Re: More than 'fileencodings': MultiEnc.vim and TellEnc

2007-02-24 Thread A.J.Mechelynck

Yongwei Wu wrote:

The Vim option 'fileencodings' has some limitations: e.g., it cannot
autodetect GBK and Big5 files at the same time. That was my first
motivation to develop a solution for it. It has two parts: a generic
C++ program to decide the encoding of a file, and a Vim plugin to use
this program.

The program tellenc tells the encoding of a file according to the following:

- Presence of any BOM character: The Unicode encoding of the BOM


Don't forget to test UTF-32 before UTF-16, because of the ambiguity between
FF FE 00 00 (UTF-32le) and FF FE (UTF-16le).
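
For illustration, a minimal sketch of such a BOM check, testing the longer
UTF-32 signatures first (this is only a sketch, not tellenc's actual code):

--
#include <cstring>
#include <string>

// Return the encoding indicated by a BOM, or "" if none is found.
// The UTF-32 BOMs must be tested before the UTF-16 ones, because
// FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
std::string check_bom(const unsigned char* buf, std::size_t len)
{
    if (len >= 4 && std::memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return "utf-32le";
    if (len >= 4 && std::memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return "utf-32";            // big-endian
    if (len >= 3 && std::memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "utf-8";
    if (len >= 2 && std::memcmp(buf, "\xFF\xFE", 2) == 0)
        return "utf-16le";
    if (len >= 2 && std::memcmp(buf, "\xFE\xFF", 2) == 0)
        return "utf-16";            // big-endian
    return "";
}
--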



- Absence of non-ASCII characters: ascii
- UTF-8 decodable: utf-8
- Uneven distribution of NULs in odd and even positions of the file: 
utf-16(le)
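
For illustration, a sketch of the odd/even NUL counting this heuristic implies
(the threshold is made up; this is not tellenc's actual code):

--
#include <cstddef>
#include <string>

// Guess UTF-16 byte order from where NUL bytes fall: in UTF-16LE text
// that is mostly ASCII, NULs cluster at odd (high-byte) offsets; in
// UTF-16BE they cluster at even offsets.
std::string guess_utf16(const unsigned char* buf, std::size_t len)
{
    std::size_t nul_even = 0, nul_odd = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (buf[i] == 0)
            ++(i % 2 == 0 ? nul_even : nul_odd);
    if (nul_even + nul_odd == 0)
        return "";                  // no NULs: probably not UTF-16
    if (nul_odd > nul_even * 4)     // arbitrary skew threshold
        return "utf-16le";
    if (nul_even > nul_odd * 4)
        return "utf-16";            // big-endian
    return "";
}
--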

- Strange characters and not a Unicode encoding decided above: binary


Hm, yes, maybe a sufficiently high proportion of bytes in the range 00-1F 
other than carriage-return and line-feed.



- Most high characters followed by a low character: latin1


This may depend on the language: IIUC, the sequences ää and öö are very common
in Finnish, çà is a valid French word (as in: çà et là), "paragraphs" is
commonly abbreviated to §§, etc. Also, some high characters may be repeated
for line-drawing or underlining purposes (I underline the main title with ÷÷÷
in the files where I want to enforce the Latin1 'fileencoding'). But I suppose
that in general it is true. If I were you, I would try to find some Finnish
text in Latin1 to check the validity of this part of the algorithm. (Maybe get
some pages of fi.wikipedia.org and make sure to store them locally in Latin1,
not in UTF-8.) Or maybe disregard repeated characters, which would take care
of Finnish, of §§, and of underlining; just leave some margin of error for
sequences like French çà, etc.


Also, maybe refine it according to: latin1 if there are no bytes in the range 
80-9F, otherwise Windows-1252.



- Frequency analysis of DBCS characters: gbk (gb2312) and big5
- Otherwise: unknown

I believe the frequency analysis can be applied at least to Japanese
and Korean, but I do not know the languages and have no data. If you
are Japanese or Korean, you may want to use tellenc -v on your text
files and come up with some useful data to put into the program.
Patches are welcome, though I admit it is not well commented or
documented now: given enough interest, I will refactor and enhance the
program as need be.


I suppose Japanese and Korean text can be obtained from the web, either from the
respective Wikipedias or from newspaper sites. gvim can, I suppose, convert 
the text from the encoding mentioned in the web page's HTTP headers to UTF-8 
and to the other encodings common for that language. The Han characters and 
national phonograms used in both languages should be easily distinguishable 
from gibberish (when looked at with a proper font, of course), even to someone 
who doesn't know the language, so I expect that a wrong encoding would give 
the page an obviously wrong look.
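
For concreteness, the kind of byte-pair frequency analysis mentioned above
might look like the sketch below. The byte values are illustrative (e.g. the
very common character U+7684 encodes as B5 C4 in GBK but AA BA in Big5, and
U+4E00 as D2 BB vs. A4 40); tellenc's actual tables and scoring may differ.

--
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Count occurrences of byte pairs for a few very common characters in
// each candidate encoding and pick the higher-scoring one.
std::string guess_dbcs(const unsigned char* buf, std::size_t len)
{
    typedef std::pair<unsigned char, unsigned char> Pair;
    std::map<Pair, int> gbk_common, big5_common;
    gbk_common[Pair(0xB5, 0xC4)] = 1;   // GBK  "de" (U+7684)
    gbk_common[Pair(0xD2, 0xBB)] = 1;   // GBK  "yi" (U+4E00)
    big5_common[Pair(0xAA, 0xBA)] = 1;  // Big5 "de" (U+7684)
    big5_common[Pair(0xA4, 0x40)] = 1;  // Big5 "yi" (U+4E00)

    long gbk_score = 0, big5_score = 0;
    for (std::size_t i = 0; i + 1 < len; ++i) {
        if (buf[i] < 0x80)
            continue;                   // DBCS lead bytes are high bytes
        Pair p(buf[i], buf[i + 1]);
        if (gbk_common.count(p))
            ++gbk_score;
        if (big5_common.count(p))
            ++big5_score;
        ++i;                            // skip the trail byte
    }
    if (gbk_score == 0 && big5_score == 0)
        return "unknown";
    return gbk_score >= big5_score ? "gbk" : "big5";
}
--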




This script MultiEnc.vim does these things to decide the encoding of a 
file:


- If a file has a modeline fileencoding=..., it will be used as the
encoding to open the file.
- If a file is an HTML file, and it has the encoding specified with a
HTTP-EQUIV meta tag, it will be used as the encoding to open the file.
The file pattern of HTML files can be customized by the global
variable multienc_html_patterns.
- If a file's encoding cannot be decided by the steps above, tellenc may
be used to decide it. This includes HTML files without a suitable
HTTP-EQUIV meta tag, and additional files can be detected with the
global variable multienc_auto_patterns.
- Autodetection can be invoked manually with the command
EditAutoEncoding (without a file name for the current buffer, or with
a file name to edit a new file).
- The autodetection may be overridden with the command
EditManualEncoding (:e ++enc=... may not work in some cases now).

The program used to tell the encoding of a file is tellenc by
default. It can also be changed with the environment variable
MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:

[...]


Best regards,
Tony.
--
Love means having to say you're sorry every five minutes.



Re: More than 'fileencodings': MultiEnc.vim and TellEnc

2007-02-24 Thread Yongwei Wu

Hi Tony,

On 2/25/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:

Yongwei Wu wrote:
 The Vim option 'fileencodings' has some limitations: e.g., it cannot
 autodetect GBK and Big5 files at the same time. That was my first
 motivation to develop a solution for it. It has two parts: a generic
 C++ program to decide the encoding of a file, and a Vim plugin to use
 this program.

 The program tellenc tells the encoding of a file according to the following:

 - Presence of any BOM character: The Unicode encoding of the BOM

Don't forget to test UTF-32 before UTF-16, because of the ambiguity between
FF FE 00 00 (UTF-32le) and FF FE (UTF-16le).


Yes, it is done that way.


 - Absence of non-ASCII characters: ascii
 - UTF-8 decodable: utf-8
 - Uneven distribution of NULs in odd and even positions of the file:
 utf-16(le)
 - Strange characters and not a Unicode encoding decided above: binary

Hm, yes, maybe a sufficiently high proportion of bytes in the range 00-1F
other than carriage-return and line-feed.


Currently I test for 0x00 (NUL), 0x1A (DOS/Windows EOF), 0x7F, and 0xFF.
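
For illustration, that check as a sketch (the byte list is exactly the one
mentioned above; everything else is made up, not tellenc's actual code):

--
#include <cstddef>

// Treat a file as binary if it contains any of these bytes.  In the
// heuristic chain this runs only after the Unicode checks, since NULs
// and FF bytes are also meaningful in UTF-16 and the BOMs.
bool looks_binary(const unsigned char* buf, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i) {
        switch (buf[i]) {
        case 0x00:  // NUL
        case 0x1A:  // DOS/Windows EOF (Ctrl-Z)
        case 0x7F:  // DEL
        case 0xFF:
            return true;
        }
    }
    return false;
}
--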


 - Most high characters followed by a low character: latin1

This may depend on the language: IIUC, the sequences ää and öö are very common
in Finnish, çà is a valid French word (as in: çà et là), "paragraphs" is
commonly abbreviated to §§, etc. Also, some high characters may be repeated
for line-drawing or underlining purposes (I underline the main title with ÷÷÷
in the files where I want to enforce the Latin1 'fileencoding'). But I suppose
that in general it is true. If I were you, I would try to find some Finnish
text in Latin1 to check the validity of this part of the algorithm. (Maybe get
some pages of fi.wikipedia.org and make sure to store them locally in Latin1,
not in UTF-8.) Or maybe disregard repeated characters, which would take care
of Finnish, of §§, and of underlining; just leave some margin of error for
sequences like French çà, etc.


Random French text passed the test, but random Finnish text failed
(got "unknown"). It seems "ää" occurs really often in Finnish text.


Also, maybe refine it according to: latin1 if there are no bytes in the range
80-9F, otherwise Windows-1252.


Good point.
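
As a sketch, the refinement could be as simple as this (illustrative, not
tellenc's actual code): bytes in 80-9F are C1 control codes in Latin-1 but
printable characters (curly quotes, dashes, etc.) in Windows-1252.

--
#include <cstddef>
#include <string>

// Once a file already looks like an 8-bit Western encoding, report
// latin1 unless it uses any byte in the range 80-9F.
std::string refine_latin1(const unsigned char* buf, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i)
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            return "windows-1252";
    return "latin1";
}
--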


 - Frequency analysis of DBCS characters: gbk (gb2312) and big5
 - Otherwise: unknown

 I believe the frequency analysis can be applied at least to Japanese
 and Korean, but I do not know the languages and have no data. If you
 are Japanese or Korean, you may want to use tellenc -v on your text
 files and come up with some useful data to put into the program.
 Patches are welcome, though I admit it is not well commented or
 documented now: given enough interest, I will refactor and enhance the
 program as need be.

I suppose Japanese and Korean text can be obtained from the web, either from the
respective Wikipedias or from newspaper sites. gvim can, I suppose, convert
the text from the encoding mentioned in the web page's HTTP headers to UTF-8
and to the other encodings common for that language. The Han characters and
national phonograms used in both languages should be easily distinguishable
from gibberish (when looked at with a proper font, of course), even to someone
who doesn't know the language, so I expect that a wrong encoding would give
the page an obviously wrong look.


I can certainly do this, but I believe a native speaker may do it better.
Certainly it is an area I can work on if no one else volunteers.


 This script MultiEnc.vim does these things to decide the encoding of a
 file:

 - If a file has a modeline fileencoding=..., it will be used as the
 encoding to open the file.
 - If a file is an HTML file, and it has the encoding specified with a
 HTTP-EQUIV meta tag, it will be used as the encoding to open the file.
 The file pattern of HTML files can be customized by the global
 variable multienc_html_patterns.
 - If a file's encoding cannot be decided by the steps above, tellenc may
 be used to decide it. This includes HTML files without a suitable
 HTTP-EQUIV meta tag, and additional files can be detected with the
 global variable multienc_auto_patterns.
 - Autodetection can be invoked manually with the command
 EditAutoEncoding (without a file name for the current buffer, or with
 a file name to edit a new file).
 - The autodetection may be overridden with the command
 EditManualEncoding (:e ++enc=... may not work in some cases now).

 The program used to tell the encoding of a file is tellenc by
 default. It can also be changed with the environment variable
 MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:
[...]


Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/


Re: More than 'fileencodings': MultiEnc.vim and TellEnc

2007-02-24 Thread mbbill
Hello Yongwei,

try FencView.vim
http://www.vim.org/scripts/script.php?script_id=1708

Saturday, February 24, 2007, 11:31:40 PM, you wrote:

The Vim option 'fileencodings' has some limitations: e.g., it cannot
autodetect GBK and Big5 files at the same time. That was my first
motivation to develop a solution for it. It has two parts: a generic
C++ program to decide the encoding of a file, and a Vim plugin to use
this program.

The program tellenc tells the encoding of a file according to the following:

- Presence of any BOM character: The Unicode encoding of the BOM
- Absence of non-ASCII characters: ascii
- UTF-8 decodable: utf-8
- Uneven distribution of NULs in odd and even positions of the file:
utf-16(le)
- Strange characters and not a Unicode encoding decided above: binary
- Most high characters followed by a low character: latin1
- Frequency analysis of DBCS characters: gbk (gb2312) and big5
- Otherwise: unknown

I believe the frequency analysis can be applied at least to Japanese
and Korean, but I do not know the languages and have no data. If you
are Japanese or Korean, you may want to use tellenc -v on your text
files and come up with some useful data to put into the program.
Patches are welcome, though I admit it is not well commented or
documented now: given enough interest, I will refactor and enhance the
program as need be.

This script MultiEnc.vim does these things to decide the encoding of a file:

- If a file has a modeline fileencoding=..., it will be used as the
encoding to open the file.
- If a file is an HTML file, and it has the encoding specified with a
HTTP-EQUIV meta tag, it will be used as the encoding to open the file.
The file pattern of HTML files can be customized by the global
variable multienc_html_patterns.
- If a file's encoding cannot be decided by the steps above, tellenc may
be used to decide it. This includes HTML files without a suitable
HTTP-EQUIV meta tag, and additional files can be detected with the
global variable multienc_auto_patterns.
- Autodetection can be invoked manually with the command
EditAutoEncoding (without a file name for the current buffer, or with
a file name to edit a new file).
- The autodetection may be overridden with the command
EditManualEncoding (:e ++enc=... may not work in some cases now).

The program used to tell the encoding of a file is tellenc by
default. It can also be changed with the environment variable
MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:

--
" Legacy encoding is the system default encoding
let g:legacy_encoding = &encoding

source $VIMRUNTIME/vimrc_example.vim
source $VIMRUNTIME/mswin.vim

if has('gui_running')
  set encoding=utf-8
else
  if &termencoding != '' && &termencoding != &encoding
    let &encoding = &termencoding
    let &fileencodings = 'ucs-bom,utf-8,' . &encoding
  endif
endif

" Set default file encoding(s) to the legacy encoding
exec 'set fileencoding=' . g:legacy_encoding
let &fileencodings = substitute(
\ &fileencodings, '\<default\>', g:legacy_encoding, '')

" File patterns of files for automatic encoding detection
let multienc_auto_patterns = '*.txt,*.tex'
let multienc_html_patterns = '*.htm{l\=},*.asp'
--

It is currently only tested on Windows. While I believe it should work
on other platforms as well, there might be things I missed. Patches
and bug reports are welcome.

MultiEnc.vim is available at:
  http://www.vim.org/scripts/script.php?script_id=1806

Tellenc is available at:
  http://wyw.dcweb.cn/download.asp?path=file=tellenc.zip

Thanks to Tony and Benji for encouraging me to make it into a separate script.

A question for Bram: is there any way to extend Vim with DLLs? Starting an
external program with system(...) is sometimes slow on Windows, and
there will be a flashing command window, which is visible in some
cases, esp. on slower machines.

Best regards,

Yongwei




-- 
Best regards,
 mbbill  mailto:[EMAIL PROTECTED]



Re: More than 'fileencodings': MultiEnc.vim and TellEnc

2007-02-24 Thread A.J.Mechelynck

Yongwei Wu wrote:
[...]

Random French text passed the test, but random Finnish text failed
(got "unknown"). It seems "ää" occurs really often in Finnish text.

[...]

Yes indeed: e.g. "Hello" (word for word: "Good day") is "hyvää päivää" in
Finnish. One of the few phrases I know in that language. ;-)


That's one of the reasons I suggested altering the test to:

Few or no sequences of 2 or more _different_ high characters: Latin1 if no 
bytes in the range 80-9F, otherwise Windows-1252.
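
As a sketch, the "different high characters" part might be counted like this
(illustrative only; the threshold and plumbing are left to the caller):

--
#include <cstddef>

// Count sequences of two adjacent but *different* high (>= 0x80)
// bytes, so that repeats like "ää", "§§" or a ÷÷÷ underline do not
// count against Latin1/Windows-1252.
std::size_t count_mixed_high_pairs(const unsigned char* buf, std::size_t len)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i + 1 < len; ++i)
        if (buf[i] >= 0x80 && buf[i + 1] >= 0x80 && buf[i] != buf[i + 1])
            ++count;
    return count;
}
--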



Best regards,
Tony.
--
Q:  Why did the tachyon cross the road?
A:  Because it was on the other side.