Re: More than 'fileencodings': MultiEnc.vim and TellEnc
New version of tellenc is uploaded at
http://wyw.dcweb.cn/download.asp?path=file=tellenc.zip.

On 2/25/07, Yongwei Wu [EMAIL PROTECTED] wrote:
> > > - Most high character followed by a low character: latin1
> >
> > This may depend on the language: IIUC, the sequences ää öö are very
> > common in Finnish, çà is a valid French word (as in: çà et là),
> > "paragraphs" is commonly abbreviated to §§, etc. Also, some high
> > characters may be repeated for line-drawing or underlining purposes
> > (I underline the main title with ÷÷÷ in the files where I want to
> > enforce a Latin1 'fileencoding'). But I suppose that in general it
> > is true. If I were you, I would try to find some Finnish text in
> > Latin1 to check the validity of this part of the algorithm. (Maybe
> > get some pages of fi.wikipedia.org and make sure to store them
> > locally in Latin1, not in UTF-8.) -- Or maybe disregard repeated
> > characters, which would take care of Finnish, of §§, and of
> > underlining; just leave some margin of error for sequences like
> > French çà etc.
>
> Random French text passed the test, but random Finnish text failed
> (got "unknown"). It seems ää occurs really often in Finnish text.

Finnish text now passes the test.

> > Also, maybe refine it according to: latin1 if there are no bytes in
> > the range 80-9F, otherwise Windows-1252.
>
> Good point.

Windows-1252 is now differentiated from Latin-1.

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/
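The new Latin-1/Windows-1252 differentiation rests on the fact that the
bytes 80-9F are C1 control codes in ISO-8859-1 but printable characters
in Windows-1252. A minimal sketch of that check (an illustration of the
rule, not tellenc's actual code; the function name is made up here):

```cpp
#include <cstddef>
#include <string>

// Once a buffer has been judged to be 8-bit western text, split
// Latin-1 from Windows-1252: bytes in 80-9F are C1 control codes in
// Latin-1 but printable characters (curly quotes, dashes, ...) in
// Windows-1252, so their presence suggests the latter.
std::string latin1_or_cp1252(const unsigned char* buf, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i)
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            return "windows-1252";
    return "latin1";
}
```

For example, a file containing only accented letters like é (0xE9) stays
latin1, while one containing a curly quote (0x93) is reported as
windows-1252.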
Re: More than 'fileencodings': MultiEnc.vim and TellEnc
Yongwei Wu wrote:
> The Vim option 'fileencodings' has some limitations: e.g., it cannot
> autodetect GBK and Big5 files at the same time. That was my first
> motivation to develop a solution for it. It has two parts: a generic
> C++ program to decide the encoding of a file, and a Vim plugin to use
> this program.
>
> The program tellenc tells the encoding of a file according to the
> following:
>
> - Presence of any BOM character: the Unicode encoding of the BOM

Don't forget to test UTF-32 before UTF-16 because of the ambiguity
between FF FE 00 00 (UTF-32LE) and FF FE (UTF-16LE).

> - Absence of non-ASCII characters: ascii
> - UTF-8 decodable: utf-8
> - Uneven distribution of NULs in odd and even positions of the file:
>   utf-16(le)
> - Strange characters and not a Unicode encoding decided above: binary

Hm, yes, maybe a sufficiently high proportion of bytes in the range
00-1F other than carriage-return and line-feed.

> - Most high character followed by a low character: latin1

This may depend on the language: IIUC, the sequences ää öö are very
common in Finnish, çà is a valid French word (as in: çà et là),
"paragraphs" is commonly abbreviated to §§, etc. Also, some high
characters may be repeated for line-drawing or underlining purposes (I
underline the main title with ÷÷÷ in the files where I want to enforce
a Latin1 'fileencoding'). But I suppose that in general it is true. If
I were you, I would try to find some Finnish text in Latin1 to check
the validity of this part of the algorithm. (Maybe get some pages of
fi.wikipedia.org and make sure to store them locally in Latin1, not in
UTF-8.) -- Or maybe disregard repeated characters, which would take
care of Finnish, of §§, and of underlining; just leave some margin of
error for sequences like French çà etc.

Also, maybe refine it according to: latin1 if there are no bytes in
the range 80-9F, otherwise Windows-1252.
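Tony's point about the FF FE ambiguity comes down to checking the
longer BOM patterns first. A sketch of such a check (an illustration,
not tellenc's actual code):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Return the encoding indicated by a leading BOM, or "" if none.
// The UTF-32 patterns must be tested before the UTF-16 ones, because
// FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
std::string bom_encoding(const unsigned char* buf, std::size_t len)
{
    static const struct {
        const char* bytes;
        std::size_t n;
        const char* name;
    } boms[] = {
        { "\x00\x00\xFE\xFF", 4, "utf-32be" },  // longest patterns first
        { "\xFF\xFE\x00\x00", 4, "utf-32le" },
        { "\xEF\xBB\xBF",     3, "utf-8"    },
        { "\xFE\xFF",         2, "utf-16be" },
        { "\xFF\xFE",         2, "utf-16le" },  // must follow utf-32le
    };
    for (const auto& b : boms)
        if (len >= b.n && std::memcmp(buf, b.bytes, b.n) == 0)
            return b.name;
    return "";
}
```

With this ordering, a UTF-32LE file is never misreported as UTF-16LE
just because its first two bytes happen to match.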
> - Frequency analysis of DBCS characters: gbk (gb2312) and big5
> - Otherwise: unknown
>
> I believe the frequency analysis can be applied at least to Japanese
> and Korean, but I do not know the languages and have no data. If you
> are Japanese or Korean, you may want to use "tellenc -v" on your text
> files and come up with some useful data to put into the program.
> Patches are welcome, though I admit it is not well commented or
> documented now: given enough interest, I will refactor and enhance
> the program as need be.

I suppose Japanese and Korean text can be got from the web, either
from the respective Wikipedias or from newspaper sites. gvim can, I
suppose, convert the text from the encoding mentioned in the web
page's HTTP headers to UTF-8 and to the other encodings common for
that language. The Han characters and national phonograms used in both
languages should be easily distinguishable from gibberish (when looked
at with a proper font, of course), even to someone who doesn't know
the language, so I expect that a wrong encoding would give the page an
obviously wrong look.

> This script MultiEnc.vim does these things to decide the encoding of
> a file:
>
> - If a file has a modeline "fileencoding=...", it will be used as the
>   encoding to open the file.
> - If a file is an HTML file, and it has the encoding specified with
>   an HTTP-EQUIV meta tag, it will be used as the encoding to open the
>   file. The file pattern of HTML files can be customized by the
>   global variable multienc_html_patterns.
> - If a file cannot be decided by the steps above, tellenc may be used
>   to decide its encoding. This includes HTML files without a suitable
>   HTTP-EQUIV meta tag, and additional files can be detected with the
>   global variable multienc_auto_patterns.
> - A file can be manually autodetected with the command
>   EditAutoEncoding (without a file name for the current buffer, or
>   with a file name to edit a new file).
> - The autodetection may be overridden with the command
>   EditManualEncoding ("e ++enc=" may not work in some cases now).
>
> The program used to tell the encoding of a file is tellenc by
> default. It can also be changed with the environment variable
> MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:
>
> [...]

Best regards,
Tony.

--
Love means having to say you're sorry every five minutes.
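The "uneven distribution of NULs" step in the list above exploits the
fact that UTF-16-encoded ASCII text puts a NUL in every other byte: in
the high byte of each 16-bit unit, which is the odd byte position for
little-endian and the even one for big-endian. A sketch with
illustrative thresholds (an assumption for illustration, not tellenc's
actual code):

```cpp
#include <cstddef>
#include <string>

// Guess UTF-16 byte order from where the NUL bytes fall, for BOM-less
// files whose characters are mostly ASCII.  The "len / 4" and "4x"
// thresholds are arbitrary illustrative choices.
std::string guess_utf16(const unsigned char* buf, std::size_t len)
{
    std::size_t nul_even = 0, nul_odd = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (buf[i] == 0)
            (i % 2 == 0 ? nul_even : nul_odd)++;

    if (nul_even + nul_odd < len / 4)
        return "";               // too few NULs to look like UTF-16 text
    if (nul_odd > nul_even * 4)
        return "utf-16le";       // NULs cluster in the odd (high) bytes
    if (nul_even > nul_odd * 4)
        return "utf-16be";
    return "";                   // NULs evenly spread: probably binary
}
```

An evenly spread NUL count falls through, matching the thread's point
that balanced "strange characters" suggest a binary file instead.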
Re: More than 'fileencodings': MultiEnc.vim and TellEnc
Hi Tony,

On 2/25/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:
> Yongwei Wu wrote:
> > The Vim option 'fileencodings' has some limitations: e.g., it
> > cannot autodetect GBK and Big5 files at the same time. That was my
> > first motivation to develop a solution for it. It has two parts: a
> > generic C++ program to decide the encoding of a file, and a Vim
> > plugin to use this program.
> >
> > The program tellenc tells the encoding of a file according to the
> > following:
> >
> > - Presence of any BOM character: the Unicode encoding of the BOM
>
> Don't forget to test UTF-32 before UTF-16 because of the ambiguity
> between FF FE 00 00 (UTF-32LE) and FF FE (UTF-16LE).

Yes, it is done that way.

> > - Absence of non-ASCII characters: ascii
> > - UTF-8 decodable: utf-8
> > - Uneven distribution of NULs in odd and even positions of the
> >   file: utf-16(le)
> > - Strange characters and not a Unicode encoding decided above:
> >   binary
>
> Hm, yes, maybe a sufficiently high proportion of bytes in the range
> 00-1F other than carriage-return and line-feed.

Currently I test for 0x00 (NUL), 0x1A (DOS/Windows EOF), 0x7F, and 0xFF.

> > - Most high character followed by a low character: latin1
>
> This may depend on the language: IIUC, the sequences ää öö are very
> common in Finnish, çà is a valid French word (as in: çà et là),
> "paragraphs" is commonly abbreviated to §§, etc. Also, some high
> characters may be repeated for line-drawing or underlining purposes
> (I underline the main title with ÷÷÷ in the files where I want to
> enforce a Latin1 'fileencoding'). But I suppose that in general it is
> true. If I were you, I would try to find some Finnish text in Latin1
> to check the validity of this part of the algorithm. (Maybe get some
> pages of fi.wikipedia.org and make sure to store them locally in
> Latin1, not in UTF-8.) -- Or maybe disregard repeated characters,
> which would take care of Finnish, of §§, and of underlining; just
> leave some margin of error for sequences like French çà etc.

Random French text passed the test, but random Finnish text failed
(got "unknown").
It seems ää occurs really often in Finnish text.

> Also, maybe refine it according to: latin1 if there are no bytes in
> the range 80-9F, otherwise Windows-1252.

Good point.

> > - Frequency analysis of DBCS characters: gbk (gb2312) and big5
> > - Otherwise: unknown
> >
> > I believe the frequency analysis can be applied at least to
> > Japanese and Korean, but I do not know the languages and have no
> > data. If you are Japanese or Korean, you may want to use
> > "tellenc -v" on your text files and come up with some useful data
> > to put into the program. Patches are welcome, though I admit it is
> > not well commented or documented now: given enough interest, I will
> > refactor and enhance the program as need be.
>
> I suppose Japanese and Korean text can be got from the web, either
> from the respective Wikipedias or from newspaper sites. gvim can, I
> suppose, convert the text from the encoding mentioned in the web
> page's HTTP headers to UTF-8 and to the other encodings common for
> that language. The Han characters and national phonograms used in
> both languages should be easily distinguishable from gibberish (when
> looked at with a proper font, of course), even to someone who doesn't
> know the language, so I expect that a wrong encoding would give the
> page an obviously wrong look.

I can certainly do this, but I believe a native user may do it better.
Certainly it is some area I can work on if no others volunteer.

> > This script MultiEnc.vim does these things to decide the encoding
> > of a file:
> >
> > - If a file has a modeline "fileencoding=...", it will be used as
> >   the encoding to open the file.
> > - If a file is an HTML file, and it has the encoding specified with
> >   an HTTP-EQUIV meta tag, it will be used as the encoding to open
> >   the file. The file pattern of HTML files can be customized by the
> >   global variable multienc_html_patterns.
> > - If a file cannot be decided by the steps above, tellenc may be
> >   used to decide its encoding.
> >   This includes HTML files without a suitable HTTP-EQUIV meta tag,
> >   and additional files can be detected with the global variable
> >   multienc_auto_patterns.
> > - A file can be manually autodetected with the command
> >   EditAutoEncoding (without a file name for the current buffer, or
> >   with a file name to edit a new file).
> > - The autodetection may be overridden with the command
> >   EditManualEncoding ("e ++enc=" may not work in some cases now).
> >
> > The program used to tell the encoding of a file is tellenc by
> > default. It can also be changed with the environment variable
> > MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:
> >
> > [...]

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/
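Yongwei's "strange characters" test above looks for the bytes 0x00
(NUL), 0x1A (DOS/Windows EOF), 0x7F, and 0xFF. A minimal sketch of such
a binary check (the single-occurrence trigger is a simplification for
illustration; a real detector would run this only after the Unicode
checks have failed):

```cpp
#include <cstddef>

// Flag a buffer as binary if it contains any of the bytes Yongwei
// mentions testing for: NUL, DOS EOF (0x1A), DEL (0x7F), or 0xFF.
// These rarely occur in plain text in any common text encoding.
bool looks_binary(const unsigned char* buf, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char c = buf[i];
        if (c == 0x00 || c == 0x1A || c == 0x7F || c == 0xFF)
            return true;
    }
    return false;
}
```

Note that this must come after the BOM and UTF-16 checks, since valid
UTF-16 text is full of NUL bytes.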
Re: More than 'fileencodings': MultiEnc.vim and TellEnc
Hello Yongwei,

try FencView.vim
http://www.vim.org/scripts/script.php?script_id=1708

Saturday, February 24, 2007, 11:31:40 PM, you wrote:

> The Vim option 'fileencodings' has some limitations: e.g., it cannot
> autodetect GBK and Big5 files at the same time. That was my first
> motivation to develop a solution for it. It has two parts: a generic
> C++ program to decide the encoding of a file, and a Vim plugin to use
> this program.
>
> The program tellenc tells the encoding of a file according to the
> following:
>
> - Presence of any BOM character: the Unicode encoding of the BOM
> - Absence of non-ASCII characters: ascii
> - UTF-8 decodable: utf-8
> - Uneven distribution of NULs in odd and even positions of the file:
>   utf-16(le)
> - Strange characters and not a Unicode encoding decided above: binary
> - Most high character followed by a low character: latin1
> - Frequency analysis of DBCS characters: gbk (gb2312) and big5
> - Otherwise: unknown
>
> I believe the frequency analysis can be applied at least to Japanese
> and Korean, but I do not know the languages and have no data. If you
> are Japanese or Korean, you may want to use "tellenc -v" on your text
> files and come up with some useful data to put into the program.
> Patches are welcome, though I admit it is not well commented or
> documented now: given enough interest, I will refactor and enhance
> the program as need be.
>
> This script MultiEnc.vim does these things to decide the encoding of
> a file:
>
> - If a file has a modeline "fileencoding=...", it will be used as the
>   encoding to open the file.
> - If a file is an HTML file, and it has the encoding specified with
>   an HTTP-EQUIV meta tag, it will be used as the encoding to open the
>   file. The file pattern of HTML files can be customized by the
>   global variable multienc_html_patterns.
> - If a file cannot be decided by the steps above, tellenc may be used
>   to decide its encoding.
>   This includes HTML files without a suitable HTTP-EQUIV meta tag,
>   and additional files can be detected with the global variable
>   multienc_auto_patterns.
> - A file can be manually autodetected with the command
>   EditAutoEncoding (without a file name for the current buffer, or
>   with a file name to edit a new file).
> - The autodetection may be overridden with the command
>   EditManualEncoding ("e ++enc=" may not work in some cases now).
>
> The program used to tell the encoding of a file is tellenc by
> default. It can also be changed with the environment variable
> MULTIENC_TELLENC. A simplistic _vimrc (for Windows) may be like:
>
> --
> " Legacy encoding is the system default encoding
> let g:legacy_encoding=&encoding
> source $VIMRUNTIME/vimrc_example.vim
> source $VIMRUNTIME/mswin.vim
> if has('gui_running')
>   set encoding=utf-8
> else
>   if &termencoding != '' && &termencoding != &encoding
>     let &encoding=&termencoding
>     let &fileencodings='ucs-bom,utf-8,' . &encoding
>   endif
> endif
> " Set default file encoding(s) to the legacy encoding
> exec 'set fileencoding=' . g:legacy_encoding
> let &fileencodings=substitute(
>     \ &fileencodings, '\<default\>', g:legacy_encoding, '')
> " File patterns of files for automatic encoding detection
> let multienc_auto_patterns='*.txt,*.tex'
> let multienc_html_patterns='*.htm{l\=},*.asp'
> --
>
> It is currently only tested on Windows. While I believe it should
> work on other platforms as well, there might be things I missed.
> Patches and bug reports are welcome.
>
> MultiEnc.vim is available at:
>   http://www.vim.org/scripts/script.php?script_id=1806
>
> Tellenc is available at:
>   http://wyw.dcweb.cn/download.asp?path=file=tellenc.zip
>
> Thank Tony and Benji for encouraging me to make it into a separate
> script.
>
> A question for Bram: Any way to extend Vim with DLLs? Starting an
> external program with system(...) is sometimes slow on Windows, and
> there will be a flashing command window, which is visible in some
> cases, esp. on slower machines.
> Best regards,
>
> Yongwei

--
Best regards,
mbbill                          mailto:[EMAIL PROTECTED]
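The GBK/Big5 frequency analysis quoted above has to start from a
structural scan of candidate double-byte pairs before any frequency
table is consulted. The sketch below only does that structural part,
counting pairs valid in each encoding's lead/trail byte ranges
(standard GBK and Big5 ranges, without HKSCS extensions); it is an
illustration of the scan's shape, not tellenc's actual frequency code:

```cpp
#include <cstddef>

// Counts of two-byte sequences that are structurally valid in GBK
// and in Big5.  A large imbalance hints at the encoding; a real
// detector would additionally weight frequent characters.
struct DbcsCounts { std::size_t gbk_ok; std::size_t big5_ok; };

DbcsCounts count_dbcs_pairs(const unsigned char* buf, std::size_t len)
{
    DbcsCounts c = { 0, 0 };
    for (std::size_t i = 0; i + 1 < len; ++i) {
        unsigned char lead = buf[i], trail = buf[i + 1];
        if (lead < 0x81)
            continue;                     // not a DBCS lead byte
        // GBK: lead 81-FE, trail 40-FE excluding 7F
        if (lead <= 0xFE && trail >= 0x40 && trail <= 0xFE && trail != 0x7F)
            ++c.gbk_ok;
        // Big5: lead A1-F9, trail 40-7E or A1-FE
        if (lead >= 0xA1 && lead <= 0xF9 &&
            ((trail >= 0x40 && trail <= 0x7E) ||
             (trail >= 0xA1 && trail <= 0xFE)))
            ++c.big5_ok;
        ++i;                              // consume the trail byte
    }
    return c;
}
```

Because the two ranges overlap heavily, this alone cannot separate the
encodings for most real text, which is exactly why tellenc falls back
on frequency analysis of common characters.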
Re: More than 'fileencodings': MultiEnc.vim and TellEnc
Yongwei Wu wrote:
[...]
> Random French text passed the test, but random Finnish text failed
> (got "unknown"). It seems ää occurs really often in Finnish text.
[...]

Yes indeed: e.g. "Hello" (word for word: "Good day") is "hyvää päivää"
in Finnish. One of the few phrases I know in that language. ;-)

That's one of the reasons I suggested altering the test to: few or no
sequences of 2 or more _different_ high characters: Latin1 if no bytes
in the range 80-9F, otherwise Windows-1252.

Best regards,
Tony.

--
Q: Why did the tachyon cross the road?
A: Because it was on the other side.
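Tony's refined test, which tolerates repeated high bytes (Finnish "ää",
underlining "÷÷÷") and leaves a margin for rare pairs like French "çà",
can be sketched like this (the margin value is an illustrative
assumption, not tellenc's actual parameter):

```cpp
#include <cstddef>

// True when the buffer contains few or no runs of two or more
// *different* high (>= 0x80) bytes, which suggests a single-byte
// western encoding rather than UTF-8 or a DBCS encoding.
bool few_mixed_high_runs(const unsigned char* buf, std::size_t len,
                         std::size_t margin = 2)
{
    std::size_t mixed = 0;
    for (std::size_t i = 0; i + 1 < len; ++i)
        if (buf[i] >= 0x80 && buf[i + 1] >= 0x80 && buf[i] != buf[i + 1])
            ++mixed;                // two different high bytes in a row
    return mixed <= margin;
}
```

UTF-8 text fails this test quickly, because every non-ASCII character
there is a run of two or more different high bytes (e.g. ä is C3 A4),
while Latin-1 Finnish passes since its doubled vowels repeat one byte.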