Auto-guessing file encoding and integration with Vim (works for Latin1, GBK, and Big5 now)

Yongwei Wu Fri, 06 Oct 2006 09:37:43 -0700

This is a report of what I have already achieved. If you are dealing
with more encodings than the fileencodings option can handle, esp. if
you read and write Simplified and Traditional Chinese, please read on.


First, you need to have some external program to guess the encoding of
a text file. For my own purpose, I wrote tellenc.cpp, which can
differentiate between binary, ASCII, Latin1, GB2312, GBK, and Big5.
This is enough for me. If it is enough for you, fine; if not, you need
to write your own program or modify mine. My method works
approximately as follows:

* If a file contains 0x00, 0x1A, 0x7F, or 0xFF, it is regarded as binary.
* If a file contains none of the above, and all bytes are less than
0x7F, it is ASCII.
* Otherwise, each byte greater than 0x7F is regarded as the first byte
of a double-byte sequence, and the frequencies of these sequences are
collected. GB and Big5 are decided by checking whether the most
frequent double-byte character is in my list of the most common
characters. Latin1 is decided by checking whether, in most cases, the
byte following a byte greater than 0x7F is less than 0x7F.
* If none of the patterns matches, the encoding is unknown.
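
For readers who prefer code to prose, the steps above can be sketched
roughly as follows. This is not the actual tellenc.cpp, only a minimal
illustration of the same idea. The function name guess_encoding, the
returned labels, the order in which GB, Big5, and Latin1 are checked,
and the "more than half" threshold for Latin1 are all my assumptions.
The two common-character lists are passed in as parameters because the
real lists are not reproduced here. It needs a C++11 compiler.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

// A double-byte sequence: (lead byte, following byte).
using DoubleByte = std::pair<unsigned char, unsigned char>;

// Sketch of the heuristic described above. common_gb and common_big5 are
// hypothetical stand-ins for the "most common character" lists.
std::string guess_encoding(const std::string& data,
                           const std::set<DoubleByte>& common_gb,
                           const std::set<DoubleByte>& common_big5)
{
    // Step 1 and 2: binary markers, then pure ASCII.
    bool has_high = false;
    for (unsigned char c : data) {
        if (c == 0x00 || c == 0x1A || c == 0x7F || c == 0xFF)
            return "binary";
        if (c > 0x7F)
            has_high = true;
    }
    if (!has_high)
        return "ascii";

    // Step 3: treat every byte > 0x7F as the lead byte of a pair and
    // count how often each pair occurs.
    std::map<DoubleByte, std::size_t> freq;
    std::size_t pairs = 0;            // number of pairs seen
    std::size_t followed_by_low = 0;  // pairs whose second byte is < 0x7F
    for (std::size_t i = 0; i + 1 < data.size(); ++i) {
        unsigned char first = data[i];
        if (first <= 0x7F)
            continue;
        unsigned char second = data[i + 1];
        ++pairs;
        ++freq[DoubleByte(first, second)];
        if (second < 0x7F)
            ++followed_by_low;
        ++i;  // the second byte belongs to this pair; skip it
    }
    if (freq.empty())
        return "unknown";  // e.g. a lone high byte at the very end

    // Find the most frequent pair.
    DoubleByte top;
    std::size_t top_count = 0;
    for (const auto& entry : freq) {
        if (entry.second > top_count) {
            top = entry.first;
            top_count = entry.second;
        }
    }

    // Assumed order: check the CJK lists first, then the Latin1 pattern.
    if (common_gb.count(top))
        return "gbk";
    if (common_big5.count(top))
        return "big5";
    if (followed_by_low * 2 > pairs)  // "most cases" taken as more than half
        return "latin1";
    return "unknown";  // Step 4: no pattern matched
}
```

One would read the whole file into a std::string in binary mode and
pass it in together with the two lists; the quality of the guess
depends almost entirely on how good those lists are.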

So most ISO-8859-x files are regarded as latin1, UTF-8 files as
unknown, and UTF-16/UTF-32 as binary. UTF-x should be well handled by
Vim already, and I really know next to nothing about ISO-8859-x
encodings other than Latin1. So it is good enough for me. In fact, I
have not yet found a false detection among *my files* so far.

The file tellenc can be downloaded from
<URL:http://wyw.dcweb.cn/#download>. Source and Win32 binary are
included. The Win32 binary was built with MSVC 6 + STLport 4.5.1.
Among the fastest performing executables that depend only on
MSVCRT.DLL and KERNEL32.DLL, this combination gives me the smallest
size as well. If you are interested, the command line used is: cl
/D_STLP_NO_IOSTREAMS /Ox /GX /G6 /Gr /MD tellenc.cpp /link
/opt:nowin98.

Now back to Vim. I'll give the smallest changes needed in _vimrc (or
.vimrc). My real _vimrc is more complicated, since I have several
different ways to detect encodings. See the above link to check my
_vimrc in detail, if you are interested.

First, one needs to know the legacy encoding on one's system, which is
generally the most frequently used non-Unicode encoding, and which Vim
falls back to when the encoding cannot be accurately determined.

 if has('multi_byte')
   " Legacy encoding is the system default encoding
   let s:legacy_encoding=&encoding
 endif

After that, one can switch the encoding to UTF-8 to get multi-encoding support.

 set encoding=utf-8

A function to detect the encoding (iconv() is necessary to handle file
names that contain non-ASCII characters):

 function! EditAutoEncoding(...)
   if g:disable_encodingdetection || !has('iconv')
     return
   endif
   if a:0 > 1
     echoerr 'Only one file name should be supplied'
     return
   endif
   if a:0 == 1
     let filename=iconv(a:1, &encoding, s:legacy_encoding)
     let filename_e=' ' . a:1
   else
     let filename=iconv(expand('%:p'), &encoding, s:legacy_encoding)
     let filename_e=''
   endif
   if a:0 == 1
     try
       let g:disable_encodingdetection=1
       exec 'e' . filename_e
     finally
       let g:disable_encodingdetection=0
     endtry
   endif
   if &fileencoding != s:legacy_encoding
     return
   endif
   let result=system('tellenc "' . filename . '"')     " system specific
   let result=substitute(result, '\n$', '', '')
   if v:shell_error != 0
     echo iconv(result, s:legacy_encoding, &encoding)
     return
   endif
   if result =~ '^gb'
     let result='cp936'                                " system specific
   endif
   if result != s:legacy_encoding
     if result == 'binary'
       echo 'Binary file'
       sleep 2
     elseif result == 'unknown'
       echo 'Unknown encoding'
       sleep 2
     else
       try
         let g:disable_encodingdetection=1
         exec 'e ++enc=' . result . filename_e
       finally
         let g:disable_encodingdetection=0
       endtry
     endif
   endif
 endfunction

It can be globally disabled if one executes

 let g:disable_encodingdetection=1

And this line is needed in _vimrc to set the initial state:

 let g:disable_encodingdetection=0

A command is defined to use it more quickly:

 command -nargs=* -complete=file EditAutoEncoding call
                               \ EditAutoEncoding(<f-args>)

Want automatic detection on opening a file? Add something like

 " Detect file encoding based on content
 au BufReadPost *.txt nested       call EditAutoEncoding()
 au BufReadPost *.tex nested       call EditAutoEncoding()

Or simply

 au BufReadPost * nested       call EditAutoEncoding()

(If you do not want `nested', you can alternatively add `syntax on' to
the function. I use `nested' since I have other autocommands that
interfere with this one.)

If you use the autocommands, `e ++enc' no longer works well for the
`legacy encoding'. I have not found a way to distinguish an encoding
chosen through fileencodings from one given with ++enc. The
work-around is to use the variable g:disable_encodingdetection--and
that is the reason for some of the complexities in that function. It
should be automated too:

 function! EditManualEncoding(enc, ...)
   if a:0 > 1
     echoerr 'Only one file name should be supplied'
     return
   endif
   if a:0 == 1
     let filename = a:1
   else
     let filename = ''
   endif
   try
     let g:disable_encodingdetection=2
     exec 'e ++enc=' . a:enc . ' ' . filename
   finally
     let g:disable_encodingdetection=0
   endtry
 endfunction

 command -nargs=+ -complete=file EditManualEncoding call
                               \ EditManualEncoding(<f-args>)

The most difficult part for me is working out all the interactions
between the different detection methods, and specifying the right
precedence. In my _vimrc, I currently have the following precedence:

Suffix detection < Tellenc detection < HTML meta tag detection <
Modeline specification < EditManualEncoding

I hope it is helpful. Feedback will be appreciated.

Best regards,

Yongwei
--
Wu Yongwei
URL: http://wyw.dcweb.cn/
