(patches) fixed 2 bugs in multi-byte

Dominique Pelle Sun, 06 Apr 2008 11:36:34 -0700

Bug#1:

Valgrind memory checker detects a bug when using utf-8 locale
and when loading a file which contains only 1 byte (0xfc).


Bug is easy to reproduce:

1/ make sure locale is utf-8

2/ create a file with only 1 byte 0xfc:

   $ perl -e 'print chr(0xfc)' > foo.txt

   Note that this is clearly not a valid utf-8 file.

3/ open the file with vim and valgrind:

   $ valgrind vim -u NONE foo.txt 2> vg.log

4/ notice that valgrind complains immediately as file is being read:

==15575== Conditional jump or move depends on uninitialised value(s)
==15575==    at 0x811F6FD: utf_head_off (mbyte.c:2494)
==15575==    by 0x80C32B0: readfile (fileio.c:1880)
==15575==    by 0x80527AE: open_buffer (buffer.c:130)
==15575==    by 0x80E7F82: create_windows (main.c:2459)
==15575==    by 0x80E5C84: main (main.c:799)

5/ notice as well that vim has set file encoding to utf-8 even
   though it is not a valid utf-8 file.

   I would expect fenc to be set automatically to latin1,
   and not utf-8.

   ':set fenc?'          shows: fileencoding=utf-8
   ':set fileencodings?" shows: fileencodings=ucs-bom,utf-8,latin1


Looking at the code, I see that function utf_head_off(base, p)
reads 1 to n bytes ahead p at line 2494:

    2474     int
    2475 utf_head_off(base, p)
    2476     char_u      *base;
    2477     char_u      *p;
    2478 {
    2479     char_u      *q;
    2480     char_u      *s;
    2481     int         c;
    2482 #ifdef FEAT_ARABIC
    2483     char_u      *j;
    2484 #endif
    2485
    2486     if (*p < 0x80)              /* be quick for ASCII */
    2487         return 0;
    2488
    2489     /* Skip backwards over trailing bytes: 10xx.xxxx
    2490      * Skip backwards again if on a composing char. */
    2491     for (q = p; ; --q)
    2492     {
    2493         /* Move s to the last byte of this char. */
!!! 2494         for (s = q; (s[1] & 0xc0) == 0x80; ++s)
    2495             ;
    2496         /* Move q to the first byte of this char. */
    2497         while (q > base && (*q & 0xc0) == 0x80)
    2498             --q;


Line 2494 is accessing s[1] (i.e. 'p[1]').  But only
p[0] is initialized since readfile() only read one byte
from file, so p[1] and beyond are uninitialized.

Loop at line 2494 can in theory access many other bytes
beyond the end of the buffer since exit condition
of loop depends on uninitialized memory.

I think adding a NUL after what is read from file fixes
this bug.  See attached patch but please review it.

However, it does not fix the fact that fenc is set to
utf-8 even though it is an invalid utf-8 file.


Bug #2:

Using the same input file, I can see another bug:

1/ yank the first line:  Y
   This copies the invalid utf-8 into the unnamed "" register

2/ copy the unnamed "" register onto the Ex line    :<CTRL-R>"

3/ observe the following error when copying "" register:

==6484==    at 0x80BAEBD: cmdline_paste_str (ex_getln.c:3028)
==6484==    by 0x8132B12: cmdline_paste_reg (ops.c:1528)
==6484==    by 0x80BADD2: cmdline_paste (ex_getln.c:3009)
==6484==    by 0x80B8029: getcmdline (ex_getln.c:1131)
==6484==    by 0x80B96C7: getexline (ex_getln.c:2100)
==6484==    by 0x80A31FA: do_cmdline (ex_docmd.c:995)
==6484==    by 0x8129F9A: nv_colon (normal.c:5179)
==6484==    by 0x8123556: normal_cmd (normal.c:1152)
==6484==    by 0x80E631D: main_loop (main.c:1181)
==6484==    by 0x80E5E6D: main (main.c:940)
==6484==  Address 0x51519C2 is 0 bytes after a block of size 2 alloc'd
==6484==    at 0x4022765: malloc (vg_replace_malloc.c:149)
==6484==    by 0x8113B6C: lalloc (misc2.c:862)
==6484==    by 0x8113A8E: alloc (misc2.c:761)
==6484==    by 0x8113EFB: vim_strsave (misc2.c:1166)
==6484==    by 0x81355CD: op_yank (ops.c:2901)
==6484==    by 0x8124AC6: do_pending_operator (normal.c:1890)
==6484==    by 0x81235D1: normal_cmd (normal.c:1178)
==6484==    by 0x80E631D: main_loop (main.c:1181)
==6484==    by 0x80E5E6D: main (main.c:940)

    3018     void
    3019 cmdline_paste_str(s, literally)
    3020     char_u      *s;
    3021     int         literally;
    3022 {
    3023     int         c, cv;
    3024
    3025     if (literally)
    3026         put_on_cmdline(s, -1, TRUE);
    3027     else
!!! 3028         while (*s != NUL)
    3029         {
    3030             cv = *s;
    3031             if (cv == Ctrl_V && s[1])
    3032                 ++s;
    3033 #ifdef FEAT_MBYTE
    3034             if (has_mbyte)
    3035             {
!!! 3036                 c = mb_ptr2char(s);
!!! 3037                 s += mb_char2len(c);
    3038             }
    3039             else
    3040 #endif

Call to mb_ptr2char(s) at line 3036 consumes only 1 byte
because s is an invalid utf-8 sequence and returns the
first byte which is 0xfc.   Calls to mb_char2len(0xfc)
at next line increments s by 2 bytes instead of 1 (!!)
because it takes 2 bytes to store character u+00fc in
utf-8 (which is inconsistent with the 1 byte consumed
by previous line).

I think calling 'c = mb_cptr2char_adv(&s);' to increment s
fixes the problem as in attached patch but please review it.

-- Dominique

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Index: fileio.c
===================================================================
RCS file: /cvsroot/vim/vim7/src/fileio.c,v
retrieving revision 1.114
diff -c -r1.114 fileio.c
*** fileio.c	11 Mar 2008 21:01:41 -0000	1.114
--- fileio.c	6 Apr 2008 18:14:14 -0000
***************
*** 1279,1284 ****
--- 1279,1291 ----
  		     * Read bytes from the file.
  		     */
  		    size = vim_read(fd, ptr, size);
+ 
+ #ifdef FEAT_MBYTE
+ 		    /* null terminate what has been read or else call to
+ 		     * utf_head_off() may access uninitialized bytes
+ 		     * beyond size. */
+ 		    ptr[size] = NUL;
+ #endif
  		}
  
  		if (size <= 0)
***************
*** 2424,2430 ****
  	/*
  	 * Work around a weird problem: When a file has two links (only
  	 * possible on NTFS) and we write through one link, then stat() it
! 	 * throught the other link, the timestamp information may be wrong.
  	 * It's correct again after reading the file, thus reset the timestamp
  	 * here.
  	 */
--- 2431,2437 ----
  	/*
  	 * Work around a weird problem: When a file has two links (only
  	 * possible on NTFS) and we write through one link, then stat() it
! 	 * through the other link, the timestamp information may be wrong.
  	 * It's correct again after reading the file, thus reset the timestamp
  	 * here.
  	 */
***************
*** 5908,5914 ****
  #endif
  
      /*
!      * Append the extention.
       * ext can start with '.' and cannot exceed 3 more characters.
       */
      STRCPY(s, ext);
--- 5915,5921 ----
  #endif
  
      /*
!      * Append the extension.
       * ext can start with '.' and cannot exceed 3 more characters.
       */
      STRCPY(s, ext);

Index: ex_getln.c
===================================================================
RCS file: /cvsroot/vim/vim7/src/ex_getln.c,v
retrieving revision 1.82
diff -c -r1.82 ex_getln.c
*** ex_getln.c	22 Jan 2008 11:44:39 -0000	1.82
--- ex_getln.c	6 Apr 2008 18:12:07 -0000
***************
*** 3032,3041 ****
  		++s;
  #ifdef FEAT_MBYTE
  	    if (has_mbyte)
! 	    {
! 		c = mb_ptr2char(s);
! 		s += mb_char2len(c);
! 	    }
  	    else
  #endif
  		c = *s++;
--- 3032,3038 ----
  		++s;
  #ifdef FEAT_MBYTE
  	    if (has_mbyte)
! 		c = mb_cptr2char_adv(&s);
  	    else
  #endif
  		c = *s++;
***************
*** 3335,3341 ****
  /*
   * Do wildcard expansion on the string 'str'.
   * Chars that should not be expanded must be preceded with a backslash.
!  * Return a pointer to alloced memory containing the new string.
   * Return NULL for failure.
   *
   * "orig" is the originally expanded string, copied to allocated memory.  It
--- 3332,3338 ----
  /*
   * Do wildcard expansion on the string 'str'.
   * Chars that should not be expanded must be preceded with a backslash.
!  * Return a pointer to allocated memory containing the new string.
   * Return NULL for failure.
   *
   * "orig" is the originally expanded string, copied to allocated memory.  It
***************
*** 6099,6105 ****
  
      exmode_active = save_exmode;
  
!     /* Safety check: The old window or buffer was deleted: It's a a bug when
       * this happens! */
      if (!win_valid(old_curwin) || !buf_valid(old_curbuf))
      {
--- 6096,6102 ----
  
      exmode_active = save_exmode;
  
!     /* Safety check: The old window or buffer was deleted: It's a bug when
       * this happens! */
      if (!win_valid(old_curwin) || !buf_valid(old_curbuf))
      {

(patches) fixed 2 bugs in multi-byte

Raspunde prin e-mail lui