Re: [patch] fixed 3 bugs with invalid utf-8 sequences

Dominique Pelle Mon, 30 Jun 2008 14:21:38 -0700

On Mon, Jun 30, 2008 at 9:52 AM, Dominique Pelle wrote:

> On Sun, Jun 29, 2008 at 11:43 PM, Dominique Pelle wrote:
>
>> On Sun, Jun 29, 2008 at 4:10 PM, Bram Moolenaar wrote:
>>
>>> Dominique Pelle wrote:
>>>
>>>> Here is another bug because of invalid utf-8 sequence:
>>>>
>>>> ==21329== Conditional jump or move depends on uninitialised value(s)
>>>> ==21329==    at 0x811C4C7: utf_ptr2char (mbyte.c:1390)
>>>> ==21329==    by 0x811C83F: utf_composinglike (mbyte.c:1498)
>>>> ==21329==    by 0x811CD96: utfc_ptr2len_len (mbyte.c:1756)
>>>> ==21329==    by 0x80FEAF1: msg_outtrans_len_attr (message.c:1355)
>>>> ==21329==    by 0x80FE9D0: msg_outtrans_len (message.c:1292)
>>>> ==21329==    by 0x80B9D7F: draw_cmdline (ex_getln.c:2641)
>>>> ==21329==    by 0x80BAA99: redrawcmd (ex_getln.c:3126)
>>>> ==21329==    by 0x80BAF82: nextwild (ex_getln.c:3317)
>>>> ==21329==    by 0x80B6D3B: getcmdline (ex_getln.c:803)
>>>> ==21329==    by 0x80B912E: getexline (ex_getln.c:2101)
>>>> ==21329==    by 0x80A33FC: do_cmdline (ex_docmd.c:991)
>>>> ==21329==    by 0x81280DD: nv_colon (normal.c:5181)
>>>> ==21329==    by 0x8121772: normal_cmd (normal.c:1152)
>>>> ==21329==    by 0x80E4E8A: main_loop (main.c:1177)
>>>> ==21329==    by 0x80E49DA: main (main.c:936)
>>>>
>>>> ==21434== Conditional jump or move depends on uninitialised value(s)
>>>> ==21434==    at 0x811C4C7: utf_ptr2char (mbyte.c:1390)
>>>> ==21434==    by 0x811C28C: utf_ptr2cells (mbyte.c:1268)
>>>> ==21434==    by 0x805D084: ptr2cells (charset.c:746)
>>>> ==21434==    by 0x80FEC65: msg_outtrans_len_attr (message.c:1394)
>>>> ==21434==    by 0x80FE9D0: msg_outtrans_len (message.c:1292)
>>>> ==21434==    by 0x80B9D7F: draw_cmdline (ex_getln.c:2641)
>>>> ==21434==    by 0x80BAA99: redrawcmd (ex_getln.c:3126)
>>>> ==21434==    by 0x80BAF82: nextwild (ex_getln.c:3317)
>>>> ==21434==    by 0x80B6D3B: getcmdline (ex_getln.c:803)
>>>> ==21434==    by 0x80B912E: getexline (ex_getln.c:2101)
>>>> ==21434==    by 0x80A33FC: do_cmdline (ex_docmd.c:991)
>>>> ==21434==    by 0x81280DD: nv_colon (normal.c:5181)
>>>> ==21434==    by 0x8121772: normal_cmd (normal.c:1152)
>>>> ==21434==    by 0x80E4E8A: main_loop (main.c:1177)
>>>> ==21434==    by 0x80E49DA: main (main.c:936)
>>>>
>>>> It's easy to reproduce:
>>>>
>>>> 1/ Create a file names with invalid utf-8 name:
>>>>
>>>>    $ mkdir testcase
>>>>    $ touch testcase/$(perl -e 'print chr(0xfb)')
>>>>
>>>> 2/ Start Vim in 'no compatible' mode with Valgrind:
>>>>
>>>>    $ valgrind vim -u NONE -c -N 2> vg.log
>>>>
>>>> 4/ In ex mode, enter:
>>>>
>>>>    :e testcase/<TAB>
>>>>
>>>>    When pressing <TAB> key (file completion), vim should display
>>>>    file name with invalid utf-8 sequence 'testcase/<fb>'
>>>>    and observe in vg.log that valgrind reports above errors.
>>>>
>>>> The 2 error messages are actually 2 distinct bugs.
>>>>
>>>> - First error: in mbyte.c:1756, calling
>>>>   UTF_COMPOSINGLIKE(p + prevlen, p + len) is unsafe, since it
>>>>   can read bytes beyond size.  So in theory, there is a small risk
>>>>   that vim combines characters when it should not.
>>>>
>>>>   Attached patch "fix1-invalid-utf8-seq.patch" fixes this first
>>>>   error, by calling utf_ptr2len_len(p + len, size - len) before
>>>>   UTF_COMPOSINGLIKE(...) to ensure that UTF_COMPOSINGLIKE(...)
>>>>   can't access beyond size.
>>>>
>>>> - Second error: in message.c:1394, calling ptr2cells(str) is
>>>>   unsafe since it can read bytes in str beyond len (which are
>>>>   uninitialized).  At line 1394, we know that the character is
>>>>   only 1 byte long (mb_l is 1) so it should call char2cells(*str).
>>>>   mb_l is 1 here because of incomplete utf-8 sequence so calling
>>>>   ptr2cells(str) overflows.
>>>>
>>>>   Attached patch "fix2-invalid-utf8-seq.patch" fixes it.
>>>>
>>>> After fixing these 2 bugs, valgrind no longer complains with
>>>> the above testcase.
>>>>
>>>> However, there is a 3rd bug which can be triggered with a slightly
>>>> different test case, using file <ff> instead of <fb>.  The difference
>>>> is that <ff> is an invalid utf-8 sequence, whereas <fb> was an
>>>> incomplete utf-8 sequence:
>>>>
>>>> 1/ Create a file names with invalid utf-8 names:
>>>>
>>>>    $ mkdir testcase
>>>>    $ touch testcase/$(perl -e 'print chr(0xff)')
>>>>
>>>> 2/ Start Vim in 'no compatible' mode with Valgrind:
>>>>
>>>>    $ valgrind vim -u NONE -c -N 2> vg.log
>>>>
>>>> 3/ In ex mode, enter:
>>>>
>>>>    :e testcase/<TAB><TAB>
>>>>
>>>>    When pressing <TAB> key (file completion), vim should display
>>>>    file name with invalid utf-8 sequence 'e: testcase/<fb>' (without
>>>>    valgrind errors after above patches) then when processing <TAB>
>>>>    again, vim displays next file 'e: testcase/<ff>' but observe
>>>>    that valgrind reports following error:
>>>>
>>>> ==22527== Conditional jump or move depends on uninitialised value(s)
>>>> ==22527==    at 0x811C4CB: utf_ptr2char (mbyte.c:1390)
>>>> ==22527==    by 0x811C843: utf_composinglike (mbyte.c:1498)
>>>> ==22527==    by 0x811CDC8: utfc_ptr2len_len (mbyte.c:1769)
>>>> ==22527==    by 0x80FEAF1: msg_outtrans_len_attr (message.c:1355)
>>>> ==22527==    by 0x80FE9D0: msg_outtrans_len (message.c:1292)
>>>> ==22527==    by 0x80B9D7F: draw_cmdline (ex_getln.c:2641)
>>>> ==22527==    by 0x80BAA99: redrawcmd (ex_getln.c:3126)
>>>> ==22527==    by 0x80BAF82: nextwild (ex_getln.c:3317)
>>>> ==22527==    by 0x80B6CC6: getcmdline (ex_getln.c:790)
>>>> ==22527==    by 0x80B912E: getexline (ex_getln.c:2101)
>>>> ==22527==    by 0x80A33FC: do_cmdline (ex_docmd.c:991)
>>>> ==22527==    by 0x81280F5: nv_colon (normal.c:5181)
>>>> ==22527==    by 0x812178A: normal_cmd (normal.c:1152)
>>>> ==22527==    by 0x80E4E8A: main_loop (main.c:1177)
>>>> ==22527==    by 0x80E49DA: main (main.c:936)
>>>>
>>>> This bug happens because utf_ptr2char(p) does not check
>>>> whether first byte of utf-8 sequence is legal.  When p[0] >= 0x80
>>>> and utf8len_tab[p[0]] is 1, then the first byte is illegal, and
>>>> utf_ptr2char(p) should return 1 immediately, without checking p[1].
>>>>
>>>> Attached patch fix3-invalid-utf8-seq.patch fixes this 3rd bug.
>>>>
>>>> After those 3 fixes, I no longer observe errors.
>>>
>>> All three look like good fixes to me.  I'll include them.
>>>
>>>> I hope errors with invalid utf-8 sequences are not too nitpicking.
>>>> I'm not aware of any other such errors at the moment.  So hopefully
>>>> no more patches like this.
>>>
>>> Depending on the system reading uninitialised memory may cause a crash.
>>> And when mixing up latin1 with utf-8 it's easy to get illegal byte
>>> sequences.  So it definitely makes sense to fix all these little
>>> problems.  I'm very glad you find them.
>>
>>
>>
>> Hmmm, just re-looking at my patches again, I think patch #2 may
>> not be quite right.  It was this patch:
>>
>> Index: message.c
>> ===================================================================
>> RCS file: /cvsroot/vim/vim7/src/message.c,v
>> retrieving revision 1.62
>> diff -c -r1.62 message.c
>> *** message.c   28 Jun 2008 14:09:47 -0000      1.62
>> --- message.c   29 Jun 2008 11:12:47 -0000
>> ***************
>> *** 1391,1397 ****
>>                plain_start = str + 1;
>>                msg_puts_attr(s, attr == 0 ? hl_attr(HLF_8) : attr);
>>            }
>> !           retval += ptr2cells(str);
>>            ++str;
>>        }
>>      }
>> --- 1391,1397 ----
>>                plain_start = str + 1;
>>                msg_puts_attr(s, attr == 0 ? hl_attr(HLF_8) : attr);
>>            }
>> !           retval += char2cells(*str);
>>            ++str;
>>        }
>>      }
>>
>>
>> It certainly avoids reading uninitialized memory,
>> but if *str character is an invalid utf-8 (say <ff>
>> for example) then shouldn't retval be incremented
>> by the number of cells 4 because it's an invalid
>> utf-8 sequence?  With patch, retval gets incremented
>> by char2cells(0xff)  which is 1, as if it was a legal
>> latin1 character (latin small letter y with diaeresis
>> has code 0xff).
>>
>> I don't observe anything wrong when using Vim though
>> but that does not mean it's right.  I will try to fix this
>> tomorrow when I have time.
>>
>> -- Dominique
>
>
> Ah, I do observe something wrong because of of the
> bug in patch #2 that I described in my previous comment.
>
> With:
>
> set nocompatible
> set wildmode=longest,list
> set wildmenu
>
> When file names contain invalid utf-8 sequences, I see file names
> incorrectly aligned on screen when doing file completion in ex mode,
> because of incorrect cell count.  See screenshot below:
>
> http://dominique.pelle.free.fr/pic/bug-cells.png
>
> I will try to fix it tonight when I have time.
>
> -- Dominique



Attached patch fixes incorrect cell computation
introduced my previous patch "fix2-invalid-utf8-seq.patch"
while still correcting the original bug (read overflow).

I tested it, but please review it.

I'm not too pleased about the hard-coded 4 cells
for illegal utf-8 sequence, but attached patch minimizes
changes.  More elegant perhaps would be to add a safe
function utf_ptr2cells_len(p, maxlen) which guarantees
to read no more than maxlen bytes.

-- Dominique

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Index: message.c
===================================================================
RCS file: /cvsroot/vim/vim7/src/message.c,v
retrieving revision 1.63
diff -c -r1.63 message.c
*** message.c	29 Jun 2008 14:15:37 -0000	1.63
--- message.c	30 Jun 2008 21:17:53 -0000
***************
*** 1391,1397 ****
  		plain_start = str + 1;
  		msg_puts_attr(s, attr == 0 ? hl_attr(HLF_8) : attr);
  	    }
! 	    retval += char2cells(*str);
  	    ++str;
  	}
      }
--- 1391,1407 ----
  		plain_start = str + 1;
  		msg_puts_attr(s, attr == 0 ? hl_attr(HLF_8) : attr);
  	    }
! #ifdef FEAT_MBYTE
! 	    if (enc_utf8 && *str >= 0x80)
! 		/* Since mb_l is 1, this is an invalid or incomplete utf-8
! 		 * sequence, so number of cells is 4: <xx>. Can't call
! 		 * ptr2cells(str) without guarantee that it does not
! 		 * overflow str[len]. */
! 		retval += 4;
! 	    else
! #endif
! 		retval += ptr2cells(str);
! 
  	    ++str;
  	}
      }

Re: [patch] fixed 3 bugs with invalid utf-8 sequences

Raspunde prin e-mail lui