patch to fix access beyond allocated buffer in regexp.c

Dominique Pelle Thu, 22 Nov 2007 15:14:19 -0800

Valgrind memory checker detects out-of-bounds memory accesses
when random characters are used in a regular expression.

This happens only when Vim is built with +multi_byte:

==6133== Invalid read of size 1
==6133==    at 0x8159FC2: peekchr (regexp.c:2606)
==6133==    by 0x8159B60: regatom (regexp.c:2331)
==6133==    by 0x8157E16: regpiece (regexp.c:1433)
==6133==    by 0x8157D78: regconcat (regexp.c:1394)
==6133==    by 0x8157B29: regbranch (regexp.c:1309)
==6133==    by 0x8157803: reg (regexp.c:1222)
==6133==    by 0x8157311: vim_regcomp (regexp.c:1019)
==6133==    by 0x8170B5A: search_regcomp (search.c:215)
==6133==    by 0x81713BC: searchit (search.c:531)
==6133==    by 0x81725A3: do_search (search.c:1247)
==6133==    by 0x80A72CC: get_address (ex_docmd.c:3903)
==6133==    by 0x80A4067: do_one_cmd (ex_docmd.c:1965)
==6133==    by 0x80A2966: do_cmdline (ex_docmd.c:1099)
==6133==    by 0x80A2018: do_cmdline_cmd (ex_docmd.c:705)
==6133==    by 0x80E72B8: exe_commands (main.c:2663)
==6133==    by 0x80E4CB2: main (main.c:875)
==6133==  Address 0x4F9FA16 is 0 bytes after a block of size 22 alloc'd
==6133==    at 0x4022765: malloc (vg_replace_malloc.c:149)
==6133==    by 0x8112860: lalloc (misc2.c:857)
==6133==    by 0x8112782: alloc (misc2.c:756)
==6133==    by 0x8112BE5: vim_strsave (misc2.c:1144)
==6133==    by 0x80A27A8: do_cmdline (ex_docmd.c:1029)
==6133==    by 0x80A2018: do_cmdline_cmd (ex_docmd.c:705)
==6133==    by 0x80E72B8: exe_commands (main.c:2663)
==6133==    by 0x80E4CB2: main (main.c:875)

(and many more errors of the same kind)

I can reproduce these errors often (though not on every run) by
searching with '/' for a regex made of a few random characters:

  $ echo test > test.txt
  $ vim -c "/$(head -c 10 /dev/urandom | tee testcase)" test.txt 2> vg.log

For example, the errors happen on every run when the testcase file contains:

  $ od -An -x testcase
  6d8d 7aa2 4e33 8214 f492

The bug happens when the regex ends with an invalid (or incomplete)
UTF-8 sequence.

A regex containing invalid UTF-8 sequences is not supposed to occur
in normal use, so this is a low priority bug.  However, reading a
buffer out of bounds should not happen, even with a garbage regex.

I can see why it happens:

regexp.c:

2331             for (len = 0; c != NUL && (len == 0
2332                         || (re_multi_type(peekchr()) == NOT_MULTI
2333                             && !one_exactly
2334                             && !is_Magic(c))); ++len)
2335             {
2336                 c = no_Magic(c);
2337 #ifdef FEAT_MBYTE
2338                 if (has_mbyte)
2339                 {
2340                     regmbc(c);
2341                     if (enc_utf8)
2342                     {
2343                         int     l;
2344
2345                         /* Need to get composing character too. */
2346                         for (;;)
2347                         {
2348                             l = utf_ptr2len(regparse);
2349                             if (!UTF_COMPOSINGLIKE(regparse, regparse + l))
2350                                 break;
2351                             regmbc(utf_ptr2char(regparse));
2352                             skipchr();
2353                         }
2354                     }
2355                 }
2356                 else
2357 #endif
2358                     regc(c);
2359                 c = getchr();
2360             }

peekchr() at line 2332 reads regparse[] out of bounds.  The regparse
pointer ends up 1 byte past the terminating NUL, that is, just past
the end of the allocated buffer.  This happens because getchr() at
line 2359 calls skipchr(), which computes a length of 2 at line 2773
and adds it to regparse at line 2780 even though only 1 byte remains
before the NUL:

regexp.c:

2761     static void
2762 skipchr()
2763 {
2764     /* peekchr() eats a backslash, do the same here */
2765     if (*regparse == '\\')
2766         prevchr_len = 1;
2767     else
2768         prevchr_len = 0;
2769     if (regparse[prevchr_len] != NUL)
2770     {
2771 #ifdef FEAT_MBYTE
2772         if (enc_utf8)
2773             prevchr_len += utf_char2len(mb_ptr2char(regparse + prevchr_len));
2774         else if (has_mbyte)
2775             prevchr_len += (*mb_ptr2len)(regparse + prevchr_len);
2776         else
2777 #endif
2778             ++prevchr_len;
2779     }
2780     regparse += prevchr_len;
2781     prev_at_start = at_start;

At line 2773, the call to mb_ptr2char (a function pointer which points
to utf_ptr2char() when the encoding is UTF-8) detects an invalid or
incomplete UTF-8 sequence and in that case returns just the first byte.
When this first byte is >= 0x80, the call to utf_char2len(), also at
line 2773, returns 2, so 2 bytes are consumed.  Adding 2 to regparse
at line 2780 can make regparse go 1 byte beyond the end of the regex
string.

I attach a patch which fixes it.  Perhaps there is a better way of
fixing it.

I'm using vim-7.1 with patches 1-156, on Linux, built without
optimizations (-O0), configured with "configure --with-features=huge".

-- Dominique

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Index: regexp.c
===================================================================
RCS file: /cvsroot/vim/vim7/src/regexp.c,v
retrieving revision 1.44
diff -c -r1.44 regexp.c
*** regexp.c	11 Aug 2007 11:58:14 -0000	1.44
--- regexp.c	22 Nov 2007 23:07:50 -0000
***************
*** 582,587 ****
--- 582,588 ----
   */
  
  static char_u	*regparse;	/* Input-scan pointer. */
+ static char_u	*endregparse;	/* end of input-scan pointer. */
  static int	prevchr_len;	/* byte length of previous char */
  static int	num_complex_braces; /* Complex \{...} count */
  static int	regnpar;	/* () count. */
***************
*** 2590,2595 ****
--- 2591,2597 ----
      char_u *str;
  {
      regparse = str;
+     endregparse = str + STRLEN(str);
      prevchr_len = 0;
      curchr = prevprevchr = prevchr = nextchr = -1;
      at_start = TRUE;
***************
*** 2778,2783 ****
--- 2780,2790 ----
  	    ++prevchr_len;
      }
      regparse += prevchr_len;
+ 
+     /* don't let regparse go beyond endregparse (which could happen when
+      * the regex ends with an invalid utf-8 sequence) */
+     if (regparse > endregparse)
+         regparse = endregparse;
      prev_at_start = at_start;
      at_start = FALSE;
      prevprevchr = prevchr;
