Dominique Pelle wrote:
> Valgrind memory checker detects out of bounds memory access
> when using random characters in regular expressions.
>
> It happens only when vim is built with +multi_byte
>
> ==6133== Invalid read of size 1
> ==6133== at 0x8159FC2: peekchr (regexp.c:2606)
> ==6133== by 0x8159B60: regatom (regexp.c:2331)
> ==6133== by 0x8157E16: regpiece (regexp.c:1433)
> ==6133== by 0x8157D78: regconcat (regexp.c:1394)
> ==6133== by 0x8157B29: regbranch (regexp.c:1309)
> ==6133== by 0x8157803: reg (regexp.c:1222)
> ==6133== by 0x8157311: vim_regcomp (regexp.c:1019)
> ==6133== by 0x8170B5A: search_regcomp (search.c:215)
> ==6133== by 0x81713BC: searchit (search.c:531)
> ==6133== by 0x81725A3: do_search (search.c:1247)
> ==6133== by 0x80A72CC: get_address (ex_docmd.c:3903)
> ==6133== by 0x80A4067: do_one_cmd (ex_docmd.c:1965)
> ==6133== by 0x80A2966: do_cmdline (ex_docmd.c:1099)
> ==6133== by 0x80A2018: do_cmdline_cmd (ex_docmd.c:705)
> ==6133== by 0x80E72B8: exe_commands (main.c:2663)
> ==6133== by 0x80E4CB2: main (main.c:875)
> ==6133== Address 0x4F9FA16 is 0 bytes after a block of size 22 alloc'd
> ==6133== at 0x4022765: malloc (vg_replace_malloc.c:149)
> ==6133== by 0x8112860: lalloc (misc2.c:857)
> ==6133== by 0x8112782: alloc (misc2.c:756)
> ==6133== by 0x8112BE5: vim_strsave (misc2.c:1144)
> ==6133== by 0x80A27A8: do_cmdline (ex_docmd.c:1029)
> ==6133== by 0x80A2018: do_cmdline_cmd (ex_docmd.c:705)
> ==6133== by 0x80E72B8: exe_commands (main.c:2663)
> ==6133== by 0x80E4CB2: main (main.c:875)
>
> (and many more errors of the same kind)
>
> I can reproduce these errors often (not 100%) by doing a search
> with '/' using a regex containing a few random characters:
>
> $ echo test > test.txt
> $ vim -c "/$(head -c 10 /dev/urandom | tee testcase)" test.txt 2> vg.log
>
> For example, the errors happen every time when the testcase file contains:
>
> $ od -An -x testcase
> 6d8d 7aa2 4e33 8214 f492
>
> The bug happens when the regex ends with an invalid (or incomplete)
> UTF-8 sequence.
>
> Regexes with invalid UTF-8 sequences are not normally supposed to
> occur, so this is a low priority bug. However, accessing a buffer
> out of bounds should not happen, even with a garbage regex.
>
> I can see why it happens:
>
> regexp.c:
>
> 2331 for (len = 0; c != NUL && (len == 0
> 2332 || (re_multi_type(peekchr()) == NOT_MULTI
> 2333 && !one_exactly
> 2334 && !is_Magic(c))); ++len)
> 2335 {
> 2336 c = no_Magic(c);
> 2337 #ifdef FEAT_MBYTE
> 2338 if (has_mbyte)
> 2339 {
> 2340 regmbc(c);
> 2341 if (enc_utf8)
> 2342 {
> 2343 int l;
> 2344
> 2345 /* Need to get composing character too. */
> 2346 for (;;)
> 2347 {
> 2348 l = utf_ptr2len(regparse);
> 2349 if (!UTF_COMPOSINGLIKE(regparse, regparse + l))
> 2350 break;
> 2351 regmbc(utf_ptr2char(regparse));
> 2352 skipchr();
> 2353 }
> 2354 }
> 2355 }
> 2356 else
> 2357 #endif
> 2358 regc(c);
> 2359 c = getchr();
> 2360 }
>
> peekchr() at line 2332 accesses the regparse[] buffer out of bounds.
> The regparse pointer ends up 1 byte beyond the allocated size because
> getchr() at line 2359 consumed 2 bytes. 2 bytes are consumed because
> getchr() calls skipchr(), which adds 2 to prevchr_len at line 2773
> (regparse is then advanced by prevchr_len at line 2780):
>
> regexp.c:
>
> 2761 static void
> 2762 skipchr()
> 2763 {
> 2764 /* peekchr() eats a backslash, do the same here */
> 2765 if (*regparse == '\\')
> 2766 prevchr_len = 1;
> 2767 else
> 2768 prevchr_len = 0;
> 2769 if (regparse[prevchr_len] != NUL)
> 2770 {
> 2771 #ifdef FEAT_MBYTE
> 2772 if (enc_utf8)
> 2773 prevchr_len += utf_char2len(mb_ptr2char(regparse + prevchr_len));
> 2774 else if (has_mbyte)
> 2775 prevchr_len += (*mb_ptr2len)(regparse + prevchr_len);
> 2776 else
> 2777 #endif
> 2778 ++prevchr_len;
> 2779 }
> 2780 regparse += prevchr_len;
> 2781 prev_at_start = at_start;
>
> At line 2773, the call to mb_ptr2char (a function pointer which points
> to utf_ptr2char() when the encoding is UTF-8) detects an
> invalid/incomplete UTF-8 sequence, but in that case it returns the
> first byte. When this first byte is >= 0x80, the call to
> utf_char2len(), also at line 2773, returns 2 (or more), thus consuming
> 2 bytes. Adding 2 bytes to regparse at line 2780 can make regparse go
> 1 byte beyond the end of the regex string.
>
> I attach a patch which fixes it. Perhaps there is a better way of
> fixing it.

Thanks for locating this problem. For the fix I think it's simplest to
use utf_ptr2len(). It checks for the NUL character in the same way as
mb_ptr2char():

*** ../vim-7.1.159/src/regexp.c Sat Aug 11 13:57:31 2007
--- src/regexp.c Sat Nov 24 13:23:53 2007
***************
*** 2770,2776 ****
{
#ifdef FEAT_MBYTE
if (enc_utf8)
! prevchr_len += utf_char2len(mb_ptr2char(regparse + prevchr_len));
else if (has_mbyte)
prevchr_len += (*mb_ptr2len)(regparse + prevchr_len);
else
--- 2770,2777 ----
{
#ifdef FEAT_MBYTE
if (enc_utf8)
! /* exclude composing chars that mb_ptr2len does include */
! prevchr_len += utf_ptr2len(regparse + prevchr_len);
else if (has_mbyte)
prevchr_len += (*mb_ptr2len)(regparse + prevchr_len);
else
Not sure if that fixes all situations, perhaps you can check that?
--
hundred-and-one symptoms of being an internet addict:
139. You down your lunch in five minutes, at your desk, so you can
spend the rest of the hour surfing the Net.
/// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///