Hi,

i have begun an audit of libedit.  I will post a selection of the
most important diffs here, but in order to not overwhelm the list,
not all of them.  If you like to see all of them, please tell me
privately and you'll be in the loop.  Actually, i'd love it if one
or two additional people could speak up and help review this ongoing
work.

So, i'm starting from the bottom.  The first function is read_char().
Too much is broken in that function to fix it all at once.  Here
is the first step, fixing only the first handful of bugs, only the
CHARSET_IS_UTF8 case.  Note that fixing all this does not imply
that we enable full UTF-8 support just yet, it still remains disabled
in el_gets(3), file eln.c.  Let's first fix the bugs before even
considering to enable additional features.

For now, these fixes mainly help programs explicitly using
wide-character functions like el_wgetc(3) and el_wgets(3).  For
programs using the classic el_gets(3), the advantage merely is that
UTF-8 characters get correctly parsed and discarded when the user
has set an UTF-8 locale, instead of stomping on errno and trashing
random bytes around them.

 1. When the initial read(2) syscall fails, the function tries
    to fix up the terminal I/O state and tries to read(2) again.
    Afterwards, errno is likely clobbered and has to be restored -
    in case read__fixio() fails, to the errno set by read(2) [fix
    by Steffen Nurpmeso (2012) via NetBSD], in case read__fixio()
    succeeds, before calling read(2) again, to the initial state
    upon function entry, such that errno is unchanged if the second
    try succeeds [fix by me].

 2. When read(2) returns EOF, return that information to the caller,
    do not prod on and potentially access garbage data in the buffer
    [fix by Linas Vepstas (2013) via NetBSD].

 3. What the code tries to do is to incrementally read bytes until
    they form a valid character, then return that character, or else
    discard the invalid bytes and retry with whatever follows.
    However, that is completely impossible with mbtowc(3), however
    hard you try, because mbtowc(3) simply doesn't distinguish
    between invalid and incomplete sequences.  The particular way
    this functions gets it wrong is by discarding too many (even
    valid) bytes after an invalid sequence because it hopes the
    sequence might still become valid by adding additional bytes
    to it.

    Of course we have to use mbrtowc(3) to distinguish between
    invalid and incomplete sequences, and even then, care is needed.
    When an invalid sequence is found, we must not discard the last
    byte because it might be valid on its own, or it might start a
    valid sequence [fix by me].  Note that this kind of recovery
    from incomplete and invalid sequences only works because we
    restrict ourselves to UTF-8.  In general, with potentially
    state-dependent encodings, recovery wouldn't be feasible at all.

 4. In general, i don't object to functions in the style of
    utf8_islead().  But they are only useful in simple cases
    where they help to avoid using complicated library interfaces
    like mbrtowc(3).  In this case, mbrtowc(3) handles sequences
    starting with a continuation byte just fine and returns -1,
    so simply delete utf8_islead(), is has no effect except that
    it wrongly suppresses setting errno [fix by me].

All these fixes have already been committed to NetBSD.

OK?
  Ingo

P.S.
Just in case you wonder, from my first glance around the libedit
code in general, the bug density you witness here doesn't seem
unusual, but of a similar order of magnitude as may be expected in
other parts of the codebase, too.


Index: chartype.h
===================================================================
RCS file: /cvs/src/lib/libedit/chartype.h,v
retrieving revision 1.5
diff -u -p -r1.5 chartype.h
--- chartype.h  17 Oct 2014 06:07:50 -0000      1.5
+++ chartype.h  26 Feb 2016 09:44:31 -0000
@@ -61,8 +61,7 @@
 #warning Build environment does not support non-BMP characters
 #endif
 
-#define ct_mbtowc            mbtowc
-#define ct_mbtowc_reset      mbtowc(0,0,0)
+#define ct_mbrtowc           mbrtowc
 #define ct_wctomb            wctomb
 #define ct_wctomb_reset      wctomb(0,0)
 #define ct_wcstombs          wcstombs
@@ -110,8 +109,7 @@
 
 #else /* NARROW */
 
-#define ct_mbtowc            error
-#define ct_mbtowc_reset      
+#define ct_mbrtowc           error
 #define ct_wctomb            error
 #define ct_wctomb_reset      
 #define ct_wcstombs(a, b, c)    (strncpy(a, b, c), strlen(a))
Index: read.c
===================================================================
RCS file: /cvs/src/lib/libedit/read.c,v
retrieving revision 1.20
diff -u -p -r1.20 read.c
--- read.c      31 Jan 2016 20:42:33 -0000      1.20
+++ read.c      26 Feb 2016 09:44:32 -0000
@@ -288,18 +288,6 @@ read_getcmd(EditLine *el, el_action_t *c
        return OKCMD;
 }
 
-#ifdef WIDECHAR
-/* utf8_islead():
- *     Test whether a byte is a leading byte of a UTF-8 sequence.
- */
-private int
-utf8_islead(unsigned char c)
-{
-       return c < 0x80 ||             /* single byte char */
-              (c >= 0xc2 && c <= 0xf4); /* start of multibyte sequence */
-}
-#endif
-
 /* read_char():
  *     Read a character from the tty.
  */
@@ -311,10 +299,12 @@ read_char(EditLine *el, Char *cp)
        char cbuf[MB_LEN_MAX];
        int cbp = 0;
        int bytes = 0;
+       int save_errno = errno;
 
  again:
        el->el_signal->sig_no = 0;
        while ((num_read = read(el->el_infd, cbuf + cbp, 1)) == -1) {
+               int e = errno;
                switch (el->el_signal->sig_no) {
                case SIGCONT:
                        el_set(el, EL_REFRESH);
@@ -325,26 +315,57 @@ read_char(EditLine *el, Char *cp)
                default:
                        break;
                }
-               if (!tried && read__fixio(el->el_infd, errno) == 0)
+               if (!tried && read__fixio(el->el_infd, e) == 0) {
+                       errno = save_errno;
                        tried = 1;
-               else {
+               } else {
+                       errno = e;
                        *cp = '\0';
                        return -1;
                }
        }
 
+       /* Test for EOF */
+       if (num_read == 0) {
+               *cp = '\0';
+               return 0;
+       }
+
 #ifdef WIDECHAR
        if (el->el_flags & CHARSET_IS_UTF8) {
-               if (!utf8_islead((unsigned char)cbuf[0]))
-                       goto again; /* discard the byte we read and try again */
+               mbstate_t mbs;
+               size_t rbytes;
+again_lastbyte:
                ++cbp;
-               if ((bytes = ct_mbtowc(cp, cbuf, cbp)) == -1) {
-                       ct_mbtowc_reset;
+               /* This only works because UTF8 is stateless */
+               memset(&mbs, 0, sizeof(mbs));
+               switch (rbytes = ct_mbrtowc(cp, cbuf, cbp, &mbs)) {
+               case (size_t)-1:
+                       if (cbp > 1) {
+                               /*
+                                * Invalid sequence, discard all bytes
+                                * except the last one.
+                                */
+                               cbuf[0] = cbuf[cbp - 1];
+                               cbp = 0;
+                               goto again_lastbyte;
+                       } else {
+                               /* Invalid byte, discard it. */
+                               cbp = 0;
+                               goto again;
+                       }
+               case (size_t)-2:
                        if (cbp >= MB_LEN_MAX) { /* "shouldn't happen" */
+                               errno = EILSEQ;
                                *cp = '\0';
                                return -1;
                        }
+                       /* Incomplete sequence, read another byte. */
                        goto again;
+               default:
+                       /* Valid character, process it. */
+                       bytes = (int)rbytes;
+                       break;
                }
        } else  /* we don't support other multibyte charsets */
 #endif

Reply via email to