Hi,
i have begun an audit of libedit. I will post a selection of the
most important diffs here, but in order to not overwhelm the list,
not all of them. If you like to see all of them, please tell me
privately and you'll be in the loop. Actually, i'd love it if one
or two additional people could speak up and help review this ongoing
work.
So, i'm starting from the bottom. The first function is read_char().
Too much is broken in that function to fix it all at once. Here
is the first step, fixing only the first handful of bugs, only the
CHARSET_IS_UTF8 case. Note that fixing all this does not imply
that we enable full UTF-8 support just yet, it still remains disabled
in el_gets(3), file eln.c. Let's first fix the bugs before even
considering to enable additional features.
For now, these fixes mainly help programs explicitly using
wide-character functions like el_wgetc(3) and el_wgets(3). For
programs using the classic el_gets(3), the advantage merely is that
UTF-8 characters get correctly parsed and discarded when the user
has set an UTF-8 locale, instead of stomping on errno and trashing
random bytes around them.
1. When the initial read(2) syscall fails, the function tries
to fix up the terminal I/O state and tries to read(2) again.
Afterwards, errno is likely clobbered and has to be restored -
in case read__fixio() fails, to the errno set by read(2) [fix
by Steffen Nurpmeso (2012) via NetBSD], in case read__fixio()
succeeds, before calling read(2) again, to the initial state
upon function entry, such that errno is unchanged if the second
try succeeds [fix by me].
2. When read(2) returns EOF, return that information to the caller,
do not prod on and potentially access garbage data in the buffer
[fix by Linas Vepstas (2013) via NetBSD].
3. What the code tries to do is to incrementally read bytes until
they form a valid character, then return that character, or else
discard the invalid bytes and retry with whatever follows.
However, that is completely impossible with mbtowc(3), however
hard you try, because mbtowc(3) simply doesn't distinguish
between invalid and incomplete sequences. The particular way
this functions gets it wrong is by discarding too many (even
valid) bytes after an invalid sequence because it hopes the
sequence might still become valid by adding additional bytes
to it.
Of course we have to use mbrtowc(3) to distinguish between
invalid and incomplete sequences, and even then, care is needed.
When an invalid sequence is found, we must not discard the last
byte because it might be valid on its own, or it might start a
valid sequence [fix by me]. Note that this kind of recovery
from incomplete and invalid sequences only works because we
restrict ourselves to UTF-8. In general, with potentially
state-dependent encodings, recovery wouldn't be feasible at all.
4. In general, i don't object to functions in the style of
utf8_islead(). But they are only useful in simple cases
where they help to avoid using complicated library interfaces
like mbrtowc(3). In this case, mbrtowc(3) handles sequences
starting with a continuation byte just fine and returns -1,
so simply delete utf8_islead(), is has no effect except that
it wrongly suppresses setting errno [fix by me].
All these fixes have already been committed to NetBSD.
OK?
Ingo
P.S.
Just in case you wonder, from my first glance around the libedit
code in general, the bug density you witness here doesn't seem
unusual, but of a similar order of magnitude as may be expected in
other parts of the codebase, too.
Index: chartype.h
===================================================================
RCS file: /cvs/src/lib/libedit/chartype.h,v
retrieving revision 1.5
diff -u -p -r1.5 chartype.h
--- chartype.h 17 Oct 2014 06:07:50 -0000 1.5
+++ chartype.h 26 Feb 2016 09:44:31 -0000
@@ -61,8 +61,7 @@
#warning Build environment does not support non-BMP characters
#endif
-#define ct_mbtowc mbtowc
-#define ct_mbtowc_reset mbtowc(0,0,0)
+#define ct_mbrtowc mbrtowc
#define ct_wctomb wctomb
#define ct_wctomb_reset wctomb(0,0)
#define ct_wcstombs wcstombs
@@ -110,8 +109,7 @@
#else /* NARROW */
-#define ct_mbtowc error
-#define ct_mbtowc_reset
+#define ct_mbrtowc error
#define ct_wctomb error
#define ct_wctomb_reset
#define ct_wcstombs(a, b, c) (strncpy(a, b, c), strlen(a))
Index: read.c
===================================================================
RCS file: /cvs/src/lib/libedit/read.c,v
retrieving revision 1.20
diff -u -p -r1.20 read.c
--- read.c 31 Jan 2016 20:42:33 -0000 1.20
+++ read.c 26 Feb 2016 09:44:32 -0000
@@ -288,18 +288,6 @@ read_getcmd(EditLine *el, el_action_t *c
return OKCMD;
}
-#ifdef WIDECHAR
-/* utf8_islead():
- * Test whether a byte is a leading byte of a UTF-8 sequence.
- */
-private int
-utf8_islead(unsigned char c)
-{
- return c < 0x80 || /* single byte char */
- (c >= 0xc2 && c <= 0xf4); /* start of multibyte sequence */
-}
-#endif
-
/* read_char():
* Read a character from the tty.
*/
@@ -311,10 +299,12 @@ read_char(EditLine *el, Char *cp)
char cbuf[MB_LEN_MAX];
int cbp = 0;
int bytes = 0;
+ int save_errno = errno;
again:
el->el_signal->sig_no = 0;
while ((num_read = read(el->el_infd, cbuf + cbp, 1)) == -1) {
+ int e = errno;
switch (el->el_signal->sig_no) {
case SIGCONT:
el_set(el, EL_REFRESH);
@@ -325,26 +315,57 @@ read_char(EditLine *el, Char *cp)
default:
break;
}
- if (!tried && read__fixio(el->el_infd, errno) == 0)
+ if (!tried && read__fixio(el->el_infd, e) == 0) {
+ errno = save_errno;
tried = 1;
- else {
+ } else {
+ errno = e;
*cp = '\0';
return -1;
}
}
+ /* Test for EOF */
+ if (num_read == 0) {
+ *cp = '\0';
+ return 0;
+ }
+
#ifdef WIDECHAR
if (el->el_flags & CHARSET_IS_UTF8) {
- if (!utf8_islead((unsigned char)cbuf[0]))
- goto again; /* discard the byte we read and try again */
+ mbstate_t mbs;
+ size_t rbytes;
+again_lastbyte:
++cbp;
- if ((bytes = ct_mbtowc(cp, cbuf, cbp)) == -1) {
- ct_mbtowc_reset;
+ /* This only works because UTF8 is stateless */
+ memset(&mbs, 0, sizeof(mbs));
+ switch (rbytes = ct_mbrtowc(cp, cbuf, cbp, &mbs)) {
+ case (size_t)-1:
+ if (cbp > 1) {
+ /*
+ * Invalid sequence, discard all bytes
+ * except the last one.
+ */
+ cbuf[0] = cbuf[cbp - 1];
+ cbp = 0;
+ goto again_lastbyte;
+ } else {
+ /* Invalid byte, discard it. */
+ cbp = 0;
+ goto again;
+ }
+ case (size_t)-2:
if (cbp >= MB_LEN_MAX) { /* "shouldn't happen" */
+ errno = EILSEQ;
*cp = '\0';
return -1;
}
+ /* Incomplete sequence, read another byte. */
goto again;
+ default:
+ /* Valid character, process it. */
+ bytes = (int)rbytes;
+ break;
}
} else /* we don't support other multibyte charsets */
#endif