(sorry, didn't get time to look at this until now.) On Sun, Aug 6, 2017 at 6:38 PM, Rob Landley <r...@landley.net> wrote: > On 08/04/2017 07:54 PM, enh wrote: >> When mbrtowc returns -2, all n bytes have been processed. Bionic's >> interpretation of POSIX is that you must not re-supply those bytes >> on the next call, and should only supply the bytes needed to complete >> the character. If you re-supply the bytes on the next call, bionic >> considers that an illegal sequence and returns -1. > > I've had a headache for 3 days and it's really hard for me to make good > technical decisions right now, because I want to set fire to everything > for being pointlessly overcomplicated. > > What I really wanted was a function that didn't maintain magic internal > state between calls, but would just return if it couldn't do what I > asked and let me handle it myself or try again with more data. (When I'm > escaping invalid chars that's the semantics I need, and why have two?) > > I don't know how to make modern libc do the simple thing, it insists on > maintaining state between calls even when you pass in a NULL for the > state, because reasons. It sounds like if I want a simple converter, I > need to write my in lib.c, or maybe wrap this one with a flush after > every error. (Sending it a zero byte should flush? I think?)
yeah, this is yet another ugly corner of POSIX. you can get what you want, though it's pretty unguessable. the function mbsinit is a _predicate_ to tell you whether the mbstate_t is in the initial state or not. you're supposed to memset or moral equivalent to reset to the initial state. afaik there's no way to reset the global state. i can't think i've ever seen one. > Meanwhile, the code you just sent me is not clearing the magic internal > state, and it's user-visible: > > $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m > 1 > $ echo -ne '\xf0\xbf' > one > $ echo -ne '\xbf\xbf' > two > $ wc -m one two > 0 one > 0 two > 0 total > $ ./toybox wc -m one two > 0 one > 1 two > 1 total the attached updated patch fixes that and adds the test. >> With these changes, the tests still pass on glibc and also pass on bionic. > I should test musl. After I go lie down again... > > Rob -- Elliott Hughes - http://who/enh - http://jessies.org/~enh/ Android native code/tools questions? Mail me/drop by/add me as a reviewer.
From bb97b93acb4fcbbb7164728f56c195a13d829db0 Mon Sep 17 00:00:00 2001 From: Elliott Hughes <e...@google.com> Date: Fri, 4 Aug 2017 17:41:23 -0700 Subject: [PATCH] Fix wc -m on bionic. When mbrtowc returns -2, all n bytes have been processed. Bionic's interpretation of POSIX is that you must not re-supply those bytes on the next call, and should only supply the bytes needed to complete the character. If you re-supply the bytes on the next call, bionic considers that an illegal sequence and returns -1. With these changes, the tests still pass on glibc and also pass on bionic. Also add some extra tests for UTF-8 sequences split across *files*, not just buffers within a file. --- tests/wc.test | 23 ++++++++++++++++------- toys/posix/wc.c | 35 +++++++++++++++-------------------- 2 files changed, 31 insertions(+), 27 deletions(-) diff --git a/tests/wc.test b/tests/wc.test index 012bd53..90f733b 100755 --- a/tests/wc.test +++ b/tests/wc.test @@ -24,20 +24,29 @@ testing "multiple files" "wc input - file1" \ #Tests for wc -m if printf "%s" "$LANG" | grep -q UTF-8 then + testing "-m stdin" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" "" + # Test input that splits a UTF-8 encoding across buffers. echo -n " " > file1 for i in $(seq 1 8192) do echo -n "ü" >> file1 done - testing "-m" "wc -m file1" "8193 file1\n" "" "" - testing "-m 2" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" "" - echo -n " " > file1 - for i in $(seq 1 8192) - do - echo -n "ü" >> file1 - done + testing "-m large" "wc -m file1" "8193 file1\n" "" "" + + # U+3FFFF is the 4-byte sequence 0xf0 0xbf 0xbf 0xbf. + # If we split that across files, wc shouldn't internally recombine it + # and claim to have seen a character in a file that only contained the + # second half. + echo -ne '\xf0\xbf\xbf\xbf' > full + echo -ne '\xf0\xbf' > part-one + echo -ne '\xbf\xbf' > part-two + testing "-m partial" "wc -m full part-one part-two" \ + "1 full\n0 part-one\n0 part-two\n1 total\n" "" "" + + # TODO: note that this doesn't actually test any invalid chars... testing "-m (invalid chars)" "wc -m file1" "8193 file1\n" "" "" + NOSPACE=1 testing "-mlw" "wc -mlw input" " 1 2 11 input\n" "hello, 世界!\n" "" else diff --git a/toys/posix/wc.c b/toys/posix/wc.c index a8c3e45..4f5a7a7 100644 --- a/toys/posix/wc.c +++ b/toys/posix/wc.c @@ -49,8 +49,8 @@ static void show_lengths(unsigned long *lengths, char *name) static void do_wc(int fd, char *name) { - int len = 0, clen = 1, space = 0; unsigned long word = 0, lengths[] = {0,0,0,0}; + mbstate_t mbs = {}; // Speed up common case: wc -c normalfile is file length. if (toys.optflags == FLAG_c) { @@ -64,28 +64,26 @@ static void do_wc(int fd, char *name) } for (;;) { - int pos, done = 0, len2 = read(fd, toybuf+len, sizeof(toybuf)-len); + int pos, len = read(fd, toybuf, sizeof(toybuf)); - if (len2<0) perror_msg_raw(name); - else len += len2; - if (len2<1) done++; + if (len<0) perror_msg_raw(name); + if (len<1) break; for (pos = 0; pos<len; pos++) { + int space; + if (toybuf[pos]=='\n') lengths[0]++; lengths[2]++; + if (toys.optflags&FLAG_m) { - // If we've consumed next wide char - if (--clen<1) { - wchar_t wchar; - - // next wide size, don't count invalid, fetch more data if necessary - clen = mbrtowc(&wchar, toybuf+pos, len-pos, 0); - if (clen == -1) continue; - if (clen == -2 && !done) break; - - lengths[3]++; - space = iswspace(wchar); - } + wchar_t wchar; + int clen = mbrtowc(&wchar, toybuf+pos, len-pos, &mbs); + + if (clen == -1) continue; // Don't count invalid. + if (clen == -2) break; // Fetch more data to complete sequence. + + lengths[3]++; + space = iswspace(wchar); } else space = isspace(toybuf[pos]); if (space) word=0; @@ -94,9 +92,6 @@ static void do_wc(int fd, char *name) word=1; } } - if (done) break; - if (pos != len) memmove(toybuf, toybuf+pos, len-pos); - len -= pos; } show: -- 2.14.0.434.g98096fd7a8-goog
_______________________________________________ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net