(sorry, didn't get time to look at this until now.)

On Sun, Aug 6, 2017 at 6:38 PM, Rob Landley <r...@landley.net> wrote:
> On 08/04/2017 07:54 PM, enh wrote:
>> When mbrtowc returns -2, all n bytes have been processed. Bionic's
>> interpretation of POSIX is that you must not re-supply those bytes
>> on the next call, and should only supply the bytes needed to complete
>> the character. If you re-supply the bytes on the next call, bionic
>> considers that an illegal sequence and returns -1.
>
> I've had a headache for 3 days and it's really hard for me to make good
> technical decisions right now, because I want to set fire to everything
> for being pointlessly overcomplicated.
>
> What I really wanted was a function that didn't maintain magic internal
> state between calls, but would just return if it couldn't do what I
> asked and let me handle it myself or try again with more data. (When I'm
> escaping invalid chars that's the semantics I need, and why have two?)
>
> I don't know how to make modern libc do the simple thing, it insists on
> maintaining state between calls even when you pass in a NULL for the
> state, because reasons. It sounds like if I want a simple converter, I
> need to write my in lib.c, or maybe wrap this one with a flush after
> every error. (Sending it a zero byte should flush? I think?)

yeah, this is yet another ugly corner of POSIX. you can get what you
want, though it's pretty unguessable. the function mbsinit is a
_predicate_ to tell you whether the mbstate_t is in the initial state
or not. you're supposed to memset or moral equivalent to reset to the
initial state. afaik there's no way to reset the global state. i can't
think i've ever seen one.

> Meanwhile, the code you just sent me is not clearing the magic internal
> state, and it's user-visible:
>
>   $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
>   1
>   $ echo -ne '\xf0\xbf' > one
>   $ echo -ne '\xbf\xbf' > two
>   $ wc -m one two
>   0 one
>   0 two
>   0 total
>   $ ./toybox wc -m one two
>   0 one
>   1 two
>   1 total

the attached updated patch fixes that and adds the test.

>> With these changes, the tests still pass on glibc and also pass on bionic.
> I should test musl. After I go lie down again...
>
> Rob



-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
From bb97b93acb4fcbbb7164728f56c195a13d829db0 Mon Sep 17 00:00:00 2001
From: Elliott Hughes <e...@google.com>
Date: Fri, 4 Aug 2017 17:41:23 -0700
Subject: [PATCH] Fix wc -m on bionic.

When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.

With these changes, the tests still pass on glibc and also pass on bionic.

Also add some extra tests for UTF-8 sequences split across *files*, not
just buffers within a file.
---
 tests/wc.test   | 23 ++++++++++++++++-------
 toys/posix/wc.c | 35 +++++++++++++++--------------------
 2 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/tests/wc.test b/tests/wc.test
index 012bd53..90f733b 100755
--- a/tests/wc.test
+++ b/tests/wc.test
@@ -24,20 +24,29 @@ testing "multiple files" "wc input - file1" \
 #Tests for wc -m
 if printf "%s" "$LANG" | grep -q UTF-8
 then
+  testing "-m stdin" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" ""
 
+  # Test input that splits a UTF-8 encoding across buffers.
   echo -n " " > file1
   for i in $(seq 1 8192)
   do
     echo -n "ü" >> file1
   done
-  testing "-m" "wc -m file1" "8193 file1\n" "" ""
-  testing "-m 2" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" ""
-  echo -n " " > file1
-  for i in $(seq 1 8192)
-  do
-    echo -n "ü" >> file1
-  done
+  testing "-m large" "wc -m file1" "8193 file1\n" "" ""
+
+  # U+3FFFF is the 4-byte sequence 0xf0 0xbf 0xbf 0xbf.
+  # If we split that across files, wc shouldn't internally recombine it
+  # and claim to have seen a character in a file that only contained the
+  # second half.
+  echo -ne '\xf0\xbf\xbf\xbf' > full
+  echo -ne '\xf0\xbf' > part-one
+  echo -ne '\xbf\xbf' > part-two
+  testing "-m partial" "wc -m full part-one part-two" \
+      "1 full\n0 part-one\n0 part-two\n1 total\n" "" ""
+
+  # TODO: note that this doesn't actually test any invalid chars...
   testing "-m (invalid chars)" "wc -m file1" "8193 file1\n" "" ""
+
   NOSPACE=1 testing "-mlw" "wc -mlw input" " 1 2 11 input\n" "hello, 世界!\n" ""
 
 else
diff --git a/toys/posix/wc.c b/toys/posix/wc.c
index a8c3e45..4f5a7a7 100644
--- a/toys/posix/wc.c
+++ b/toys/posix/wc.c
@@ -49,8 +49,8 @@ static void show_lengths(unsigned long *lengths, char *name)
 
 static void do_wc(int fd, char *name)
 {
-  int len = 0, clen = 1, space = 0;
   unsigned long word = 0, lengths[] = {0,0,0,0};
+  mbstate_t mbs = {};
 
   // Speed up common case: wc -c normalfile is file length.
   if (toys.optflags == FLAG_c) {
@@ -64,28 +64,26 @@ static void do_wc(int fd, char *name)
   }
 
   for (;;) {
-    int pos, done = 0, len2 = read(fd, toybuf+len, sizeof(toybuf)-len);
+    int pos, len = read(fd, toybuf, sizeof(toybuf));
 
-    if (len2<0) perror_msg_raw(name);
-    else len += len2;
-    if (len2<1) done++;
+    if (len<0) perror_msg_raw(name);
+    if (len<1) break;
 
     for (pos = 0; pos<len; pos++) {
+      int space;
+
       if (toybuf[pos]=='\n') lengths[0]++;
       lengths[2]++;
+
       if (toys.optflags&FLAG_m) {
-        // If we've consumed next wide char
-        if (--clen<1) {
-          wchar_t wchar;
-
-          // next wide size, don't count invalid, fetch more data if necessary
-          clen = mbrtowc(&wchar, toybuf+pos, len-pos, 0);
-          if (clen == -1) continue;
-          if (clen == -2 && !done) break;
-
-          lengths[3]++;
-          space = iswspace(wchar);
-        }
+        wchar_t wchar;
+        int clen = mbrtowc(&wchar, toybuf+pos, len-pos, &mbs);
+
+        if (clen == -1) continue; // Don't count invalid.
+        if (clen == -2) break; // Fetch more data to complete sequence.
+
+        lengths[3]++;
+        space = iswspace(wchar);
       } else space = isspace(toybuf[pos]);
 
       if (space) word=0;
@@ -94,9 +92,6 @@ static void do_wc(int fd, char *name)
         word=1;
       }
     }
-    if (done) break;
-    if (pos != len) memmove(toybuf, toybuf+pos, len-pos);
-    len -= pos;
   }
 
 show:
-- 
2.14.0.434.g98096fd7a8-goog

_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net

Reply via email to