Re: [Toybox] [PATCH] Fix wc -m on bionic.
On 08/22/2017 04:05 PM, enh wrote: > happy with the replacement patch? Sorry, $DAYJOB is eating all my energy these days (http://lists.j-core.org/pipermail/j-core/2017-August/000646.html), and thunderbird loses open reply windows on a reboot (my netbook battery will die _while_plugged_in_ if you leave a USB thing drawing power from a suspended system; it stops charging once the battery fills up and then the USB drains it again, then it recharges the battery once it's fully "off")... Where was I on this one... > On Tue, Aug 8, 2017 at 3:21 PM, enh wrote: >>> I don't know how to make modern libc do the simple thing, it insists on >>> maintaining state between calls even when you pass in a NULL for the >>> state, because reasons. It sounds like if I want a simple converter, I >>> need to write my in lib.c, or maybe wrap this one with a flush after >>> every error. (Sending it a zero byte should flush? I think?) >> >> yeah, this is yet another ugly corner of POSIX. you can get what you >> want, though it's pretty unguessable. the function mbsinit is a >> _predicate_ to tell you whether the mbstate_t is in the initial state >> or not. you're supposed to memset or moral equivalent to reset to the >> initial state. afaik there's no way to reset the global state. i can't >> think i've ever seen one. Before you sent me the new patch I wrote my own "grab the next utf8 char without a state structure or caring about environment variables" function (it wasn't hard and it's something I've been thinking about to try to speed up "top" anyway; top level can be an inline function with the "if (*s<128) {*len = 1; return *s}" fast path and then it can call the real function for the rest), but I'm really uncomfortable replacing a libc function like this so I wanted to compare mine with musl's implementation _and_ do utf8 torture tests for all the weird corner cases, so I needed to do a toys/example/test_utf8.c and appropriate tests/test_utf8.test, and then you sent an updated patch and I went "must compare/collate" but didn't get back to it... >>> Meanwhile, the code you just sent me is not clearing the magic internal >>> state, and it's user-visible: >>> >>> $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m >>> 1 >>> $ echo -ne '\xf0\xbf' > one >>> $ echo -ne '\xbf\xbf' > two >>> $ wc -m one two >>> 0 one >>> 0 two >>> 0 total >>> $ ./toybox wc -m one two >>> 0 one >>> 1 two >>> 1 total >> >> the attached updated patch fixes that and adds the test. On the bright side, I'm mostly caught up on $DAYJOB stuff from when I was out sick for a few days. :) Not remotely caught up with toybox yet, and the past few times I poked at it was trying to get the next round of lsof.c cleanup ready (which is relaxing because nobody's waiting for it)... Gimme one more day to try to finish my utf8 replacement code, and if I haven't I'll merge your patch as-is. Sorry for the delay, Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] [PATCH] Fix wc -m on bionic.
happy with the replacement patch? On Tue, Aug 8, 2017 at 3:21 PM, enh wrote: > (sorry, didn't get time to look at this until now.) > > On Sun, Aug 6, 2017 at 6:38 PM, Rob Landley wrote: >> On 08/04/2017 07:54 PM, enh wrote: >>> When mbrtowc returns -2, all n bytes have been processed. Bionic's >>> interpretation of POSIX is that you must not re-supply those bytes >>> on the next call, and should only supply the bytes needed to complete >>> the character. If you re-supply the bytes on the next call, bionic >>> considers that an illegal sequence and returns -1. >> >> I've had a headache for 3 days and it's really hard for me to make good >> technical decisions right now, because I want to set fire to everything >> for being pointlessly overcomplicated. >> >> What I really wanted was a function that didn't maintain magic internal >> state between calls, but would just return if it couldn't do what I >> asked and let me handle it myself or try again with more data. (When I'm >> escaping invalid chars that's the semantics I need, and why have two?) >> >> I don't know how to make modern libc do the simple thing, it insists on >> maintaining state between calls even when you pass in a NULL for the >> state, because reasons. It sounds like if I want a simple converter, I >> need to write my in lib.c, or maybe wrap this one with a flush after >> every error. (Sending it a zero byte should flush? I think?) > > yeah, this is yet another ugly corner of POSIX. you can get what you > want, though it's pretty unguessable. the function mbsinit is a > _predicate_ to tell you whether the mbstate_t is in the initial state > or not. you're supposed to memset or moral equivalent to reset to the > initial state. afaik there's no way to reset the global state. i can't > think i've ever seen one. > >> Meanwhile, the code you just sent me is not clearing the magic internal >> state, and it's user-visible: >> >> $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m >> 1 >> $ echo -ne '\xf0\xbf' > one >> $ echo -ne '\xbf\xbf' > two >> $ wc -m one two >> 0 one >> 0 two >> 0 total >> $ ./toybox wc -m one two >> 0 one >> 1 two >> 1 total > > the attached updated patch fixes that and adds the test. > >>> With these changes, the tests still pass on glibc and also pass on bionic. >> I should test musl. After I go lie down again... >> >> Rob > > > > -- > Elliott Hughes - http://who/enh - http://jessies.org/~enh/ > Android native code/tools questions? Mail me/drop by/add me as a reviewer. -- Elliott Hughes - http://who/enh - http://jessies.org/~enh/ Android native code/tools questions? Mail me/drop by/add me as a reviewer. ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] [PATCH] Fix wc -m on bionic.
(sorry, didn't get time to look at this until now.) On Sun, Aug 6, 2017 at 6:38 PM, Rob Landley wrote: > On 08/04/2017 07:54 PM, enh wrote: >> When mbrtowc returns -2, all n bytes have been processed. Bionic's >> interpretation of POSIX is that you must not re-supply those bytes >> on the next call, and should only supply the bytes needed to complete >> the character. If you re-supply the bytes on the next call, bionic >> considers that an illegal sequence and returns -1. > > I've had a headache for 3 days and it's really hard for me to make good > technical decisions right now, because I want to set fire to everything > for being pointlessly overcomplicated. > > What I really wanted was a function that didn't maintain magic internal > state between calls, but would just return if it couldn't do what I > asked and let me handle it myself or try again with more data. (When I'm > escaping invalid chars that's the semantics I need, and why have two?) > > I don't know how to make modern libc do the simple thing, it insists on > maintaining state between calls even when you pass in a NULL for the > state, because reasons. It sounds like if I want a simple converter, I > need to write my in lib.c, or maybe wrap this one with a flush after > every error. (Sending it a zero byte should flush? I think?) yeah, this is yet another ugly corner of POSIX. you can get what you want, though it's pretty unguessable. the function mbsinit is a _predicate_ to tell you whether the mbstate_t is in the initial state or not. you're supposed to memset or moral equivalent to reset to the initial state. afaik there's no way to reset the global state. i can't think i've ever seen one. > Meanwhile, the code you just sent me is not clearing the magic internal > state, and it's user-visible: > > $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m > 1 > $ echo -ne '\xf0\xbf' > one > $ echo -ne '\xbf\xbf' > two > $ wc -m one two > 0 one > 0 two > 0 total > $ ./toybox wc -m one two > 0 one > 1 two > 1 total the attached updated patch fixes that and adds the test. >> With these changes, the tests still pass on glibc and also pass on bionic. > I should test musl. After I go lie down again... > > Rob -- Elliott Hughes - http://who/enh - http://jessies.org/~enh/ Android native code/tools questions? Mail me/drop by/add me as a reviewer. From bb97b93acb4fcbbb7164728f56c195a13d829db0 Mon Sep 17 00:00:00 2001 From: Elliott Hughes Date: Fri, 4 Aug 2017 17:41:23 -0700 Subject: [PATCH] Fix wc -m on bionic. When mbrtowc returns -2, all n bytes have been processed. Bionic's interpretation of POSIX is that you must not re-supply those bytes on the next call, and should only supply the bytes needed to complete the character. If you re-supply the bytes on the next call, bionic considers that an illegal sequence and returns -1. With these changes, the tests still pass on glibc and also pass on bionic. Also add some extra tests for UTF-8 sequences split across *files*, not just buffers within a file. --- tests/wc.test | 23 --- toys/posix/wc.c | 35 +++ 2 files changed, 31 insertions(+), 27 deletions(-) diff --git a/tests/wc.test b/tests/wc.test index 012bd53..90f733b 100755 --- a/tests/wc.test +++ b/tests/wc.test @@ -24,20 +24,29 @@ testing "multiple files" "wc input - file1" \ #Tests for wc -m if printf "%s" "$LANG" | grep -q UTF-8 then + testing "-m stdin" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" "" + # Test input that splits a UTF-8 encoding across buffers. echo -n " " > file1 for i in $(seq 1 8192) do echo -n "ü" >> file1 done - testing "-m" "wc -m file1" "8193 file1\n" "" "" - testing "-m 2" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" "" - echo -n " " > file1 - for i in $(seq 1 8192) - do -echo -n "ü" >> file1 - done + testing "-m large" "wc -m file1" "8193 file1\n" "" "" + + # U+3 is the 4-byte sequence 0xf0 0xbf 0xbf 0xbf. + # If we split that across files, wc shouldn't internally recombine it + # and claim to have seen a character in a file that only contained the + # second half. + echo -ne '\xf0\xbf\xbf\xbf' > full + echo -ne '\xf0\xbf' > part-one + echo -ne '\xbf\xbf' > part-two + testing "-m partial" "wc -m full part-one part-two" \ + "1 full\n0 part-one\n0 part-two\n1 total\n" "" "" + + # TODO: note that this doesn't actually test any invalid chars... testing "-m (invalid chars)" "wc -m file1" "8193 file1\n" "" "" + NOSPACE=1 testing "-mlw" "wc -mlw input" " 1 2 11 input\n" "hello, 世界!\n" "" else diff --git a/toys/posix/wc.c b/toys/posix/wc.c index a8c3e45..4f5a7a7 100644 --- a/toys/posix/wc.c +++ b/toys/posix/wc.c @@ -49,8 +49,8 @@ static void show_lengths(unsigned long *lengths, char *name) static void do_wc(int fd, char *name) { - int len = 0, clen = 1, space = 0; unsigned long word = 0, lengths[] = {0,0,0,0}; + mbstate_t mbs = {}; // Speed up common case: wc -c normalfile i
Re: [Toybox] [PATCH] Fix wc -m on bionic.
On 08/04/2017 07:54 PM, enh wrote: > When mbrtowc returns -2, all n bytes have been processed. Bionic's > interpretation of POSIX is that you must not re-supply those bytes > on the next call, and should only supply the bytes needed to complete > the character. If you re-supply the bytes on the next call, bionic > considers that an illegal sequence and returns -1. I've had a headache for 3 days and it's really hard for me to make good technical decisions right now, because I want to set fire to everything for being pointlessly overcomplicated. What I really wanted was a function that didn't maintain magic internal state between calls, but would just return if it couldn't do what I asked and let me handle it myself or try again with more data. (When I'm escaping invalid chars that's the semantics I need, and why have two?) I don't know how to make modern libc do the simple thing, it insists on maintaining state between calls even when you pass in a NULL for the state, because reasons. It sounds like if I want a simple converter, I need to write my in lib.c, or maybe wrap this one with a flush after every error. (Sending it a zero byte should flush? I think?) Meanwhile, the code you just sent me is not clearing the magic internal state, and it's user-visible: $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m 1 $ echo -ne '\xf0\xbf' > one $ echo -ne '\xbf\xbf' > two $ wc -m one two 0 one 0 two 0 total $ ./toybox wc -m one two 0 one 1 two 1 total > With these changes, the tests still pass on glibc and also pass on bionic. I should test musl. After I go lie down again... Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net