Re: [Toybox] [PATCH] Fix wc -m on bionic.

2017-08-23 Thread Rob Landley
On 08/22/2017 04:05 PM, enh wrote:
> happy with the replacement patch?

Sorry, $DAYJOB is eating all my energy these days
(http://lists.j-core.org/pipermail/j-core/2017-August/000646.html), and
thunderbird loses open reply windows on a reboot (my netbook battery
will die _while_plugged_in_ if you leave a USB thing drawing power from
a suspended system; it stops charging once the battery fills up and then
the USB drains it again, then it recharges the battery once it's fully
"off")...

Where was I on this one...

> On Tue, Aug 8, 2017 at 3:21 PM, enh  wrote:
>>> I don't know how to make modern libc do the simple thing, it insists on
>>> maintaining state between calls even when you pass in a NULL for the
>>> state, because reasons. It sounds like if I want a simple converter, I
>>> need to write my in lib.c, or maybe wrap this one with a flush after
>>> every error. (Sending it a zero byte should flush? I think?)
>>
>> yeah, this is yet another ugly corner of POSIX. you can get what you
>> want, though it's pretty unguessable. the function mbsinit is a
>> _predicate_ to tell you whether the mbstate_t is in the initial state
>> or not. you're supposed to memset or moral equivalent to reset to the
>> initial state. afaik there's no way to reset the global state. i can't
>> think i've ever seen one.
Before you sent me the new patch I wrote my own "grab the next utf8 char
without a state structure or caring about environment variables"
function (it wasn't hard and it's something I've been thinking about to
try to speed up "top" anyway; top level can be an inline function with
the "if (*s<128) {*len = 1; return *s}" fast path and then it can call
the real function for the rest), but I'm really uncomfortable replacing
a libc function like this so I wanted to compare mine with musl's
implementation _and_ do utf8 torture tests for all the weird corner
cases, so I needed to do a toys/example/test_utf8.c and appropriate
tests/test_utf8.test, and then you sent an updated patch and I went
"must compare/collate" but didn't get back to it...

>>> Meanwhile, the code you just sent me is not clearing the magic internal
>>> state, and it's user-visible:
>>>
>>>   $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
>>>   1
>>>   $ echo -ne '\xf0\xbf' > one
>>>   $ echo -ne '\xbf\xbf' > two
>>>   $ wc -m one two
>>>   0 one
>>>   0 two
>>>   0 total
>>>   $ ./toybox wc -m one two
>>>   0 one
>>>   1 two
>>>   1 total
>>
>> the attached updated patch fixes that and adds the test.

On the bright side, I'm mostly caught up on $DAYJOB stuff from when I
was out sick for a few days. :)

Not remotely caught up with toybox yet, and the past few times I poked
at it was trying to get the next round of lsof.c cleanup ready (which is
relaxing because nobody's waiting for it)...

Gimme one more day to try to finish my utf8 replacement code, and if I
haven't I'll merge your patch as-is. Sorry for the delay,

Rob
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] [PATCH] Fix wc -m on bionic.

2017-08-22 Thread enh
happy with the replacement patch?

On Tue, Aug 8, 2017 at 3:21 PM, enh  wrote:
> (sorry, didn't get time to look at this until now.)
>
> On Sun, Aug 6, 2017 at 6:38 PM, Rob Landley  wrote:
>> On 08/04/2017 07:54 PM, enh wrote:
>>> When mbrtowc returns -2, all n bytes have been processed. Bionic's
>>> interpretation of POSIX is that you must not re-supply those bytes
>>> on the next call, and should only supply the bytes needed to complete
>>> the character. If you re-supply the bytes on the next call, bionic
>>> considers that an illegal sequence and returns -1.
>>
>> I've had a headache for 3 days and it's really hard for me to make good
>> technical decisions right now, because I want to set fire to everything
>> for being pointlessly overcomplicated.
>>
>> What I really wanted was a function that didn't maintain magic internal
>> state between calls, but would just return if it couldn't do what I
>> asked and let me handle it myself or try again with more data. (When I'm
>> escaping invalid chars that's the semantics I need, and why have two?)
>>
>> I don't know how to make modern libc do the simple thing, it insists on
>> maintaining state between calls even when you pass in a NULL for the
>> state, because reasons. It sounds like if I want a simple converter, I
>> need to write my in lib.c, or maybe wrap this one with a flush after
>> every error. (Sending it a zero byte should flush? I think?)
>
> yeah, this is yet another ugly corner of POSIX. you can get what you
> want, though it's pretty unguessable. the function mbsinit is a
> _predicate_ to tell you whether the mbstate_t is in the initial state
> or not. you're supposed to memset or moral equivalent to reset to the
> initial state. afaik there's no way to reset the global state. i can't
> think i've ever seen one.
>
>> Meanwhile, the code you just sent me is not clearing the magic internal
>> state, and it's user-visible:
>>
>>   $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
>>   1
>>   $ echo -ne '\xf0\xbf' > one
>>   $ echo -ne '\xbf\xbf' > two
>>   $ wc -m one two
>>   0 one
>>   0 two
>>   0 total
>>   $ ./toybox wc -m one two
>>   0 one
>>   1 two
>>   1 total
>
> the attached updated patch fixes that and adds the test.
>
>>> With these changes, the tests still pass on glibc and also pass on bionic.
>> I should test musl. After I go lie down again...
>>
>> Rob
>
>
>
> --
> Elliott Hughes - http://who/enh - http://jessies.org/~enh/
> Android native code/tools questions? Mail me/drop by/add me as a reviewer.



-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] [PATCH] Fix wc -m on bionic.

2017-08-08 Thread enh
(sorry, didn't get time to look at this until now.)

On Sun, Aug 6, 2017 at 6:38 PM, Rob Landley  wrote:
> On 08/04/2017 07:54 PM, enh wrote:
>> When mbrtowc returns -2, all n bytes have been processed. Bionic's
>> interpretation of POSIX is that you must not re-supply those bytes
>> on the next call, and should only supply the bytes needed to complete
>> the character. If you re-supply the bytes on the next call, bionic
>> considers that an illegal sequence and returns -1.
>
> I've had a headache for 3 days and it's really hard for me to make good
> technical decisions right now, because I want to set fire to everything
> for being pointlessly overcomplicated.
>
> What I really wanted was a function that didn't maintain magic internal
> state between calls, but would just return if it couldn't do what I
> asked and let me handle it myself or try again with more data. (When I'm
> escaping invalid chars that's the semantics I need, and why have two?)
>
> I don't know how to make modern libc do the simple thing, it insists on
> maintaining state between calls even when you pass in a NULL for the
> state, because reasons. It sounds like if I want a simple converter, I
> need to write my in lib.c, or maybe wrap this one with a flush after
> every error. (Sending it a zero byte should flush? I think?)

yeah, this is yet another ugly corner of POSIX. you can get what you
want, though it's pretty unguessable. the function mbsinit is a
_predicate_ to tell you whether the mbstate_t is in the initial state
or not. you're supposed to memset or moral equivalent to reset to the
initial state. afaik there's no way to reset the global state. i can't
think i've ever seen one.

> Meanwhile, the code you just sent me is not clearing the magic internal
> state, and it's user-visible:
>
>   $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
>   1
>   $ echo -ne '\xf0\xbf' > one
>   $ echo -ne '\xbf\xbf' > two
>   $ wc -m one two
>   0 one
>   0 two
>   0 total
>   $ ./toybox wc -m one two
>   0 one
>   1 two
>   1 total

the attached updated patch fixes that and adds the test.

>> With these changes, the tests still pass on glibc and also pass on bionic.
> I should test musl. After I go lie down again...
>
> Rob



-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
From bb97b93acb4fcbbb7164728f56c195a13d829db0 Mon Sep 17 00:00:00 2001
From: Elliott Hughes 
Date: Fri, 4 Aug 2017 17:41:23 -0700
Subject: [PATCH] Fix wc -m on bionic.

When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.

With these changes, the tests still pass on glibc and also pass on bionic.

Also add some extra tests for UTF-8 sequences split across *files*, not
just buffers within a file.
---
 tests/wc.test   | 23 ---
 toys/posix/wc.c | 35 +++
 2 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/tests/wc.test b/tests/wc.test
index 012bd53..90f733b 100755
--- a/tests/wc.test
+++ b/tests/wc.test
@@ -24,20 +24,29 @@ testing "multiple files" "wc input - file1" \
 #Tests for wc -m
 if printf "%s" "$LANG" | grep -q UTF-8
 then
+  testing "-m stdin" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" ""
 
+  # Test input that splits a UTF-8 encoding across buffers.
   echo -n " " > file1
   for i in $(seq 1 8192)
   do
 echo -n "ü" >> file1
   done
-  testing "-m" "wc -m file1" "8193 file1\n" "" ""
-  testing "-m 2" 'cat "$FILES/utf8/test2.txt" | wc -m' "169\n" "" ""
-  echo -n " " > file1
-  for i in $(seq 1 8192)
-  do
-echo -n "ü" >> file1
-  done
+  testing "-m large" "wc -m file1" "8193 file1\n" "" ""
+
+  # U+3 is the 4-byte sequence 0xf0 0xbf 0xbf 0xbf.
+  # If we split that across files, wc shouldn't internally recombine it
+  # and claim to have seen a character in a file that only contained the
+  # second half.
+  echo -ne '\xf0\xbf\xbf\xbf' > full
+  echo -ne '\xf0\xbf' > part-one
+  echo -ne '\xbf\xbf' > part-two
+  testing "-m partial" "wc -m full part-one part-two" \
+  "1 full\n0 part-one\n0 part-two\n1 total\n" "" ""
+
+  # TODO: note that this doesn't actually test any invalid chars...
   testing "-m (invalid chars)" "wc -m file1" "8193 file1\n" "" ""
+
   NOSPACE=1 testing "-mlw" "wc -mlw input" " 1 2 11 input\n" "hello, 世界!\n" ""
 
 else
diff --git a/toys/posix/wc.c b/toys/posix/wc.c
index a8c3e45..4f5a7a7 100644
--- a/toys/posix/wc.c
+++ b/toys/posix/wc.c
@@ -49,8 +49,8 @@ static void show_lengths(unsigned long *lengths, char *name)
 
 static void do_wc(int fd, char *name)
 {
-  int len = 0, clen = 1, space = 0;
   unsigned long word = 0, lengths[] = {0,0,0,0};
+  mbstate_t mbs = {};
 
   // Speed up common case: wc -c normalfile i

Re: [Toybox] [PATCH] Fix wc -m on bionic.

2017-08-06 Thread Rob Landley
On 08/04/2017 07:54 PM, enh wrote:
> When mbrtowc returns -2, all n bytes have been processed. Bionic's
> interpretation of POSIX is that you must not re-supply those bytes
> on the next call, and should only supply the bytes needed to complete
> the character. If you re-supply the bytes on the next call, bionic
> considers that an illegal sequence and returns -1.

I've had a headache for 3 days and it's really hard for me to make good
technical decisions right now, because I want to set fire to everything
for being pointlessly overcomplicated.

What I really wanted was a function that didn't maintain magic internal
state between calls, but would just return if it couldn't do what I
asked and let me handle it myself or try again with more data. (When I'm
escaping invalid chars that's the semantics I need, and why have two?)

I don't know how to make modern libc do the simple thing, it insists on
maintaining state between calls even when you pass in a NULL for the
state, because reasons. It sounds like if I want a simple converter, I
need to write my in lib.c, or maybe wrap this one with a flush after
every error. (Sending it a zero byte should flush? I think?)

Meanwhile, the code you just sent me is not clearing the magic internal
state, and it's user-visible:

  $ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
  1
  $ echo -ne '\xf0\xbf' > one
  $ echo -ne '\xbf\xbf' > two
  $ wc -m one two
  0 one
  0 two
  0 total
  $ ./toybox wc -m one two
  0 one
  1 two
  1 total

> With these changes, the tests still pass on glibc and also pass on bionic.
I should test musl. After I go lie down again...

Rob
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net