Re: [Toybox] More expand cleanups

2012-12-01 Thread Felix Janda
On 12/01/12 at 12:42am, Rob Landley wrote:
 On 11/30/2012 12:56:06 PM, Felix Janda wrote:
  On 11/30/12 at 02:51am, Rob Landley wrote:
   I updated wc to theoretically deal with buffer wraps better. In  
  reality
   I haven't got UTF8 test data to run through this, and should  
  probably
   find some at some point.
  
  Didn't the original version already deal well with it. (That's what  
  the
  strange tests are about.) Notice that the C library keeps an internal
  state which is updated each time mbrtowc is called. If there's only  
  part
  of a character on the end mbrtowc will return -2, but remember the  
  part.
  To continue you just have to read a new toybuf and point mbrtowc to  
  the
  _new_ data. (The r in mbrtowc stands for restartable.)
 
 Huh, I didn't know that was what it was doing. Let me go back to your  
 code then.
 
  Do the tests still work?
 
 Yes, but nothing in the test suite tests a buffer wrap. Still, if libc  
 already handles this I'm happy to let it do so.

The $(seq 1 8192)s are exactly in order to produce something bigger than
sizeof(toybuf). The first test uses a string with the first byte a normal
char and then 8192 double byte chars. By some luck the test still produces
the right results with your version. If you build a debug statement to detect
invalid chars into wc, you will see that mbrtowc encounters an invalid byte
sequence although the input is valid.

(As you have written in your commit message invalid byte sequences are
ignored in both versions.)

   This also assumes that all characters are the same width, which is
   probably wrong and I need help with if so. (I dunno how to do
   fontmetrics here?)
  
  I think that this depends on the terminal emulator. Look for example  
  at the
  -cjk_width option of xterm.
 
 I mean can UTF-8 produce characters that are different sizes in the  
 terminal? If the answer is no then i don't have to worry. If it can  
 produce double-wide characters or similar, I'd have to deal with that.

Yes, look at that option of xterm. Depending on whether it is set CJK
characters take 1 or 2 columns in xterm.

I wouldn't try to detect this...

 I know the terminal can run right to left, but that's symmetrical and I  
 don't have to care from a programming standpoint, I think...
 
  Felix
 
 Rob
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] More expand cleanups

2012-11-30 Thread Rob Landley

On 11/28/2012 03:34:59 AM, Jonathan Clairembault wrote:
 Back to expand_file(). The downside of using readall() is that  
interactive
 granularity goes way down. I had this problem with tee once upon  
a time,
 it meant that piping the output of anything through tee made it  
appear in 4k
 chunks, which meant if you logged the result of a build you  
couldn't really
 see what the build was doing. I'm not sure expand has the same use  
cases,

 but that's why I did xread().

Well it seems like gnu/damnit version does buffering as well at least
it does not process input as a line by line basis. I don't see why
using xread changes anything, you probably need fgets here. Though I
think we can safely buffer until someone comes in and raises
interactivity need. wdyt?


I was thinking more along the lines of letting fputc() write data into  
the stdio.h buffer and letting that worry about when to flush it, and  
then we don't have to keep track of two positions.


 Ah, hang on. Internationalization. This thing is going to need  
multibyte
 support for utf8, isn't it? (The same general logic as wc -m. Hmmm,  
I wonder

 if they can share code?)

Ah! I thought toybox was not dealing with internationalization. Though
that's a good thing to have internationalization.


I'm not doing full internationalization with date formats and having  
sort come up with different orders depending on locale, but UTF8  
support is worth doing (with a top level config symbol, a bit like  
floating point support).



 Ok, I'll have to come back to this in the morning.


And it is... no longer morning!  (We'll ignore the two missed days in  
there.)


I updated wc to theoretically deal with buffer wraps better. In reality  
I haven't got UTF8 test data to run through this, and should probably  
find some at some point.


I redid the actual expand function to be simpler: read data into toybuf  
and then write it to stdout using either fputc(char, stdout) or  
xprintf(%*c, len, ' ') depending on whether it's a tab or something  
else. It checks for tab (trigger the space behavior) and newline (reset  
counters).


What it does _not_ currently do is track spaces advanced separately  
from bytes advanced, that needs the utf8 stuff to grab groups of  
bytes that represent a single character, and to make _that_ work I need  
to copy the logic I just added to wc, which means maybe I should  
genericize it into lib/lib.c somehow? Needs more thought.


This also assumes that all characters are the same width, which is  
probably wrong and I need help with if so. (I dunno how to do  
fontmetrics here?)


I need to catch up on doing the test suite, because I've been testing  
by hand. My scrollback buffer says:


echo -e 'blah\tblah' | ./toybox expand | hexdump -C
echo -e 'blah\tblah' | ./toybox expand -t 11 | hexdump -C
echo -e 'blah\tblah and then some more because\tblah' | \
  ./toybox expand -t 11 | hexdump -C
echo -e 'blah\tblah and then some more because\tblah\n\tand' | \
  ./toybox expand -t 11 | hexdump -C
echo -e 'blah\tblah and then some more because\tblah\n\tand' | \
  ./toybox expand -t 3,11,11 | hexdump -C
echo -e 'blah\tblah and then some more because\tblah\n\tand' | \
  ./toybox expand -t 3,11,22,33 | hexdump -C
echo -e 'blah\tblah and then some more because\tblah\n\tand' | \
  ./toybox expand -t 3,11,22,33,44 | hexdump -C

Possibly I should turn that into an actual automated testy thing.

Sleep time now.

Rob
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] More expand cleanups

2012-11-30 Thread Jonathan Clairembault
 Possibly I should turn that into an actual automated testy thing.


I shared a test script with expand code. Though I forgot to make it
executable. Everything pass except backspace handling with the current
version. Of course it does not test for utf-8 compatibility yet.

Jonathan
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] More expand cleanups

2012-11-30 Thread Felix Janda
On 11/30/12 at 02:51am, Rob Landley wrote:
 I updated wc to theoretically deal with buffer wraps better. In reality  
 I haven't got UTF8 test data to run through this, and should probably  
 find some at some point.

Didn't the original version already deal well with it. (That's what the
strange tests are about.) Notice that the C library keeps an internal
state which is updated each time mbrtowc is called. If there's only part
of a character on the end mbrtowc will return -2, but remember the part.
To continue you just have to read a new toybuf and point mbrtowc to the
_new_ data. (The r in mbrtowc stands for restartable.)

Do the tests still work?

 I redid the actual expand function to be simpler: read data into toybuf  
 and then write it to stdout using either fputc(char, stdout) or  
 xprintf(%*c, len, ' ') depending on whether it's a tab or something  
 else. It checks for tab (trigger the space behavior) and newline (reset  
 counters).
 
 What it does _not_ currently do is track spaces advanced separately  
 from bytes advanced, that needs the utf8 stuff to grab groups of  
 bytes that represent a single character, and to make _that_ work I need  
 to copy the logic I just added to wc, which means maybe I should  
 genericize it into lib/lib.c somehow? Needs more thought.

 This also assumes that all characters are the same width, which is  
 probably wrong and I need help with if so. (I dunno how to do  
 fontmetrics here?)

I think that this depends on the terminal emulator. Look for example at the
-cjk_width option of xterm.

Felix
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net


Re: [Toybox] More expand cleanups

2012-11-28 Thread Jonathan Clairembault
On Wed, Nov 28, 2012 at 8:51 AM, Rob Landley r...@landley.net wrote:

 I don't like adding new types to global headers which are only used in one
 command, so change global tablist to be a void *...

That will be used in unexpand but void * is fine anyway.

 Also, in the help text I'm tempted to just say a comma separated list
 instead of having comma or space awkwardly explained in the paranethetical.
 Yes we should support space, but since expand -n 1 2 3 treats 2 and 3 as
 filenames, you need quotes around it to use spaces as a separator which is
 just awkward. If they're going to read the docs to see how to use it, comma
 is fine. If an existing script already provides spaces, we work anyway...
 Wordsmithing a bit.

Make sense.

 Now let's look at expand_file(). Collate all same declaration types, so just
 one line of ssize_t declarations instead of several... and why are we using
 ssize_t for something that it's a file position offset?

Right, that's a mistake.

 Ah, you're using toybuf already, for file input. Reasonable choice. Ok, I'll
 break the parsing logic out into a function that returns the number of
 entries and takes the array to write into as its argument, and have it not
 write anything to the array if the argument is NULL. Then I can call it
 twice, once to count the number of entries and the second time to fill out
 the array.

 Back to expand_file(). The downside of using readall() is that interactive
 granularity goes way down. I had this problem with tee once upon a time,
 it meant that piping the output of anything through tee made it appear in 4k
 chunks, which meant if you logged the result of a build you couldn't really
 see what the build was doing. I'm not sure expand has the same use cases,
 but that's why I did xread().

Well it seems like gnu/damnit version does buffering as well at least
it does not process input as a line by line basis. I don't see why
using xread changes anything, you probably need fgets here. Though I
think we can safely buffer until someone comes in and raises
interactivity need. wdyt?

 Ah, hang on. Internationalization. This thing is going to need multibyte
 support for utf8, isn't it? (The same general logic as wc -m. Hmmm, I wonder
 if they can share code?)

Ah! I thought toybox was not dealing with internationalization. Though
that's a good thing to have internationalization.


 Ok, I'll have to come back to this in the morning.

 Rob

Jonathan
___
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net