Re: [Toybox] More expand cleanups
On 12/01/12 at 12:42am, Rob Landley wrote: On 11/30/2012 12:56:06 PM, Felix Janda wrote: On 11/30/12 at 02:51am, Rob Landley wrote: I updated wc to theoretically deal with buffer wraps better. In reality I haven't got UTF8 test data to run through this, and should probably find some at some point. Didn't the original version already deal well with it. (That's what the strange tests are about.) Notice that the C library keeps an internal state which is updated each time mbrtowc is called. If there's only part of a character on the end mbrtowc will return -2, but remember the part. To continue you just have to read a new toybuf and point mbrtowc to the _new_ data. (The r in mbrtowc stands for restartable.) Huh, I didn't know that was what it was doing. Let me go back to your code then. Do the tests still work? Yes, but nothing in the test suite tests a buffer wrap. Still, if libc already handles this I'm happy to let it do so. The $(seq 1 8192)s are exactly in order to produce something bigger than sizeof(toybuf). The first test uses a string with the first byte a normal char and then 8192 double byte chars. By some luck the test still produces the right results with your version. If you build a debug statement to detect invalid chars into wc, you will see that mbrtowc encounters an invalid byte sequence although the input is valid. (As you have written in your commit message invalid byte sequences are ignored in both versions.) This also assumes that all characters are the same width, which is probably wrong and I need help with if so. (I dunno how to do fontmetrics here?) I think that this depends on the terminal emulator. Look for example at the -cjk_width option of xterm. I mean can UTF-8 produce characters that are different sizes in the terminal? If the answer is no then i don't have to worry. If it can produce double-wide characters or similar, I'd have to deal with that. Yes, look at that option of xterm. Depending on whether it is set CJK characters take 1 or 2 columns in xterm. I wouldn't try to detect this... I know the terminal can run right to left, but that's symmetrical and I don't have to care from a programming standpoint, I think... Felix Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] More expand cleanups
On 11/28/2012 03:34:59 AM, Jonathan Clairembault wrote: Back to expand_file(). The downside of using readall() is that interactive granularity goes way down. I had this problem with tee once upon a time, it meant that piping the output of anything through tee made it appear in 4k chunks, which meant if you logged the result of a build you couldn't really see what the build was doing. I'm not sure expand has the same use cases, but that's why I did xread(). Well it seems like gnu/damnit version does buffering as well at least it does not process input as a line by line basis. I don't see why using xread changes anything, you probably need fgets here. Though I think we can safely buffer until someone comes in and raises interactivity need. wdyt? I was thinking more along the lines of letting fputc() write data into the stdio.h buffer and letting that worry about when to flush it, and then we don't have to keep track of two positions. Ah, hang on. Internationalization. This thing is going to need multibyte support for utf8, isn't it? (The same general logic as wc -m. Hmmm, I wonder if they can share code?) Ah! I thought toybox was not dealing with internationalization. Though that's a good thing to have internationalization. I'm not doing full internationalization with date formats and having sort come up with different orders depending on locale, but UTF8 support is worth doing (with a top level config symbol, a bit like floating point support). Ok, I'll have to come back to this in the morning. And it is... no longer morning! (We'll ignore the two missed days in there.) I updated wc to theoretically deal with buffer wraps better. In reality I haven't got UTF8 test data to run through this, and should probably find some at some point. I redid the actual expand function to be simpler: read data into toybuf and then write it to stdout using either fputc(char, stdout) or xprintf(%*c, len, ' ') depending on whether it's a tab or something else. It checks for tab (trigger the space behavior) and newline (reset counters). What it does _not_ currently do is track spaces advanced separately from bytes advanced, that needs the utf8 stuff to grab groups of bytes that represent a single character, and to make _that_ work I need to copy the logic I just added to wc, which means maybe I should genericize it into lib/lib.c somehow? Needs more thought. This also assumes that all characters are the same width, which is probably wrong and I need help with if so. (I dunno how to do fontmetrics here?) I need to catch up on doing the test suite, because I've been testing by hand. My scrollback buffer says: echo -e 'blah\tblah' | ./toybox expand | hexdump -C echo -e 'blah\tblah' | ./toybox expand -t 11 | hexdump -C echo -e 'blah\tblah and then some more because\tblah' | \ ./toybox expand -t 11 | hexdump -C echo -e 'blah\tblah and then some more because\tblah\n\tand' | \ ./toybox expand -t 11 | hexdump -C echo -e 'blah\tblah and then some more because\tblah\n\tand' | \ ./toybox expand -t 3,11,11 | hexdump -C echo -e 'blah\tblah and then some more because\tblah\n\tand' | \ ./toybox expand -t 3,11,22,33 | hexdump -C echo -e 'blah\tblah and then some more because\tblah\n\tand' | \ ./toybox expand -t 3,11,22,33,44 | hexdump -C Possibly I should turn that into an actual automated testy thing. Sleep time now. Rob ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] More expand cleanups
Possibly I should turn that into an actual automated testy thing. I shared a test script with expand code. Though I forgot to make it executable. Everything pass except backspace handling with the current version. Of course it does not test for utf-8 compatibility yet. Jonathan ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] More expand cleanups
On 11/30/12 at 02:51am, Rob Landley wrote: I updated wc to theoretically deal with buffer wraps better. In reality I haven't got UTF8 test data to run through this, and should probably find some at some point. Didn't the original version already deal well with it. (That's what the strange tests are about.) Notice that the C library keeps an internal state which is updated each time mbrtowc is called. If there's only part of a character on the end mbrtowc will return -2, but remember the part. To continue you just have to read a new toybuf and point mbrtowc to the _new_ data. (The r in mbrtowc stands for restartable.) Do the tests still work? I redid the actual expand function to be simpler: read data into toybuf and then write it to stdout using either fputc(char, stdout) or xprintf(%*c, len, ' ') depending on whether it's a tab or something else. It checks for tab (trigger the space behavior) and newline (reset counters). What it does _not_ currently do is track spaces advanced separately from bytes advanced, that needs the utf8 stuff to grab groups of bytes that represent a single character, and to make _that_ work I need to copy the logic I just added to wc, which means maybe I should genericize it into lib/lib.c somehow? Needs more thought. This also assumes that all characters are the same width, which is probably wrong and I need help with if so. (I dunno how to do fontmetrics here?) I think that this depends on the terminal emulator. Look for example at the -cjk_width option of xterm. Felix ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net
Re: [Toybox] More expand cleanups
On Wed, Nov 28, 2012 at 8:51 AM, Rob Landley r...@landley.net wrote: I don't like adding new types to global headers which are only used in one command, so change global tablist to be a void *... That will be used in unexpand but void * is fine anyway. Also, in the help text I'm tempted to just say a comma separated list instead of having comma or space awkwardly explained in the paranethetical. Yes we should support space, but since expand -n 1 2 3 treats 2 and 3 as filenames, you need quotes around it to use spaces as a separator which is just awkward. If they're going to read the docs to see how to use it, comma is fine. If an existing script already provides spaces, we work anyway... Wordsmithing a bit. Make sense. Now let's look at expand_file(). Collate all same declaration types, so just one line of ssize_t declarations instead of several... and why are we using ssize_t for something that it's a file position offset? Right, that's a mistake. Ah, you're using toybuf already, for file input. Reasonable choice. Ok, I'll break the parsing logic out into a function that returns the number of entries and takes the array to write into as its argument, and have it not write anything to the array if the argument is NULL. Then I can call it twice, once to count the number of entries and the second time to fill out the array. Back to expand_file(). The downside of using readall() is that interactive granularity goes way down. I had this problem with tee once upon a time, it meant that piping the output of anything through tee made it appear in 4k chunks, which meant if you logged the result of a build you couldn't really see what the build was doing. I'm not sure expand has the same use cases, but that's why I did xread(). Well it seems like gnu/damnit version does buffering as well at least it does not process input as a line by line basis. I don't see why using xread changes anything, you probably need fgets here. Though I think we can safely buffer until someone comes in and raises interactivity need. wdyt? Ah, hang on. Internationalization. This thing is going to need multibyte support for utf8, isn't it? (The same general logic as wc -m. Hmmm, I wonder if they can share code?) Ah! I thought toybox was not dealing with internationalization. Though that's a good thing to have internationalization. Ok, I'll have to come back to this in the morning. Rob Jonathan ___ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net