On 05/08/2016 01:06 PM, Andy Chu wrote:
> On Fri, May 6, 2016 at 9:11 PM, Rob Landley <r...@landley.net> wrote:
>> (The end in sight for _busybox_ in my own use cases is next up on my
>> todo list. Really not looking forward to implementing awk, but it's
>> gotta be done...)
>
> I'm curious what research you've done on awk?

Well, way back when, I tried to make sense of busybox's awk
implementation, which is around 3000 lines of C. More recently, I read
about half the posix awk description and dug up a copy of the original
"The AWK Programming Language" book by Aho, Kernighan, and Weinberger
from 1988, which I've read the introduction of.

> From my research, it seems like a significantly easier problem than the
> shell.

Yeah, probably. The shell is actually about 30 commands integrated
together, several of which are literal commands and several of which are
implicit: "expand this environment variable", which could be $RANDOM or
$PPID or $SECONDS, which actually invokes a function but you still need
to support ${#SECONDS} or ${RANDOM:1:3}. Some of them literally are
external commands, where [ is "test" and : is "true" ("help" I already
did, and I already did ulimit). $(( )) is kinda like expr but not
quite, "trap" is its own thing, "read" is actually fairly elaborate,
then there's command history navigation (and "history expansion", which
I've never personally used)...

Job control is a whole subsystem (and pipes and redirection are
integrated into that; you suspend a _pipeline_ not a process, and kill
needs job control integration when run from toysh). $PWD != abspath,
although it looks like getcwd() returns what we need there, but I need
to adjust cd to strip directory entries instead of actually traversing
the filesystem's ".." entries (but only for _leading_ .. in the path, I
think? Need to test).

There's all that loop and test logic, shell functions and alias,
pushd/popd/dirs. I have NEVER understood what the "getopts" command is
for but need to try again, and don't get me STARTED on the dozens of
different things "set" does, let alone "set -o"...

> Without interactive parsing and a completion system, it's
> probably 2-3x simpler, and if you account for that, it's probably 5x
> simpler.

A) I need to do _both_. B) The shell I use extensively on a regularish
basis.
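
The cd adjustment mentioned above (stripping directory entries for ".."
rather than traversing the filesystem) could be sketched roughly as
follows. This is a minimal Python illustration, not toybox code: the
name logical_cd is hypothetical, and it strips every ".." lexically,
whereas the text above leaves open whether only leading ".." should be
handled that way.

```python
def logical_cd(pwd: str, target: str) -> str:
    """Return the new logical $PWD after `cd target`.

    "." and ".." are resolved purely by editing the path string;
    the filesystem is never consulted (hypothetical helper).
    """
    # An absolute target replaces $PWD; a relative one is appended to it.
    path = target if target.startswith("/") else pwd + "/" + target

    parts = []
    for entry in path.split("/"):
        if entry in ("", "."):
            continue            # skip empty components (from "//") and "."
        if entry == "..":
            if parts:
                parts.pop()     # strip the previous entry instead of following ".."
            continue            # ".." at the root stays at the root
        parts.append(entry)
    return "/" + "/".join(parts)
```

With a made-up example where /home/rob/link is a symlink to /mnt/data,
logical_cd("/home/rob/link", "..") gives "/home/rob", while physically
following ".." from inside the symlink target would land in /mnt.
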
Awk I just pipe data into '{print $5}' and that's literally all I ever
use it for.

> One thing that I didn't realize is that Ubuntu and Debian use mawk
> instead of gawk as their default awk. So I assume all their package
> building scripts run with mawk? That's good because mawk is a lot
> smaller than gawk.

Everything I've tried works ok with busybox awk. Back when I maintained
that, there was a responsive awk developer who would fix stuff if I made
puppy eyes at them about a specific test case, and once I got it to
support all the linux from scratch packages, that turned out to be
everything anybody ever actually used, that I've noticed since.

> And I think Aboriginal Linux runs with busybox awk? That's also good
> because busybox awk is much smaller than mawk!
>
> I took a peek at 4 implementations:
>
> - gawk - GPLv3 - 66K lines + 14K lines of extensions. Yacc grammar.
> (This has a C extension interface, profiler and debugger, a somewhat
> ugly networking library built-in, etc.)
>
> - mawk (updated 2015) - GPLv2 - 21K lines. Yacc grammar. (It's
> supposed to be fast because it's based on a byte-code interpreter
> rather than walking a tree?)
>
> - busybox awk - GPLv2 - ~3300 lines in editors/awk.c, though it's not
> clear to me how much library code is used. It includes xregex.h
> although also uses libc regexec(). Hand-written parser.
>
> - Kernighan Awk (updated 2012) - 8K lines. Lucent BSD? license. Yacc
> grammar.
>
> (Some of the line counts may be a bit off because I didn't really
> tease out the source parse.y file vs the generated .c and .h files)
>
> All of them use Yacc except busybox, which isn't that surprising
> because I heard Kernighan say that Yacc was foundational in developing
> awk. They designed the language with it.

I have that youtube video bookmarked on my phone. (The "Computerphile"
channel interview with Kernighan, if I recall...)

> Busybox awk is impressively small.
> I thought you said there was a lot of hairy awk in binutils or
> something, so I'm guessing that all runs under busybox awk?

It didn't when I started looking, but by the end of my maintainership
I'd run out of test cases that broke it, yes.

> I'm guessing it's not possible for toybox to borrow code from it
> because of the license,

Correct.

> but I wonder about the Lucent license.

I don't: we use a public domain equivalent license, that isn't.

> The lexer is 582 lines of clean looking C code (it's Kernighan, so I guess
> we all know his style :) ), which is not insignificant!

I'm not adding yacc as a build dependency.

> Andy

Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net