Re: [Toybox] awk seen in the wild

Rob Landley Mon, 18 Jul 2016 12:25:11 -0700

On 07/17/2016 02:08 PM, Andy Chu wrote:
> On Sun, Jul 10, 2016 at 10:28 AM, Rob Landley <[email protected]> wrote:
>> Awk's better than bc.
> 
> That's interesting... I had no idea bc was a language with functions and 
> loops!


Neither did I until I tried to implement it.

> https://github.com/jck/822kernel/blob/master/kernel/time/timeconst.bc
> 
> This is the problem with DSLs... shell, make, awk, and presumably bc
> all started out as very specific languages, for different purposes.
> Over time, they all grew a C-like imperative language.  And nobody
> wants to remember 3 or 4 different syntaxes for that:
> 
> f() { echo hi $1; }
> f "bob"
> 
> function f(name) { print "hi" name }
> f("bob")
> 
> define f
> hi %1
> end
> $(call f,"bob")
> 
> (And repeat this mess for every other construct in a language...)
>   
> It does seem that if you rule out Python/Perl,

And ruby, php, java, javascript, tcl, lithp, go, swift, rust... A friend
of mine was doing a job programming haskell a few years ago. I follow
somebody on twitter maintaining a snobol compiler. Microsoft created C#
because Java hadn't been invented there (and visual basic got taken out
by internal politics during
http://www.joelonsoftware.com/articles/APIWar.html), back under OS/2
there was something called Rexx, Apple had AppleScript...

Awk is in posix and actually gets used. Heck, the linux kernel top level
Makefile has:

  $(Q)$(AWK) '!x[$$0]++' $(vmlinux-dirs:%=$(objtree)/%/modules.order) > $(objtree)/modules.order
  $(Q)$(AWK) '!x[$$0]++' $^ > $(objtree)/modules.builtin
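
For anyone who hasn't met that one-liner: it drops duplicate lines while keeping the first occurrence of each in its original order (the doubled $$ is just make's escape for a literal $). A minimal sketch outside of make; the dedup_lines wrapper name and the sample input are made up for illustration:

```shell
# dedup_lines is a hypothetical wrapper name, not something from the kernel tree.
# x[$0]++ yields the count seen so far (0 the first time) and then increments it,
# so !x[$0]++ is true only on the first sighting of a line. A true pattern with
# no action block prints the line.
dedup_lines() {
  awk '!x[$0]++' "$@"
}

# Sample input invented for the demo: the repeated line is printed once.
printf 'kernel/a.ko\nkernel/b.ko\nkernel/a.ko\n' | dedup_lines
```

Unlike sort -u this leaves the input order alone.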

And of course:
  $ find . -name "*.awk"
  ./tools/perf/util/intel-pt-decoder/gen-insn-attr-x86.awk
  ./tools/perf/arch/x86/tests/gen-insn-x86-dat.awk
  ./tools/objtool/arch/x86/insn/gen-insn-attr-x86.awk
  ./arch/x86/tools/gen-insn-attr-x86.awk
  ./arch/x86/tools/distill.awk
  ./arch/x86/tools/chkobjdump.awk
  ./Documentation/arm/Samsung/clksrc-change-registers.awk
  ./lib/raid6/unroll.awk
  ./net/wireless/genregdb.awk

> awk is the winner with
> respect to code generation, based on the fact that a lot of C code
> uses it (many shells, Android, FreeBSD, etc.)

And half of autoconf, apparently.

> And I agree with the idea of minimizing build dependencies.
> 
> However, I did a bunch of research and hacking on Kernighan's Awk.  I
> was trying to morph it into a "proper" modern language.

Another one?

Why?

Presumably you wouldn't remove anything significant from the base
language, since that would break compatibility with existing awk
scripts, so your reaction to awk was "how could I fork this to make it
bigger"?

This feels like a variant of https://xkcd.com/927/ somehow.

> For example,
> you could imagine writing "ls" or "xargs" or even a shell in Awk, sort
> of like the idea to write tools in Lua.

awk can readlink()? (ls -l needs it.)

The lua thing fell apart trying to write mount, ifconfig, netcat,
losetup, nsenter, ionice, chroot, swapon, setsid, insmod, taskset,
dmesg... The language just didn't have the bindings.

(Then again java 1.1 didn't have any way to truncate a file until I
reported the lack to a guy named Mark English and he added it to 1.2.
Languages get usable when they get used. Most code has to be broken in.)

> But then I ran into some big limitations, like you can't return
> associative arrays from a function, or pass or return functions
> to/from functions.  Awk looks very similar to JavaScript -- C syntax
> with associative arrays, but is semantically much more limited.

There are an awful lot of scripting languages.
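
On the array limitation quoted above: an awk function can only return a scalar, but array arguments are passed by reference, so the usual workaround is to have the caller hand in an array for the function to fill. A small sketch of that pattern; the names count_words, tally and seen are invented for the example:

```shell
# count_words and tally are hypothetical names used only for this sketch.
count_words() {
  awk '
    # Extra parameters after the gap (n, i, words) are the awk idiom for locals.
    function tally(line, counts,    n, i, words) {
      n = split(line, words)              # split on whitespace into words[1..n]
      for (i = 1; i <= n; i++)
        counts[words[i]]++                # updates the array owned by the caller
      return n                            # only a scalar can be returned
    }
    BEGIN { split("", seen) }             # make sure seen starts as an empty array
    { total += tally($0, seen) }
    END { printf "%d words, a=%d b=%d\n", total, seen["a"], seen["b"] }
  '
}

printf 'a b a\nb a\n' | count_words    # prints: 5 words, a=3 b=2
```

That covers "return an array" after a fashion. It does nothing for passing functions around, which plain POSIX awk has no way to do.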

> I lost interest in awk because of these limitations.  awk is used, but
> seems to be waning, and it's not really evolving.  (But I haven't lost
> interest in the shell.)

Linus Torvalds recently said (https://lwn.net/Articles/687916/):

  Yeah, I know, I should have used 'awk' for this. Sue me. It's been
  too long since I did awk state machines. There's a reason there's a
  "git grep" but not a "git awk" command.

One of the reasons some of these tools fall out of use is the
documentation for them is terrible. (Gnu man pages often point to "info"
pages nobody will EVER read and which sometimes _still_ aren't online.)
I'm trying to write --help text for each command that's sufficient to
learn to use the command just from that. It's not easy and I'm not happy
with a lot of the results (terse vs complete vs easy to read, pick 2).
Working on it...

It would be nice if there were some youtube clips on "an introduction to
sed", "an introduction to awk", and so on. I might wind up doing them
someday if nobody else beats me to it.

> I did however automate and slightly rewrite Kernighan's EXTENSIVE test
> suite here, which AFAICT is not in the other Git reconstructions:
> 
> https://www.cs.princeton.edu/~bwk/btl.mirror/
> 
> https://github.com/onetrueawk/awk

The git account isn't the test suite, it seems to be the same
https://www.cs.princeton.edu/~bwk/btl.mirror/awk.tar.gz source file from
the first page. Commit history is two commits, last touched 2012.

Did you post the automated version anyway?

> I think you mentioned you were looking for an awk test suite.  Well
> there it is -- there are hundreds or thousands of test cases,
> including for the regex language.

Which is provided by libc.

Let's see... https://www.cs.princeton.edu/~bwk/btl.mirror/awktest.a is
an ar archive, ar x awktest.a gives a directory full of files,
README.TESTS says REGRESS controls the testing process, running that does...

  $ sh ./REGRESS
  Linux driftwood 4.2.0-38-generic #45~14.04.1-Ubuntu SMP Thu Jun 9
   09:27:51 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  echo compiled
  oldawk=awk, awk=../a.out
  ./REGRESS: 11: ./REGRESS: Compare.t: not found
  167 tests

  ./REGRESS: 14: ./REGRESS: Compare.p: not found
  58 tests

  ./REGRESS: 17: ./REGRESS: Compare.T: not found
  252 tests

  ./REGRESS: 20: ./REGRESS: Compare.tt: not found
  21 tests

Right, maybe I'll dig into this later but it's not obvious to me how to
get it to work.
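
One guess, not checked against the actual archive: "Compare.t: not found" is what a shell prints when a script calls a helper by bare name and the current directory isn't in $PATH. The sketch below reproduces that with stand-in files in a temporary directory (the file contents are invented, only the names echo the output above), so it shows the mechanism rather than the real test suite:

```shell
# Stand-in files only: these are NOT the real REGRESS or Compare.t scripts.
demo=$(mktemp -d)
printf '#!/bin/sh\necho "helper ran"\n' > "$demo/Compare.t"
chmod +x "$demo/Compare.t"
printf 'Compare.t\n' > "$demo/REGRESS"    # driver calls the helper by bare name

# Without "." in PATH the shell cannot find the helper next to the driver.
run_without_dot() { (cd "$demo" && sh ./REGRESS 2>&1); }

# Prepending "." to PATH for this one invocation lets the lookup succeed.
run_with_dot() { (cd "$demo" && PATH=".:$PATH" sh ./REGRESS 2>&1); }

run_without_dot || true    # reports Compare.t as not found
run_with_dot               # runs the helper
```

If that assumption holds for the real thing, PATH=".:$PATH" sh ./REGRESS from inside the extracted directory would be the first thing to try. The temporary directory can be removed afterwards with rm -rf "$demo".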

> I actually ran it under LLVM
> sanitizers (ASAN/MSAN/etc.), just as I did for toybox, and it revealed
> the expected C coding bugs, in this code being maintained by one
> person for 30 years... (BTW you never responded to my last message
> about that)

My laptop rebooted during txlf and I lost my open windows. I have a todo
item to look at your test suite suggestions, but when I glanced at the
start of it, it was things like adding "only run these tests as root"
guards to some files which are part of any testing triage, so I just
started doing test suite triage until I ran out of time that day, and
haven't gotten back to it yet...

> I will publish the combined repo at some point, and if anyone has a
> burning need I can accelerate that.

I'm interested.

> I should make a blog post at some
> point, demonstrating the sanitizers on old C code... though
> unfortunately writing about code takes just as much time as coding
> itself.

Often more. :)

> And I didn't actually fix any of the memory problems that I
> found, as I did for toybox, since I don't have any plans for that code
> in the future.

I tried to link to your March 7 2016 email about the sed -f segfault and
found out the mailing list archive is down again. Has it been a year
already? (Answer: no, it's just been 7 months since
http://landley.net/toybox/#12-21-2015 .)

I wonder if Dreamhost will delete another chunk of history restoring
from a stale backup? (My actual WEBSITE is still up. But of course they
won't let me run mailman on that.)

> The bottom line is that LLVM sanitizers are mandatory if you care
> about bugs...

You said the sed -f thing was "literally the first thing you tried" and
was found with a fuzzer.

The other thing you found outside of pending was commit c73947814aab
(which was a thinko on my part, I was trusting the -1/2 to be zero, but
was testing <= not = so it still went through the loop body then), which
I can't find your submission email for (might have been on irc?) so I
dunno how you found it.

The other stuff you patched was in pending so hadn't BEEN reviewed.

> nobody is careful enough (even Kernighan, with his
> astoundingly thorough tests, much more thorough than toybox!).  toybox
> wget, tar, and crypto/compression libraries especially need this,
> because they process untrusted input.

Feel free to run it. I've never had much interest in false positive
generators myself.

> The other point about Kernighan's Awk is that if I were building
> something like Aboriginal Linux, I would just use that for now, and
> put toybox awk at low priority.  Elliot showed me that the Android NDK
> actually uses a copy of Kernighan's Awk and not the system awk for its
> builds.

Understood.

Busybox had awk back when I maintained it so I'd get comparisons if I
didn't have one in toybox, and I need it for dependency reduction for my
mythical "four packages" goal, but I need make for that too and that's
not even in the 1.0 goal list. :)

> I get why you don't like GNU stuff.  But Kernighan's Awk is like 7
> files of pure ANSI C, POSIX yacc, POSIX makefile, etc.

Closing the circle means you have yacc as a dependency. (Although he
mitigates it by shipping the yacc output...)

> that builds
> anywhere.  Kernighan also expanded the yacc to c, so you don't need
> yacc as a build dependency... that is a little "unprincipled" but I
> think fine given that awk is changing so slowly and will likely not
> need any maintenance.

It's pretty common to ship generated files for prerequisites you don't
want to demand from the end user. Everybody and their dog has a
./configure produced by autoconf, the Linux kernel has
scripts/kconfig/zconfig*_shipped and so on.

And of course Android's toybox git has the generated/ directory checked
in. :)

> Andy

Rob
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
