Re: [Toybox] [PATCH] awk -- fixes and cleanups

Rob Landley Fri, 25 Oct 2024 09:45:12 -0700

I was writing this when I got distracted by your new patches, I shouldfinish up and send it...


On 10/18/24 18:08, Ray Gardner wrote:

Is random() not good? Maybe not the best, but what is not remotely about it?
Period about 16*(2**31-1) not enough?


OK, it's not good enough. I don't know how bad random() is aside from
the period, but that is inadequate. I can put in a Bays-Durham shuffle
to fix that. I can also put in my own PRNG in place of random(), but it
seems you would prefer to keep toybox small.

In general if you want real randomness, I ask the kernel for randombits. If you want repeatable prng streams where I can specify a seed toget reproducible behavior, randomly changing the PRNG out from under itis a bad thing.


You seem to be talking about something in between...?

Your patch went from initializing the random number generator from a fraction of
a second to an even second (which is forever in computer timme), and I don't
know why you did that. What's the benefit?

The patch only addresses the seed when no arg is given to srand(). Yes it's
for compatibility,

Compatibility with what?


I should have said compliance -- posix compliance. It's in the spec.
But also compatibility: gawk, nawk, mawk, goawk, busybox awk, all
support it.

To what end? Which of the above two use cases is being supported here?Actual randomness (which the kernel is better at than userspace), orreproducible PRNG sequences from a known seed?

This is the test that tests that srand() is inited to the current
second. srand() returns the previous seed.


And getrandom() doesn't need to be seeded.

(Well, ok, it does but that's not an individual userspace command'sproblem. That's an init script and hardware config question.)

[...]

Patch 3: You've made your input path pretty complicated to fix an issue that
seems to boil down to "gracefully handle short reads". If getdelim() out of libc

[...]

Eh, applied anyway because it's local to a pending/ command but can't say I'm
happy about it.

I'm not too happy with it either. I'd be happy with some help on the entire
input mechanism. But it's complicated.

[...]

I've lost track of "record separator" vs "field separator" vs "line separator" 
here.


There is no line separator.


So newline is hardwired there?

Records and fields. RS is record
separator, FS is field separator. FS can be a regex. Regex RS
is "unspecified" in posix awk, but every awk supports regex RS.
The RS="" case is multiline records. Runs of blank lines separate
records, with leading and trailing blank lines ignored.

The "interactive" awk input is supported by all awks I test (gawk,
nawk, mawk, goawk, busybox awk).

[...]

But now every awk allows RS to be a regex, which is "unspecified" by POSIX.

Have you found anything using that yet? (Did it break a build script?)


I don't know what uses it, but every awk supports it.

I'm surprised the "one true awk" that Android's been building supports afeature that DIDN'T make it into Posix 2024...

When did busybox add this? Hmmm... (Oh the test cases you'll need:busybox commits 5323af7f5180 af1bd0962527 patched up this logic becauseit's corner case city...) Looks like it was there when Glenn copied thisfile into a different directory in 2002 (commit 545106f8db3e) sopresumably they copied it from gnu/dammit in the first place. (For all Iknow their code _was_ from some ancient version of gnu awk, since it wasa GPLv2 project at the time...)

I am trying to
be posix compliant (modulo some locale stuff), but also handling
common extensions and non-posix behavior of other awks.

I've completely re-written the wonky record-reading code to be much
cleaner. Was difficult. The regex RS is used to support the multiline
record feature. Is in the next batch of patches.

I deleted my local cleanups to apply your patches. (It's man.c all overagain...)

[...]

Anyway, here is a fix and a bunch of tests. BTW only toybox awk and a recent
gawk (5.3.0 in my case) pass all the tests.

I'm testing against gawk 5.3.1 now, and it does NOT pass all the
tests any longer. Not sure why it's failing two of the new ones:
"split() utf8" and "split fields utf8". Both do essentially the same
as:

BEGIN{n = split("aβc", a, ""); printf "%d %d", n, length(a);for (e =
1; e <= n; e++) printf " %s %s", e, "(" a[e] ")";print ""}

BEGIN{FS=""; $0 = "aβc"; printf "%d", NF; for (e = 1; e <= NF; e++)
printf " %s %s", e, "(" $e ")"; print ""}

And both of those work in gawk when run directly.

Part of the problem writing tests is figuring out how much to test. Ikeep having test skew running bash through "TEST_HOST=1 make test_sh"although part of the reason is I keep emailing chet and he FIXES stuff.But if there's no standard and no stable reference implementation thetest might not mean anything.

My preferred test case is running as many package builds through it aspossible, since that's the real world data. Over in busybox-land alpinelinux and friends did that, but toybox wasn't ready when that ecosystemlaunched and it's not designed to allow drop-in replacement.

I've been vaguely poking athttps://wiki.debian.org/HelmutGrohne/rebootstrap andhttps://forums.debian.net/viewtopic.php?t=66936 and so on, but thedebian repo isn't exactly designed to let you do this either. (ErikAndersen did it for busybox+uclibc many moons ago, but that's lost tohistory and he never went that far...)


My todo.txt says I should watch:

watch (from https://peertube.debian.social/videos/local?s=1)
  Automating architecture bootstrap:
    https://peertube.debian.social/w/45XroN9CnbYLNLKQH3GD9F
  Building a busybox based debian
    https://peertube.debian.social/w/chzkKrMvEczG7qQyjbMKPr
  Android tools BOF
    https://peertube.debian.social/w/n57xBrJnLZQS1L112cV6wi
  spontaneous cross compiling BOF
    https://peertube.debian.social/w/3kJncnbRoXaCS1MSQqRixT
  Running debian desktop in containers
    https://peertube.debian.social/w/gpVkreH3CUXGfLd45c7ULt

And I did watch the "building a busybox based debian" one but his answerwas basically "smash the repo dependencies, just tell it to extractthese packages into an empty directory with nothing satisfied and thenmanually fix it up with a text editor and some symlinks"...

(I emailed him before the move and he said he had plans to fix thedownload links as part of a software upgrade of the site...?)

When gnu adds a new creepy thing nobody else supports yet, I generally wait for
something to break due to its absence rather than jumping through hoops to chase
their taillights. Right now, busybox awk is used to build rather a lot of
packages, and if they haven't made the decision to implement this weirdness than
packages are likely to continue to support NOT doing that...


I'm not doing anything busybox isn't already doing, except I handle utf-8
and have fewer bugs. The tests busybox can't pass are bugs. It has rather
a lot, I think.

Because as far as I can tell busybox's version might be based off a forkof gnu awk from the 1990's? It's hard to tell _where_ it came from(other than the comment at the top of the file which doesn't say howmuch was original code and how much was repurposed from elsewhere)because the conversion from svn to git seems to have broken this historyat commit 545106f8db3e and the old svn is long gone. I could dig throughold tarballs, and did for https://busybox.net/~landley/forensics.txtwhich doesn't mention awk so it probably wasn't there in 0.25...

Ray


Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net

Re: [Toybox] [PATCH] awk -- fixes and cleanups

Reply via email to