I've been banging on dd, and here's the first section of my notes:

I'm uncomfortable having a command repeat the same information in
multiple places. In this case, something like C_SEEK is #defined, that
define is used in an array, then it's used in a case statement during
option parsing, and in an if statement performing the actual work. So if
you add or remove an option, you have to do so in four places.

We actually need the information in _maybe_ two places, and even that's
technically an optimization by not having the option parsing inside the
loop doing the actual work, thus cacheing the results of command line
option parsing. In theory there's common infrastructure to handle the
command line option parsing (lib/args.c), but it doesn't handle
keyword=value arguments which is why we have to implement something
different here. Given how common keyword=value arguments are, maybe we
can factor some of that out into lib in the future. We can _definitely_
factor out the "comma separated list" part (conv=), which is also needed
by "mount -o", but let's implement the potentially common infrastructure
locally for now, and then peel off into lib/lib.c later.

So: I want to massage this code to remove the need for us to #define
information before we can use it, and to make at least some of the new
option parsing infrastructure reusable by other commands.

The size suffices are also kind of weird weird. We already have atolx()
which is parsing suffixes, but these are _different_ suffixes, and it's
got a "d" option to do decimal instead of binary units (so multibyte
suffixes). It's using "c" is 1, "w" is 2, "b" is 512 (presumably meaning
disk block size, except it's 4096 these days)... It has mega and giga
but not tera? I guess you don't want to malloc a terabyte any time soon,
but that's fairly domain specific and might change within 10 years.

Is this really what posix specifies?

Nope: posix specifies "k" and "b", and that's it. This seems to have
come from gnu, and busybox seems to have copied gnu, and this is just
implementing what gnu did. Except busybox added "swab" last year, so the
set of options this command is implementing is basically a snapshot of
what busybox was doing at some point in the past.

(Why did busybox implement swab? The mailing list has somebody just
submitting it out of the blue without actually explaining their use case
for it. I'd think this option was on the obsolete side: the "nuxi"
problem was fairly specific to 16 bit systems, 32 bit systems have the
"xinu" problem instead. :)

Alright, back to the size suffix issue: trying to extend atolx() doesn't
seem helpful, that's returning "long" and this is "long long" (different
on 32 bit systems, still relevant for about the rest of this decade).
The suffixes are different, with at least one outright conflict (b=block
vs b=byte). The decimal sizes is sort of interesting but nobody's
actually _asked_ for it.

So yeah, we probably need a new function in to dd to do this parsing,
but we can at least move the table into the function. And we can use the
"one\0two\0three\0" trick to avoid an array of pointers (8 bytes each on
64 bit systems) to the actual string data, and have the names and values
both local variables. (I think locality of reference trumps collating
'em, but that's a judgement call bordering on
opinion.)

As for option parsing corner cases... why is it skipping spaces at the
beginning?

  $ dd " count=1"
  dd: unrecognized operand ` count=1'
  $ ~/busybox/busybox/busybox dd "count= 1"
  dd: invalid number ' 1'

In the submitted toybox implementation, "dd count=" with no argument
treats it as 0, both of those say invalid number. What does posix say?
It... doesn't seem to specify...

Wait, posix says number parsing should support:

  Two or more positive decimal numbers (with or without k or b)
  separated by x, specifying the product of the indicated values

What, _really_? Um, wow: gnu accepts it. (Busybox doesn't.) And the
_sad_ part is that posix only says you have to do that for bs, cbs, ibs,
and obs, but not count= (and indeed, gnu won't take it for count).

Right, I need to go closely read what posix actually says for this
command, and at least document our deviations from it...

Meanwhile, attached is a first stab at new parsing functions for dd that
might be reusable in lib. It doesn't actually work yet (I have several
test cases that break it and I'm not sure this snapshot even compiles)
but just FYI on the direction I'm going...

Rob
/* dd.c - data duplicator
 *
 * Copyright 2014 Rob Landley <[email protected]>
 *
 * See http://pubs.opengroup.org/onlinepubs/9699919799/utilities/dd.html
 *
 * Deltas from posix:
 *   no ebcdic (conv=ascii,ebcdic,ibm)
 *   no punchcard records (conv=block,unblock cbs=)
 *   no lcase/ucase (because utf8 and fixed block sizes don't mix)

USE_DD(NEWTOY(dd, 0, TOYFLAG_USR|TOYFLAG_BIN))

config DD
  bool "dd"
  default n
  help
    usage: dd [if|of=FILE] [ibs|obs|bs|count|skip|seek=N] [conv=TYPE[,TYPE...]]

    Read and write data with specified granularity and formatting.

    if=FILE  Read from FILE (stdin)        of=FILE  Write to FILE (stdout)
    bs=N     Read/write block size (512b)  count=N  Stop after copying N blocks
    ibs=N    Read N byte blocks (=bs)      obs=N    Write N byte blocks (=bs)
    skip=N   Skip N input blocks           seek=N   Skip N output blocks

    conv=TYPE one of:
    notrunc  Don't truncate output         noerror Continue after read errors
    sync     Pad blocks with zeros         fsync   Don't exit until output done

    Numbers [N] can end with unit size [cwbkmg] meaning 1, 2, 512, kilo, mega,
    or gigabytes, plus an optional d meaning decimal (base 10) units.
*/

#define FOR_dd
#include "toys.h"

GLOBALS(
  char *arg;
)

// If *a starts with b, advance *a past it and return 1, else return 0;
int strstart(char **a, char *b)
{
  int len = strlen(b), i = !strncmp(*a, b, len);

  if (i) *a += len;

  return i;
}

// A bit like atolx() but does long long and takes different suffixes,
// and we might as well handle the type= while we're at it
static int dd_atolx(char *match, long long *val, int min)
{
  long long obsolete = 1;
  int mul[] = {500, 1000, 1000000, 1000000000, 1, 2, 512, 1<<10, 1<<20, 1<<30},
      i;
  char *arg = TT.arg, *suf, *suffixes = "bd\0kd\0md\0gd\0c\0w\0b\0k\0m\0g";

  if (!strstart(&arg, match) || *(arg++)!='=') return 1;

  // Loop for posix's insane legacy "1x2x3x4" syntax predating $(( )) in shells.
  for (;;) {
    if (!isdigit(*arg)) goto fail;
    *val = strtoll(arg, &arg, 0);

    if (!*arg) break;

    // Loop trying each suffix
    for (suf = suffixes, i = 0; ; i++) {
      int len;

      if (i == ARRAY_LEN(mul)) goto fail;
      if (!strncasecmp(arg, suf, len = strlen(suf))) {
        *val *= mul[i];
        arg += len;

        break;
      }
      suf += strlen(suf) + 1;
    }

    // posix's insane legacy multiplier syntax
    if (*arg == 'x') {
      obsolete *= *val;
      arg++;
    } else if (!*arg) break;
printf("remaining=%s\n", arg);
  }
  *val *= obsolete;
  if (*val >= min) return 0;

  // Matched type but couldn't parse number, die loudly.
fail:
  error_exit("bad %s", TT.arg);
}

// Note: modifies "string" to remove matches (because mount needs that)
void comma_collect(char *string, char *list, unsigned *flags)
{
  for (;;) {
    int no, bit = 0;
    char *ss, *s = string;

    // Skip leading/trailing/duplicate commas
    while (*s == ',') s++;
    if (!*s) {
      *string = 0;
      return;
    }

    no = strstart(&s, "no");
    for (ss = list; *ss; ss += strlen(ss) + 1) {
      char *test = s;

      if (strstart(&test, ss) && (!*test || *test == ',')) {
        bit = 1<<bit;
        if (no) *flags &= ~bit;
        else *flags |= bit;
        s += 1+test-ss;

        break;
      } else bit++;
    }
    if (!*ss) {
      if (s != string) strcpy(string, s);
      string = strchr(string, ',');
      if (!string++) break;
    }
  }
}

void dd_main(void)
{
  long long ibs = 512, obs = 512, skip = 0, seek = 0, count = -1;
  char *ifname = 0, *ofname = 0, *commalist = "swab\0error\0trunc\0sync\0";
  unsigned commas = 6;
  int i, ifd = 0, ofd = 1;

  for (i=0; toys.optargs[i]; i++) {
    char *s = TT.arg = toys.optargs[i];

    if (strstart(&s, "if=")) ifname = s;
    else if (strstart(&s, "of=")) ofname = s;
    else if (strstart(&s, "conv=")) {
      comma_collect(s, commalist, &commas);
      if (*s) goto fail;
    } else if (!dd_atolx("bs", &ibs, 1)) obs = ibs;
    else if (dd_atolx("ibs", &ibs, 1) && dd_atolx("obs", &obs, 1)
      && dd_atolx("skip", &skip, 0) && dd_atolx("seek", &seek, 0)
      && dd_atolx("count", &count, 0))
    {
fail:
      error_exit("bad %s", TT.arg);
    }
  }

  // Defer creating output file until we know there weren't parsing errors
  if (ofname) ofd = xcreate(ofname, O_CREAT|O_WRONLY, 0666);
  if (ifname) ifd = xopen(ifname, O_RDONLY);

  printf("parsed ibs=%lld obs=%lld skip=%lld\n", ibs, obs, skip);
}
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net

Reply via email to