Re: [OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)

2017-05-17 Thread Joerg Schilling
Stephane Chazelas  wrote:

> 2017-05-16 17:29:13 +0100, Stephane Chazelas:

...
> szsh a 0.763
> szsh b 0.622
> ksh93 a 0.667
> ksh93 b 0.464
> yash a 0.738
> yash b 0.429
>
> In mksh, printf is not built-in which doesn't help. In all but
> ksh93, that still does 5 forks because of the $(set +o).
>
> dash is the only one that manages to be quicker (not if I use
> mawk instead of gawk though).


dash is one of the slowest shells. It appears to be fast because it does 
not support multi-byte characters, which disqualifies it for use in a normal 
UNIX environment.

I made benchmarks with various shells and I know that in the Bourne Shell, 30% 
of the total CPU time is spent in the multibyte -> wide -> multibyte 
conversions that are needed to support stateful encodings.

As a result, I would expect dash to be slower than bash if it were 
enhanced to support multi-byte characters.

Here are my results:

dash a 0,833
dash b 0,684
bash a 1,839
bash b 1,136
bosh a 1,441
bosh b 0,852
mksh a 3,601
mksh b 1,034
szsh a 1,791
szsh b 1,533
ksh a 2,970
ksh b 0,935
ksh93 a 1,303
ksh93 b 1,176
yash a 1,809
yash b 0,892

dash a 2,384
dash b 0,035
bash a 13,491
bash b 0,099
bosh a 4,400
bosh b 0,049
mksh a 20,927
mksh b 0,040
szsh a 4,771
szsh b 0,047
ksh a 13,634
ksh b 0,061
ksh93 a 1,318
./b: line 2: 21650: Terminated
ksh93 b 26,434
yash a 7,036
yash b 0,068

Here "ksh" is ksh88.

For some reason, the awk based example does not terminate with ksh93 in your 
second example. I had to kill it.

As you see, if you only look at multi-byte enabled shells, the recent POSIX 
compliant Bourne Shell (bosh) and ksh93 are the fastest.

Both ksh93 and bosh avoid forks and try to use vfork instead of fork whenever 
possible. Note that on a platform with a real vfork implementation (Solaris), 
vfork is approx. 3x faster than fork even though fork on Solaris is a 
copy-on-write fork already.

Linux does not implement a real vfork, as it just gives you the disadvantages 
of vfork and the shared data, while on Solaris, the vfork child borrows the 
address space description from the parent.

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'



Re: [OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)

2017-05-17 Thread Robert Elz
Date:        Tue, 16 May 2017 18:46:39 +0100
From:        Stephane Chazelas
Message-ID:  <20170516174639.ga19...@chaz.gmail.com>

  | Actually, even for a handful of arguments, and even with gawk,
  | it seems it's generally quicker to use awk in my tests:

One can make benchmarks that produce whatever result one wants,
of course. Here's mine, for what I expect to be a more life-like
application than simply quoting zillions of strings, all of which
just happen to contain multiple single quote characters, by accident...

. "$1"  # this supplies the definition of "quote" and no more

for file in *
do
        x=$(quote "$file")
done

Of course, while that is using "quote" the way I imagined it being
used, the real work of a real-life script is missing (what we actually
do with x once we have it, and everything else that happens) - but since
we're only interested in benchmarking quote implementations, not the "real"
part of the "real" application, and really, not the shell either, that's
fine I think.

This test means that we're not just measuring the startup time of the
shell concerned (by running the shell over and over.)

Steffen (and others) can decide if my test, or yours, is more relevant
given the use he had in mind originally.

I am running this with the script below.  It is run by bash, since I
don't believe the "time" reserved word and TIMEFORMAT are POSIX.  I ran it
3 times in three different directories.

printf "In a directory with %s files...\n\n" $(ls | wc -l)

for shell in sh dash bash fbsh mksh zsh yash
do
        for script in $(ls /tmp/T)
        do
                TIMEFORMAT="Shell $shell using script $script:  %U"
                (time "$shell" /tmp/timer "/tmp/T/${script}") 2>&1 |
                        expand -t 40
        done
        printf '\n'
done

/tmp/timer is the first script I included above - though I also tested
a version of that which checked the results (by expanding $x and
verifying that the result was the same as $file - just to check that
the quote functions worked correctly .. both did.)
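
A minimal sketch of such a checking variant, for illustration only (the
version actually used was not posted).  The quote function here is a
hypothetical stand-in so the example runs on its own; in the real test it
comes from the sourced file.  The helper name roundtrip_ok and the
variables y, _q_rest, _q_out and _q_head are made up for this sketch.

```shell
# Hypothetical stand-in for the function under test (normally: . "$1").
quote() {
        _q_rest=$1 _q_out=
        while :; do
                case $_q_rest in
                *\'*)   _q_head=${_q_rest%%\'*}         # text before the first '
                        _q_rest=${_q_rest#*\'}          # text after it
                        _q_out="$_q_out$_q_head'\\''" ;;
                *)      break ;;
                esac
        done
        printf "'%s'" "$_q_out$_q_rest"
}

# Succeeds only if evaluating the quoted form gives back the original string.
roundtrip_ok() {
        x=$(quote "$1")
        eval "y=$x"
        [ "$y" = "$1" ]
}

for file in *
do
        roundtrip_ok "$file" || printf 'quote failed for: %s\n' "$file" >&2
done
```

The eval is safe here only on the assumption that quote produces a single,
correctly quoted shell word, which is exactly the property being checked.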

There is (or should be, no idea what the mailer will do) a tab before %U
in TIMEFORMAT; that and the expand are just to make the results look pretty...
This doesn't affect the results at all, nor does the subshell, and
while there is probably a better way to make pretty results (without
yet another awk script!), this worked well enough.

In that list of shells, "sh" is (a quite old version of) the NetBSD shell,
"fbsh" is the FreeBSD shell (about a year old as are most of the others)
and the rest you recognise. No ksh93 in there, as (for reasons not relevant
here) I don't have that one installed on the system I'm running this test,
and I don't have most of the others installed on a system where I do have
ksh93.

There are just 2 scripts in /tmp/T, yours (exactly as you gave it in the
last message) and mine (modified from the version I had, which dealt with
leading/trailing ' chars by processing them specially, though that version
was never posted here, instead now using Jilles' better solution to that
problem .. though his generates even more redundant quotes in the cases
it makes a difference - for a leading ' in the input, I would have
output a leading \' in the result, now we get ''\' instead.  Both are
correct of course.)
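
To illustrate the two forms for an input with a leading ' (the sample
string 'abc and the variable names a and b are made up for this example),
both spellings expand to the same value:

```shell
a=\''abc'       # compact form: \' followed by 'abc'
b=''\''abc'     # simpler form: an empty '' pair, then \' and 'abc'

# Both hold the four characters 'abc
[ "$a" = "$b" ] && printf '%s\n' "$a"
```

The extra '' pair in the second form quotes nothing; it is only there
because always emitting an opening quote keeps the implementation simple.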

Incidentally, to refer to another point you made, by "redundant quotes"
I did not really mean ones that actually quote something (though the example
I gave was not a good one) but extra '' pairs inserted into the output,
just because it is convenient for the implementation.

They accomplish nothing, hurt nothing (except my eye-balls), and add a
little extra processing time for the shell when it processes the result,
(more chars to deal with) but really don't matter one way or the other.

I'll include the two quote functions used, for completeness, at the end.

I did try a version of mine, modified in a non-posix way to use "local",
rather than explicit saving and restoring (etc) but that only makes any
difference at all when the arg to quote contains a single-quote char, which
is likely very rare, so there was no real difference between the results
for it, and the posix version of my script, it just cluttered the output,
so I removed that one (even though if I were actually using this function
that would be the version I'd use -- all sane shells support "local")
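
A rough sketch of what a "local"-based quote could look like, for
illustration only.  The version mentioned above was never posted, so this
is not it: it assumes a shell that accepts "local" (not POSIX), and it
swaps the IFS splitting for a parameter-expansion loop so that no option
state has to be saved.  The names rest, out and head are made up.

```shell
quote() {
        case $1 in
        *\'*)   ;;                              # contains a ': slow path below
        *)      printf "'%s'" "$1"; return 0 ;; # common case: nothing to escape
        esac

        local rest out head     # not POSIX, but widely supported
        rest=$1 out=
        while :; do
                case $rest in
                *\'*)   head=${rest%%\'*}       # text before the first '
                        rest=${rest#*\'}        # text after it
                        out="$out$head'\\''" ;; # close, escaped ', reopen
                *)      break ;;
                esac
        done
        printf "'%s'" "$out$rest"
}
```

With "local" there is nothing to restore or unset at the end, which is the
clutter the explicit save/restore version has to carry.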

Now we can expect that most directories will not contain files with '
chars in their names, in fact, there are most likely none - but this is
also what is most likely to occur in any real life application.  We
have to deal with the possibility of it occurring, and handle it properly,
but we do not have to expect it and optimise for it.

I also realise that you could optimise your function in the same way I
did mine (from the first version) to deal with that case, in which case
the results from these tests would probably show almost the 

[OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)

2017-05-16 Thread Stephane Chazelas
2017-05-16 17:29:13 +0100, Stephane Chazelas:
[...]
> >   | Here, I'd fire awk and quote more than one arg at a time:
> > 
> > Hmm - you're really aiming for maximum sluggishness...  I could beat that
> > by just adding a couple of sleeps ...
> 
> Depends. If quoting only a handful of arguments, then that call
> to awk might cost you a couple of milliseconds indeed. But
> if processing thousands, you might find that it saves a few
> seconds.

Actually, even for a handful of arguments, and even with gawk,
it seems it's generally quicker to use awk in my tests:

With 5 arguments (where "a" uses your quote() and "b" uses mine, see below):

(zsh syntax below)

$ szsh() (exec -a sh zsh "$@")
$ for shell (dash bash mksh szsh ksh93 yash) for file (a b) (TIMEFMT="$shell $file %*E"; time (repeat 100 $shell ./$file "a'b'c"{1..5}) > /dev/null)
dash a 0.329
dash b 0.407
bash a 0.942
bash b 0.528
mksh a 1.598
mksh b 0.540
szsh a 0.763
szsh b 0.622
ksh93 a 0.667
ksh93 b 0.464
yash a 0.738
yash b 0.429

In mksh, printf is not built-in which doesn't help. In all but
ksh93, that still does 5 forks because of the $(set +o).

dash is the only one that manages to be quicker (not if I use
mawk instead of gawk though).
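
For reference, a minimal sketch of the option save/restore idiom that
quote() in script "a" relies on.  The variable name saved_opts is made up
for this example.  The command substitution on the first line is the part
that costs a fork per call in the shells that fork for it:

```shell
saved_opts=$(set +o)    # prints commands that recreate the current options
set -f                  # change something, e.g. disable globbing

# ... work that depends on the changed option goes here ...

eval "$saved_opts"      # put every option back the way it was
```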

For 3000 arguments, that's where we see the real advantage of using a real
programming language instead of inadequate features of a shell :-b:

$ for shell (dash bash mksh szsh ksh93 yash) for file (a b) (TIMEFMT="$shell $file %*E"; time ($shell ./$file "a'b'c"{1..3000}) > /dev/null)
dash a 0.827
dash b 0.019
bash a 7.712
bash b 0.080
mksh a 9.928
mksh b 0.022
szsh a 2.274
szsh b 0.028
ksh93 a 1.184
ksh93 b 0.022
yash a 2.655
yash b 0.035



the scripts under test:



$ cat a
quote() {
        case "$1" in
        *\'*)   ;;   # the harder case, we will get to below.
        *)      printf "'%s'" "$1"; return 0;;
        esac

        _save_IFS="${IFS}"      # if possible just make IFS "local"
        _save_OPTS="$(set +o)"  # quotes there not really needed.
        IFS=\'
        set -f
        set -- $1
        _result_="${1}"         # we know at least $1 and $2 exist, as there
        shift                   # was one quote in the input.

        for __arg__
        do
                _result_="${_result_}'\\''${__arg__}"
        done
        printf "'%s'" "${_result_}"

        # now clean up

        IFS="${_save_IFS}"      # none of this is needed with a good "local"...
        eval "${_save_OPTS}"
        unset __arg__ _result_ _save_IFS _save_OPTS

        return 0;
}
s=$(
  for i do
    quote "$i"; printf ' '
  done
)
printf '%s\n' "$s"

$ cat b
quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
    function quote(s) {
      gsub(q, q b q q, s)
      return q s q
    }
    BEGIN {
      sep = ""
      for (i = 1; i < ARGC; i++) {
        printf "%s", sep quote(ARGV[i])
        sep = " "
      }
      if (sep) print ""
    }' "$@"
}
s=$(quote "$@")
printf '%s\n' "$s"

-- 
Stephane