Re: [OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)
Stephane Chazelas wrote:
> 2017-05-16 17:29:13 +0100, Stephane Chazelas:
...
> szsh a 0.763
> szsh b 0.622
> ksh93 a 0.667
> ksh93 b 0.464
> yash a 0.738
> yash b 0.429
>
> In mksh, printf is not built-in, which doesn't help. In all but
> ksh93, that still does 5 forks because of the $(set +o).
>
> dash is the only one that manages to be quicker (not if I use
> mawk instead of gawk though).

dash is one of the slowest shells. It appears to be fast because it does
not support multi-byte characters, which disqualifies it for use in a
normal UNIX environment.

I made benchmarks with various shells, and I know that in the Bourne
Shell, 30% of the total CPU time is spent in the multibyte -> wide ->
multibyte conversions that are needed to support stateful encodings. As
a result, dash could be expected to come out slower than bash if it were
enhanced to support multi-byte characters.

Here are my results:

	dash  a  0,833
	dash  b  0,684
	bash  a  1,839
	bash  b  1,136
	bosh  a  1,441
	bosh  b  0,852
	mksh  a  3,601
	mksh  b  1,034
	szsh  a  1,791
	szsh  b  1,533
	ksh   a  2,970
	ksh   b  0,935
	ksh93 a  1,303
	ksh93 b  1,176
	yash  a  1,809
	yash  b  0,892

	dash  a  2,384
	dash  b  0,035
	bash  a 13,491
	bash  b  0,099
	bosh  a  4,400
	bosh  b  0,049
	mksh  a 20,927
	mksh  b  0,040
	szsh  a  4,771
	szsh  b  0,047
	ksh   a 13,634
	ksh   b  0,061
	ksh93 a  1,318
	./b: line 2: 21650: Terminated
	ksh93 b 26,434
	yash  a  7,036
	yash  b  0,068

ksh88 is ksh... For some reason, the awk based example does not
terminate with ksh93 in your second example. I had to kill it.

As you see, if you only look at multi-byte enabled shells, the recent
POSIX compliant Bourne Shell (bosh) and ksh93 are the fastest. Both
ksh93 and bosh avoid forks and try to use vfork instead of fork
whenever possible. Note that on a platform with a real vfork
implementation (Solaris), vfork is approx. 3x faster than fork, even
though fork on Solaris is already a copy-on-write fork.
Linux does not implement a real vfork: it just gives you the
disadvantages of vfork and the shared data, while on Solaris, the vfork
child borrows the address space description from the parent.

Jörg

--
EMail: jo...@schily.net (home)  Jörg Schilling  D-13353 Berlin
       joerg.schill...@fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/
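[The fork-avoidance point above can be felt from any shell. Below is a
micro-benchmark sketch of my own, not code from the thread: each ( : )
subshell costs one fork (or vfork, where the shell uses it), the builtin
':' costs none. Run each function under time(1) to compare.]

```shell
#!/bin/sh
# Illustrative fork-cost probe (assumed example, shell- and OS-dependent).
forks() {
    i=0
    while [ "$i" -lt 1000 ]; do
        ( : )            # one new process per iteration
        i=$((i + 1))
    done
}
no_forks() {
    i=0
    while [ "$i" -lt 1000 ]; do
        :                # builtin; no process is created
        i=$((i + 1))
    done
}
forks && no_forks && echo done    # prints "done"
```

On a shell that maps subshells to vfork (ksh93, bosh per the message
above), the gap between the two loops narrows accordingly.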
Re: [OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)
Date: Tue, 16 May 2017 18:46:39 +0100
From: Stephane Chazelas
Message-ID: <20170516174639.ga19...@chaz.gmail.com>

  | Actually, even for a handful of arguments, and even with gawk,
  | it seems it's generally quicker to use awk in my tests:

One can make benchmarks that produce whatever result one wants, of
course. Here's mine, for what I expect to be a more life-like
application than simply quoting zillions of strings, all of which just
happen to contain multiple single-quote characters, by accident...

	. "$1"		# this supplies the definition of "quote" and no more

	for file in *
	do
		x=$(quote "$file")
	done

Of course, while that is using "quote" the way I imagined it being used,
the real work of a real-life script is missing (what we actually do with
x once we have it, and everything else that happens) - but since we're
only interested in benchmarking quote implementations, not the "real"
part of the "real" application, and really, not the shell either, that's
fine I think. This test means that we're not just measuring the startup
time of the shell concerned (by running the shell over and over.)

Steffen (and others) can decide if my test, or yours, is more relevant
given the use he had in mind originally.

I am running this (using bash for this, as the "time" reserved word and
TIMEFORMAT are not POSIX, I don't believe) using this script. I ran it 3
times in three different directories.

	printf "In a directory with %s files...\n\n" $(ls | wc -l)

	for shell in sh dash bash fbsh mksh zsh yash
	do
		for script in $(ls /tmp/T)
		do
			TIMEFORMAT="Shell $shell using script $script:	%U"
			(time "$shell" /tmp/timer "/tmp/T/${script}") 2>&1 | expand -t 40
		done
		printf '\n'
	done

/tmp/timer is the first script I included above - though I also tested a
version of that which checked the results (by expanding $x and verifying
that the result was the same as $file - just to check that the quote
functions worked correctly .. both did.)
There is (or should be, no idea what the mailer will do) a tab before %U
in TIMEFORMAT; that and the expand are just to make the results look
pretty. This doesn't affect the results at all, nor does the subshell in
that, and while there is probably a better way to make pretty results
(without yet another awk script!), this worked well enough.

In that list of shells, "sh" is (a quite old version of) the NetBSD
shell, "fbsh" is the FreeBSD shell (about a year old, as are most of the
others) and the rest you recognise. No ksh93 in there, as (for reasons
not relevant here) I don't have that one installed on the system where
I'm running this test, and I don't have most of the others installed on
a system where I do have ksh93.

There are just 2 scripts in /tmp/T: yours (exactly as you gave it in the
last message) and mine (modified from the version I had, which dealt
with leading/trailing ' chars by processing them specially, though that
version was never posted here; instead now using Jilles' better solution
to that problem .. though his generates even more redundant quotes in
the cases where it makes a difference - for a leading ' in the input, I
would have output a leading \' in the result; now we get ''\' instead.
Both are correct of course.)

Incidentally, to refer to another point you made, by "redundant quotes"
I did not really mean ones that actually quote something (though the
example I gave was not a good one) but extra '' pairs inserted into the
output, just because it is convenient for the implementation. They
accomplish nothing, hurt nothing (except my eye-balls), and add a little
extra processing time for the shell when it processes the result (more
chars to deal with), but really don't matter one way or the other.

I'll include the two quote functions used, for completeness, at the end.
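[To make the redundant-quotes point concrete, here is a small
demonstration of my own: both spellings below denote the same one-word
string; the extra '' pair is exactly the harmless redundancy being
discussed.]

```shell
# Two equivalent quotings of the string 'foo (a leading single quote
# followed by "foo"); the second carries a redundant empty '' pair.
printf '%s\n' \''foo'      # prints 'foo
printf '%s\n' ''\''foo'    # prints 'foo
```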
I did try a version of mine, modified in a non-POSIX way to use "local"
rather than explicit saving and restoring (etc), but that only makes any
difference at all when the arg to quote contains a single-quote char,
which is likely very rare, so there was no real difference between the
results for it and the POSIX version of my script; it just cluttered the
output, so I removed that one (even though, if I were actually using
this function, that would be the version I'd use -- all sane shells
support "local").

Now we can expect that most directories will not contain files with '
chars in their names; in fact, there are most likely none - but this is
also what is most likely to occur in any real-life application. We have
to deal with the possibility of it occurring, and handle it properly,
but we do not have to expect it and optimise for it.

I also realise that you could optimise your function in the same way I
did mine (from the first version) to deal with that case, in which case
the results from these tests would probably show almost the
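[The "expand $x and verify it matches $file" check mentioned earlier can
be sketched as below. The quote() here is my own pure-shell
reconstruction using prefix stripping, not kre's actual code; the
check() helper name is made up.]

```shell
#!/bin/sh
# A pure-shell quote() with the fast path for the common no-quote case,
# plus a round-trip check: quote, eval the result back, compare.
quote() {
    case $1 in
    *\'*) ;;                               # hard case handled below
    *) printf "'%s'" "$1"; return 0 ;;     # fast path: no ' in input
    esac
    _r= _s=$1
    while :; do
        case $_s in
        *\'*) _r="$_r${_s%%\'*}'\\''"      # copy up to the ', emit '\''
              _s=${_s#*\'} ;;
        *)    _r="$_r$_s"; break ;;
        esac
    done
    printf "'%s'" "$_r"
    unset _r _s
}

check() {
    orig=$1
    x=$(quote "$orig")
    eval "set -- $x"                       # re-parse the quoted form
    [ "$1" = "$orig" ] && echo OK || echo FAIL
}

check "plain"       # prints OK
check "it's"        # prints OK
check "'a'b'"       # prints OK
```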
[OT] of the merit of using awk for performance or who's got the fastest quote() (Was: sh(1): is roundtripping of the positional parameter stack possible?)
2017-05-16 17:29:13 +0100, Stephane Chazelas:
[...]
> > | Here, I'd fire awk and quote more than one arg at a time:
> >
> > Hmm - you're really aiming for maximum sluggishness... I could beat that
> > by just adding a couple of sleeps ...
>
> Depends. If quoting only a handful of arguments, then that call
> to awk might cost you a couple of milliseconds indeed. But
> if processing thousands, you might find that it saves a few
> seconds.

Actually, even for a handful of arguments, and even with gawk, it seems
it's generally quicker to use awk in my tests:

With 5 arguments (where "a" uses your quote() and "b" uses mine, see
below):

(zsh syntax below)

$ szsh() (exec -a sh zsh "$@")
$ for shell (dash bash mksh szsh ksh93 yash) for file (a b)
    (TIMEFMT="$shell $file %*E"
     time (repeat 100 $shell ./$file "a'b'c"{1..5}) > /dev/null)
dash a 0.329
dash b 0.407
bash a 0.942
bash b 0.528
mksh a 1.598
mksh b 0.540
szsh a 0.763
szsh b 0.622
ksh93 a 0.667
ksh93 b 0.464
yash a 0.738
yash b 0.429

In mksh, printf is not built-in, which doesn't help. In all but ksh93,
that still does 5 forks because of the $(set +o).

dash is the only one that manages to be quicker (not if I use mawk
instead of gawk though).

For 3000 arguments, that's where we see the real advantage of using a
real programming language instead of inadequate features of a shell :-b:

$ for shell (dash bash mksh szsh ksh93 yash) for file (a b)
    (TIMEFMT="$shell $file %*E"
     time ($shell ./$file "a'b'c"{1..3000}) > /dev/null)
dash a 0.827
dash b 0.019
bash a 7.712
bash b 0.080
mksh a 9.928
mksh b 0.022
szsh a 2.274
szsh b 0.028
ksh93 a 1.184
ksh93 b 0.022
yash a 2.655
yash b 0.035

the scripts under test:

$ cat a
quote() {
    case "$1" in
    *\'*) ;;                        # the harder case, we will get to below.
    *) printf "'%s'" "$1"; return 0;;
    esac

    _save_IFS="${IFS}"              # if possible just make IFS "local"
    _save_OPTS="$(set +o)"          # quotes there not really needed.
    IFS=\'
    set -f
    set -- $1
    _result_="${1}"                 # we know at least $1 and $2 exist, as there
    shift                           # was one quote in the input.
    for __arg__
    do
        _result_="${_result_}'\\''${__arg__}"
    done
    printf "'%s'" "${_result_}"

    # now clean up
    IFS="${_save_IFS}"              # none of this is needed with a good "local"...
    eval "${_save_OPTS}"
    unset __arg__ _result_ _save_IFS _save_OPTS
    return 0
}

s=$(
    for i
    do
        quote "$i"; printf ' '
    done
)
printf '%s\n' "$s"

$ cat b
quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
    function quote(s) {
        gsub(q, q b q q, s)
        return q s q
    }
    BEGIN {
        sep = ""
        for (i = 1; i < ARGC; i++) {
            printf "%s", sep quote(ARGV[i])
            sep = " "
        }
        if (sep) print ""
    }' "$@"
}
s=$(quote "$@")
printf '%s\n' "$s"

--
Stephane
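[A self-contained usage sketch of the awk-based quote() from script "b",
with the function copied verbatim and applied to two made-up arguments;
the surrounding driver lines are mine.]

```shell
#!/bin/sh
# The awk-based quote(): gsub turns each embedded single quote into the
# four-character sequence quote backslash quote quote, then each argument
# is wrapped in single quotes; all arguments come out on one line.
quote() {
  LC_ALL=C awk -v q="'" -v b='\\' '
    function quote(s) {
        gsub(q, q b q q, s)
        return q s q
    }
    BEGIN {
        sep = ""
        for (i = 1; i < ARGC; i++) {
            printf "%s", sep quote(ARGV[i])
            sep = " "
        }
        if (sep) print ""
    }' "$@"
}

quote "it's" "a b"    # prints: 'it'\''s' 'a b'
```

The output is directly usable with eval, e.g.
eval "set -- $(quote "$@")" restores the original argument list.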