Re: type errors, command length limits, and Awk (was: portability of xargs)

2022-02-15 Thread Dan Kegel
FWIW, commandline length limits are a real thing, I've run into them
with Make, CMake, and Meson.
I did some work to help address them in Meson, see e.g.
https://github.com/mesonbuild/meson/issues/7212

And just for fun, here's a vaguely related changelog entry from long
ago, back when things were much worse:

Tue Jun  8 15:24:14 1993  Paul Eggert  (egg...@twinsun.com)

	* inp.c (plan_a): Check that RCS and working files are not the
	same.  This check is needed on hosts that do not report file
	name length limits and have short limits.

:-)



Re: type errors, command length limits, and Awk (was: portability of xargs)

2022-02-15 Thread Mike Frysinger
On 15 Feb 2022 21:17, Jacob Bachmeyer wrote:
> Mike Frysinger wrote:
> > context: https://bugs.gnu.org/53340
> >   
> Looking at the highlighted line in the context:

thanks for getting into the weeds with me

> > >   echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
> It seems that the problem is that am__base_list expects ListOf/File (and 
> produces ChunkedListOf/File) but am__pep3147_tweak emits ListOf/Glob.  
> This works in the usual case because the shell implicitly converts Glob 
> -> ListOf/File and implicitly flattens argument lists, but results in 
> the overall command line being longer than expected if the globs expand 
> to more filenames than expected, as described there.
> 
> It seems that the proper solution to the problem at hand is to have 
> am__pep3147_tweak expand globs itself somehow and thus provide 
> ListOf/File as am__base_list expects.
> 
> Do I misunderstand?  Is there some other use for xargs?

if i did not care about double expansion, this might work.  the pipeline
quoted here handles the arguments correctly (other than whitespace splitting
on the initial input, but that's a much bigger task) before passing them to
the rest of the pipeline.  so the full context:

  echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
  while read files; do \
$(am__uninstall_files_from_dir) || st=$$?; \
  done || exit $$?; \
...
am__uninstall_files_from_dir = { \
  test -z "$$files" \
|| { test ! -d "$$dir" && test ! -f "$$dir" && test ! -r "$$dir"; } \
|| { echo " ( cd '$$dir' && rm -f" $$files ")"; \
 $(am__cd) "$$dir" && rm -f $$files; }; \
  }
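
(for reference, am__base_list isn't shown above; in automake's
lib/am/header-vars.am it is, roughly, a pair of sed scripts that pack
newline-delimited entries into space-separated lines of 40 -- 8 lines at a
time, then 5 of those:

  am__base_list = \
    sed '$$!N;$$!N;$$!N;$$!N;$$!N;$$!N;$$!N;s/\n/ /g' | \
    sed '$$!N;$$!N;$$!N;$$!N;s/\n/ /g'

so each line handed to the while loop holds up to 40 entries.)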

leveraging xargs would allow me to maintain a single shell expansion.
the pathological situation being:
  bar.py
  __pycache__/
    bar.pyc
    bar*.pyc
    bar**.pyc

py_files="bar.py" which turns into "__pycache__/bar*.pyc" by the pipeline,
and then am__uninstall_files_from_dir will expand it when calling `rm -f`.

if the pipeline expanded the glob, it would be:
  __pycache__/bar.pyc __pycache__/bar*.pyc __pycache__/bar**.pyc
and then when calling rm, those would expand a 2nd time.
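
to make that concrete, a hypothetical demo outside of make (file names made
up to match the example above):

  mkdir __pycache__
  touch __pycache__/bar.pyc '__pycache__/bar*.pyc' '__pycache__/bar**.pyc'

  # today: the pipeline emits one glob, and it is expanded exactly once
  # when rm's argument list is built.
  files='__pycache__/bar*.pyc'
  echo rm -f $files    # argv: each of the three files, once

  # if the pipeline pre-expanded the glob, every emitted name containing
  # glob chars would be treated as a pattern a second time here:
  files='__pycache__/bar.pyc __pycache__/bar*.pyc __pycache__/bar**.pyc'
  echo rm -f $files    # argv: the same files matched over and over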

i would have to change how the pipeline outputs the list of files such that
the final subshell could safely consume & expand.  since this is portable
shell, i don't have access to arrays & fancy things like readarray.  if the
pipeline switched to newline delimiting, and i dropped $(am__base_list), i
could use positionals to construct an array and safely expand that.  but i
strongly suspect that it's not going to be as performant, and i might as
well just run `rm` once per file :x.

  echo "$$py_files" | $(am__pep3147_tweak) | \
  ( set --
while read file; do
  set -- "$@" "$file"
  if test $# -ge 40; then
rm -f "$@"
set --
  fi
done
if test $# -gt 0; then
  rm -f "$@"
fi
  )

at which point i've basically rewritten `xargs -n40`, just not as fast :p.
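
for comparison, a hypothetical xargs spelling of the same idea, under the
same assumptions (newline-delimited pipeline output, whitespace in file
names already a lost cause): the subshell performs the one glob expansion,
and xargs execs rm directly, so nothing gets expanded a second time.

  echo "$$py_files" | $(am__pep3147_tweak) | \
  ( while read glob; do
      # expansion #1: turn each glob into literal file names, one per line
      set -- $$glob
      printf '%s\n' "$$@"
    done ) | \
  xargs -n40 rm -f

modulo xargs's own quote/backslash processing of its input, which is part
of the portability question this thread started with.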

> > automake jumps through some hoops to try and limit the length of generated
> > command lines, like deleting output objects in a non-recursive build.  it's
> > not perfect -- it breaks arguments up into 40 at a time (akin to xargs -n40)
> > and assumes that it won't have 40 paths with long enough names to exceed the
> > command line length.  it also has some logic where it's deleting paths by
> > globs, but the process to partition the file list into groups of 40 happens
> > before the glob is expanded, so there are cases where it's 40 globs that can
> > expand into many many more files and then exceed the command line length.
> 
> First, I thought that GNU-ish systems were not supposed to have such 
> arbitrary limits,

one person's "arbitrary limits" is another person's "too small limit" :).
i'm most familiar with Linux, so i'll focus on that.

xargs --show-limits on my Linux-5.15 system says:

  Your environment variables take up 5934 bytes
  POSIX upper limit on argument length (this system): 2089170
  POSIX smallest allowable upper limit on argument length (all systems): 4096
  Maximum length of command we could actually use: 2083236

2MB ain't too shabby.  but if we consult execve(2), it has more details:
https://man7.org/linux/man-pages/man2/execve.2.html
   On Linux prior to kernel 2.6.23, the memory used to store the
   environment and argument strings was limited to 32 pages (defined
   by the kernel constant MAX_ARG_PAGES).  On architectures with a
   4-kB page size, this yields a maximum size of 128 kB.

i've def seen "Argument list too long" errors in Gentoo from a variety of
packages due to this 128 kB limit (which includes the environ strings).
users are very hostile to packages :p.

   On kernel 2.6.23 and later, most architectures support a size
   limit derived from the soft RLIMIT_STACK resource limit (see
   getrlimit(2)) that is in force at the time of the execve() call.
   (Architectures with no memory management unit are excepted: they
   maintain the limit that was in effect before kernel 2.6.23.)
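
you can poke at this from a shell; a hypothetical session on a glibc/Linux
box (assuming the usual 8 MB default stack, and a glibc where
sysconf(_SC_ARG_MAX) is derived from the stack rlimit):

  $ getconf ARG_MAX          # the argv+envp budget for execve()
  2097152
  $ ulimit -s                # soft stack limit, in kB
  8192
  $ # 8192 kB / 4 == 2097152 bytes: ARG_MAX tracks RLIMIT_STACK/4
  $ ( ulimit -s 2048; getconf ARG_MAX )
  524288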