Re: posix_spawn (was: Bugs with execline/s6 documentation and skalibs functions using posix_spawn())
On Thu, Jun 29, 2023 at 11:04:33AM +, Laurent Bercot wrote: > Indeed, /dev should work; but using it still makes me queasier than > crafting a nonexistent path. Actually deliberately supplying a directory to posix_spawn*() is a slightly deeper test, as the path exists but is unexecutable. But surely this does not constitute a decisive advantage. > Of course, it doesn't matter for glibc, and it doesn't matter for > s6 which needs fork anyway. And chances are that platforms that > implement posix_spawn() with internals that are *not* fork+exec will > not make it return before the spawning has really succeeded. But still, > it's nice to make sure it can be used wherever it exists. Now I see, thanks... > If you don't like the workaround, nobody's preventing you from using > --with-sysdep-posixspawn=no manually. ;) Yes, and I am keeping my origin solution at work intact :) -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C
Re: posix_spawn (was: Bugs with execline/s6 documentation and skalibs functions using posix_spawn())
Actually I mean a *directory* that is guaranteed to exist (and meanwhile unexecutable): so /dev here. Indeed, /dev should work; but using it still makes me queasier than crafting a nonexistent path. The mkstemp thing works, so, not going to change it to save a couple of syscalls in a configure test. :) Well I was intending to suggest that we simpliy avoided posix_spawn*() where it disagreed with posix_spawn(3p); that is to say simply replacing all previous `#ifdef HASPOSIXSPAWN' conditions with `#if (defined HASPOSIXSPAWN) && (!defined SKALIBS_HASPOSIXSPAWNEARLYRETURN)'. After all it seems to me child_spawn*() is not used that prevalently, so the performance penalty is really minor; of course, feel free to correct me. Yes, falling back to fork+exec when posix_spawn is bad is an option, and I would probably have done just that if I hadn't been pointed to the existence of waitid() to achieve the "test whether a child is dead without reaping it" thing, without which there can be no workaround. But posix_spawn is more than a performance thing. The point of this interface is that its implementation doesn't have to be vfork+exec internally; it was precisely designed to allow spawning processes on nommu machines, where vfork and fork are basically impossible. So, using posix_spawn wherever possible helps with portability as well. Of course, it doesn't matter for glibc, and it doesn't matter for s6 which needs fork anyway. And chances are that platforms that implement posix_spawn() with internals that are *not* fork+exec will not make it return before the spawning has really succeeded. But still, it's nice to make sure it can be used wherever it exists. If you don't like the workaround, nobody's preventing you from using --with-sysdep-posixspawn=no manually. ;) -- Laurent
Re: posix_spawn (was: Bugs with execline/s6 documentation and skalibs functions using posix_spawn())
On Thu, Jun 29, 2023 at 08:21:19AM +, Laurent Bercot wrote: > POSIX doesn't mandate any path other than /dev/null and /dev/console > and I'd rather not try executing them, who knows what weird permissions > they may have on obscure OSes. Actually I mean a *directory* that is guaranteed to exist (and meanwhile unexecutable): so /dev here. > I agree it's a lot of work for not much, but as you said, the > behaviour is arguably conformant, and your experience proves that old > glibcs are still around, so I'd rather make posix_spawn usable where > it exists instead of placing the burden of --with-sysdep-posixspawn=no > on users who have a bad version. Well I was intending to suggest that we simpliy avoided posix_spawn*() where it disagreed with posix_spawn(3p); that is to say simply replacing all previous `#ifdef HASPOSIXSPAWN' conditions with `#if (defined HASPOSIXSPAWN) && (!defined SKALIBS_HASPOSIXSPAWNEARLYRETURN)'. After all it seems to me child_spawn*() is not used that prevalently, so the performance penalty is really minor; of course, feel free to correct me. And I probably also need to bring the incorrectness of posix_spawn(3) to the attention of its (Linux manpages) maintainers. -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C
Re: posix_spawn (was: Bugs with execline/s6 documentation and skalibs functions using posix_spawn())
Fixes pushed to git, thanks! When given an unexecutable path, child_spawn() returns 0, but errno is unset... that's on purpose. Unfortunately, in the parent there is no way to know the child's execve() error code; all we have is the exit status, 127, and we cannot report the reason for the failure. Rather than set errno to something that may be wrong and prompt the caller to take inadequate measures, I'd rather set it to 0, which glibc reports as "success" but really means "no error information" except in a few, well-known contexts; and let the caller deal with the lack of more accurate reporting. I know it's not satisfying, but we can't do any better. I have realised that a simpler unexecutable path can be, for example, /etc (is it mandated in POSIX?); this can save the mkstemp() call in the sysdep test. POSIX doesn't mandate any path other than /dev/null and /dev/console and I'd rather not try executing them, who knows what weird permissions they may have on obscure OSes. It's a sysdep test, it's not performance-critical; I'd rather use mkstemp() to be *sure* we have a path that does not exist. (Of course the user could always race the program, but we're not trying to harden against stupidity here.) (And frankly I personally do not really find it much worthwhile to introduce this amount of complexity for the broken dependency of a quite minor performance optimisation...) I agree it's a lot of work for not much, but as you said, the behaviour is arguably conformant, and your experience proves that old glibcs are still around, so I'd rather make posix_spawn usable where it exists instead of placing the burden of --with-sysdep-posixspawn=no on users who have a bad version. As shown by the qemu bug I linked above, this impacts s6-svscan, which relies on correct child_spawn() reporting when running custom signal handlers, so not working around bad posix_spawn QoI may lead to buggy signal management in s6-svscan, and nobody wants that. A cursory web search appears to say that glibc-2.27 is when they fixed the posix_spawn QoI; 2.17 being bad is consistent with that. But I can't be bothered to go spelunk in glibc code to check and/or bisect, so if someone could confirm, thank you, otherwise, no big deal. -- Laurent
Re: posix_spawn (was: Bugs with execline/s6 documentation and skalibs functions using posix_spawn())
I have realised that a simpler unexecutable path can be, for example, /etc (is it mandated in POSIX?); this can save the mkstemp() call in the sysdep test. (And frankly I personally do not really find it much worthwhile to introduce this amount of complexity for the broken dependency of a quite minor performance optimisation...) -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C
Re: posix_spawn (was: Bugs with execline/s6 documentation and skalibs functions using posix_spawn())
On Wed, Jun 28, 2023 at 09:40:48PM +, Laurent Bercot wrote: > I pushed a workaround to the skalibs git. > Could you please try a build on a machine that exhibits the early > return behaviour and tell me if > - the behaviour is correctly detected by ./configure (the last sysdep) Yes, after changing `attrp' in tryposixspawnearlyreturn.c to `attr'. > - the child_spawn*() family of functions now works properly even on > this machine? Partially, after an additional WEXITED is OR-ed with the options for waitid() in child_spawn_workaround.c. When given an unexecutable path, child_spawn() returns 0, but errno is unset. > Also, can you please tell me what version of glibc these distribution > versions are running? glibc-2.17-317.el7.x86_64, from CentOS 7.9.2009. -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C
posix_spawn (was: Bugs with execline/s6 documentation and skalibs functions using posix_spawn())
I pushed a workaround to the skalibs git. Could you please try a build on a machine that exhibits the early return behaviour and tell me if - the behaviour is correctly detected by ./configure (the last sysdep) - the child_spawn*() family of functions now works properly even on this machine? Also, can you please tell me what version of glibc these distribution versions are running? Thanks! -- Laurent
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
On Wed, Jun 28, 2023 at 01:33:57AM +, Laurent Bercot wrote: > They're not the only possible behaviours: for instance, [1] shows that > under some buggy qemu, posix_spawn() always returns early. That seems to at least inadvertently comply with posix_spawn(3)... -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
Actually I copied the fragment of posix_spawn(3) from a Devuan Chimaera machine, so the problem may be not specific to CentOS 7. I did not test CentOS 6 or other distro (version)s, for example; but on Rocky Linux 8, which I unfortunately also need to support at work, the behaviour is as expected. Attached is a simple test. It may be a bug in some old glibc, then. If we assume posix_spawn(3) and posix_spawn(3p) were the only possible behaviours (which is frankly not that reliable, judging from how neither manpage noted the violation of conformance), then the two behaviours could be distinguished with the attached test. They're not the only possible behaviours: for instance, [1] shows that under some buggy qemu, posix_spawn() always returns early. But that behaviour can also be caught by the same workaround as the glibc behaviour you're observing, so it's fine. Since the bug is more widespread than "one old version of one distro", is visible in production environments used at large, and seems constrained to "posix_spawn succeeds even if exec fails", which is testable, I'll add a sysdep to detect it and a workaround in child_spawn*, but it will mean additional manual --with-sysdep-foobar=blah noise for skalibs cross-builds. [1]: https://skarnet.org/lists/skaware/1658.html -- Laurent
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
On Tue, Jun 27, 2023 at 09:27:17PM +, Laurent Bercot wrote: > Testing the behaviour may be challenging, however, because I suspect > the CentOS 7 implementation of posix_spawn() is just racy, and they > simply documented that they don't care. Actually I copied the fragment of posix_spawn(3) from a Devuan Chimaera machine, so the problem may be not specific to CentOS 7. I did not test CentOS 6 or other distro (version)s, for example; but on Rocky Linux 8, which I unfortunately also need to support at work, the behaviour is as expected. Attached is a simple test. >From the posix_spawn(3), which had the similar statement across the three different versions above, there is also a statement two paragraphs later: "The posix_spawn() and posix_spawnp() functions fail only in the case where the underlying fork(2), vfork(2) or clone(2) call fails". So if the manpage is followed to the word, the CentOS 7 behaviour is expected; if posix_spawn(3p) is followed instead, any exec() error must also be caught and result in returning 0, which we really expect. If we assume posix_spawn(3) and posix_spawn(3p) were the only possible behaviours (which is frankly not that reliable, judging from how neither manpage noted the violation of conformance), then the two behaviours could be distinguished with the attached test. -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C /* ISC license. */ #include #include #include #include #include pid_t child_spawn0 (char const *prog, char const *const *argv, char const *const *envp) { pid_t pid ; posix_spawnattr_t attr ; int e ; e = posix_spawnattr_init() ; if (e) goto err ; e = posix_spawnp(, prog, 0, , (char *const *)argv, (char *const *)envp) ; posix_spawnattr_destroy() ; if (e) goto err ; return pid ; err: errno = e ; return 0 ; } int main (int argc, char const *const *argv, char const *const *envp) { if (argc < 2) return 1 ; printf ("%d\n", child_spawn0(argv[1], [1], envp)) ; return 0 ; }
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
Testing the behaviour may be challenging, however, because I suspect the CentOS 7 implementation of posix_spawn() is just racy, and they simply documented that they don't care. Thinking about it more, I'm afraid it's not a testable behaviour. Not only isn't there any way to force the race since it entirely happens inside a libc function, but also, the test would require running code on the build machine, which doesn't work for cross-builds and people would have to manually set the sysdep anyway. It seems like --with-sysdep-posixspawn=no, as you did, is the easiest workaround. -- Laurent
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
As a more general fix, I think tryposixspawn.c should at least try spawning a probably unexecutable path (like the one above) as well, which corrects the sysdep on systems where the expected conformance is broken. Adding a sysdep to detect that case is a good idea indeed! Rather than pretending it doesn't exist, though, I'd rather add a different sysdep that tests its behaviour, so it can still be used with the proper workaround. Testing the behaviour may be challenging, however, because I suspect the CentOS 7 implementation of posix_spawn() is just racy, and they simply documented that they don't care. -- Laurent
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
On Tue, Jun 27, 2023 at 08:50:35AM +, Laurent Bercot wrote: > I believe the correct setting is actually KillMode=mixed; and the > ExecStop= line is incorrect as well since ExecStop expects a synchronous > command, not an asynchronous one. Better let systemd just send a SIGTERM > to s6-svscan, wait for the supervision tree to exit on its own, and > SIGKILL the stragglers. I pushed a fix accordingly. Yes, I did my change in a haste and did not notice the `mixed' option which is indeed better. As you see, systemd waits for the stragglers so that is at least correct, although the timeout (150s by default on CentOS 7) is really exorbitant. > Yeah, well, tough for non-conforming systems. > That said, I also pushed a change last week that should have fixed > this issue as a side effect, so it's all good. If you feel like it, > you can try the s6-svscan version in the latest s6 git. :) s6-svscan is quite a minor issue; I added .s6-svscan/SIGTERM for all those scandirs which may unfortunately need to run on CentOS 7, as an extra line of defence -- in addition to `--with-sysdep-posixspawn=no' to skalibs. The latter also fixes other programs that involve child_spawn*(), including home-made ones (iotrap, discussed a few days ago on the supervision list) and upstream ones (for instance trap). I personally think `--with-sysdep-posixspawn=no' is the really correct fix on CentOS 7, which fixes silent failures with eg.: #!/bin/execlineb -P trap { SIGHUP { echo test } } /dev/xyzzy As a more general fix, I think tryposixspawn.c should at least try spawning a probably unexecutable path (like the one above) as well, which corrects the sysdep on systems where the expected conformance is broken. -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
* In `trap.html', there is a reference to the removed `timeout' keyword. Fixed. * In `s6-svscan-not-1.html', the systemd unit (traumatic experience with it, as you may easily expect) lacks a `KillMode = process'. I believe the correct setting is actually KillMode=mixed; and the ExecStop= line is incorrect as well since ExecStop expects a synchronous command, not an asynchronous one. Better let systemd just send a SIGTERM to s6-svscan, wait for the supervision tree to exit on its own, and SIGKILL the stragglers. I pushed a fix accordingly. * The child_spawn*() family of functions, depending on using posix_spawn or not, exhibit different behaviours on CentOS 7 (trauma again), as posix_spawnp() may return 0 with argv pointing to unexecutable paths. This, for example, results in s6-svscan not exiting on SIGTERM when .s6-svscan/SIGTERM is absent. The behaviour of posix_spawnp() on CentOS 7 does not conform to posix_spawn(3p), but is documented in posix_spawn(3): "Even when these functions return a success status, the child process may still fail for a plethora of reasons related to its pre-exec() initialization. In addition, the exec(3) may fail." Yeah, well, tough for non-conforming systems. That said, I also pushed a change last week that should have fixed this issue as a side effect, so it's all good. If you feel like it, you can try the s6-svscan version in the latest s6 git. :) > --with-sysdep-devurandom Also fixed. Thanks a lot for these reports! -- Laurent
Re: Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
On Tue, Jun 27, 2023 at 10:16:10AM +0800, Casper Ti. Vector wrote: > * The child_spawn*() family of functions, [...] Well, just found another one when handling the above issue: * `flags.html' from skalibs references `--enable-sysdep-devurandom', which should now be `--with-sysdep-devurandom'. -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C
Bugs with execline/s6 documentation and skalibs functions using posix_spawn()
* In `trap.html', there is a reference to the removed `timeout' keyword. * In `s6-svscan-not-1.html', the systemd unit (traumatic experience with it, as you may easily expect) lacks a `KillMode = process'. * The child_spawn*() family of functions, depending on using posix_spawn or not, exhibit different behaviours on CentOS 7 (trauma again), as posix_spawnp() may return 0 with argv pointing to unexecutable paths. This, for example, results in s6-svscan not exiting on SIGTERM when .s6-svscan/SIGTERM is absent. The behaviour of posix_spawnp() on CentOS 7 does not conform to posix_spawn(3p), but is documented in posix_spawn(3): "Even when these functions return a success status, the child process may still fail for a plethora of reasons related to its pre-exec() initialization. In addition, the exec(3) may fail." -- My current OpenPGP key: RSA4096/0x227E8CAAB7AA186C (expires: 2024.09.30) 7077 7781 B859 5166 AE07 0286 227E 8CAA B7AA 186C