Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-08 Thread Bernd Petrovitsch via ccache
Hi all!

On 07/02/2021 16:45, Sam Varshavchik via ccache wrote:
[...]
> In the interim I discovered a workaround that seems to work, for
> anyone who's also affected by this. Instead of setting CCACHE_PREFIX
> to distcc, I pointed CCACHE_PREFIX to this:
> 
> #! /bin/sh
> 
> unset MFLAGS ${!MAKE@}
> exec distcc "$@"
> 
> This clears make variables from the environment and when the linker
> invokes make for ...whatever it needs to invoke it for, against a
> manufactured makefile it generates on the fly... it doesn't go look
> for the job server file descriptors, and go sideways.

Hmm, with the whole thread: Shouldn't better the linker (or whatever
tool) cleanup the environment variables before invoking `make` (or
whatever programs) to fix the - IMHO - core cause of the bug?

MfG,
Bernd
-- 
Bernd Petrovitsch  Email : be...@petrovitsch.priv.at
 LUGA : http://www.luga.at

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-08 Thread Sam Varshavchik via ccache
On Mon, Feb 8, 2021 at 2:38 PM Paul Smith  wrote:
>
> On Mon, 2021-02-08 at 10:43 +, Edward Welbourne wrote:
> > Sounds to me like that's a bug: when the descriptors are closed, the
> > part of MAKEFLAGS that claims they're make's jobserver file
> > descriptors should be removed, since that's when the claim stops
> > being true.
>
> I believe there have been other similar issues reported recently.
>
> Certainly fixing MAKEFLAGS when we run without jobserver available is
> something that could be done.
>
> There is a loss of debugging information if we make this change: today
> make can detect if it was invoked in a way that _should_ expect to
> receive a jobserver context, but _didn't_ receive that context.  That
> is, if make sees that jobserver-auth is set but it can't open the
> jobserver pipes it can warn the user that most likely there's a problem
> in their environment or with the setup of their makefiles.
>
> Without this warning there's no way to know when this situation occurs.
>  It's easy to create a situation where every sub-make will create its
> own completely unique jobserver domain.  So you start the top make with
> -j4 and run 4 sub-makes; if you do it wrong then each of 4 sub-makes
> could create a new jobserver domain, and now you're running 16 jobs in
> parallel instead of 4... there's no way for make to warn you about this
> situation.

One thought occurred to me. Specifically: when make executes what it
believes to be something other than a recursive invocation of $(MAKE),
and it closes the job server pipe file descriptors for that, it can
also:

1) Add an additional parameter to MAKEFLAGS, let's call it
"--no-jobserver", and perhaps remove the --jobserver-auth parameter
completely. It might be easier just to append something there, instead
of surgically removing this.

2) Make checks for a --no-jobserver in MAKEFLAGS when it starts. If
it's there it does NOT attempt to validate the file descriptors that
are given in --jobserver-auth (if this parameter is preserved). It's a
given that they're not there:

  if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0)

Don't even do that. What happens right now a warning message gets
printed and make runs without a job server. This change should have
the same result, print the warning but skip the FD_OK tests.

This will result in the same warning, but it should avoid triggering
the bug that I found.

However that might cause a minor regression in LTO linking. I think
that this prevents the LTO linker's internal invocation of make from
finding that it can attach to the original make process's job server.

>From sifting through strace dumps, I see that a linker-invoked make
gets its own -j flag. It appears that the linker is courteous enough
to count how many CPUs it has and use it to construct its own -j flag.

How about this, safe approach: once --no-jobserver is there it stays
there, and gets propagated to all recursively invoked makes. If an
invoke make finds that it has both a --no-jobserver and a -j flag,
it'll warn and refuse to create its own job server, and then proceed
executing one command at a time.

This prevents an arithmetic proliferation of job worker processes if
the original job server's file descriptors get lost. Currently
recursively-invoked makes will find, and attach themselves to, an
existing job server. This is nice; but this is vulnerable to an edge
case that I think I'm hitting: a false positive involving a leaked
file descriptor. This change encourages fixing whatever's causing make
to fail to detect a recursive invocation.

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-08 Thread Sam Varshavchik via ccache
On Sun, Feb 7, 2021 at 7:59 AM Joel Rosdahl  wrote:

> From my perspective, I don't see anything wrong with ccache dup-ing stderr – I
> guess the bug can be triggered by the user running something like
>
>   exec 3>&2
>
> in the shell as well.
>
> I can think of an easy workaround for the problem: only dup2 stderr when 
> running
> the preprocessor or compiler. Done in
> .

In the interim I discovered a workaround that seems to work, for
anyone who's also affected by this. Instead of setting CCACHE_PREFIX
to distcc, I pointed CCACHE_PREFIX to this:

#! /bin/sh

unset MFLAGS ${!MAKE@}
exec distcc "$@"

This clears make variables from the environment and when the linker
invokes make for ...whatever it needs to invoke it for, against a
manufactured makefile it generates on the fly... it doesn't go look
for the job server file descriptors, and go sideways.

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] gmake and ccache conspiring together in creating gremlins

2021-02-07 Thread Joel Rosdahl via ccache
On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache
 wrote:

> I've been chasing down non-deterministic, random build fails.

Thanks for the thorough investigation!

> So, now what? Both ccache, and the linker are contributing to this situation.
> ccache leaks the file descriptor (to distcc apparently?).

Yes, or rather for anyone who wants to use it. Icecream also uses it.

> The linker invokes make for its own use, but it doesn't sanitize the
> environment. make closes the job server file descriptors, but not removing
> them from MAKEFLAGS.

>From my perspective, I don't see anything wrong with ccache dup-ing stderr – I
guess the bug can be triggered by the user running something like

  exec 3>&2

in the shell as well.

I can think of an easy workaround for the problem: only dup2 stderr when running
the preprocessor or compiler. Done in
.

-- Joel

On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache
 wrote:
>
> I've been chasing down non-deterministic, random build fails. make
> randomly blows up with
>
> write error: stdout
>
> A lot of stracing, experimentation, and pains, determined that this is
> due to a sequence of the following events:
>
> 1) how make uses the job server file descriptor
>
> 2) an intentional leak of the standard error file descriptor in ccache
>
> 3) LTO linking results in linker recursively invoking make (that was a
> new one to me)
>
> 4) A rather nuanced behavior of the Linux PTY driver that can be best
> explained with the following short demo:
>
> #include 
> #include 
> #include 
> #include 
> #include 
>
> int main()
> {
> struct stat state_buf;
>
> for (int i=0; i<8; ++i)
> {
> if (fstat(i, &state_buf) == 0)
> {
> printf("%d: %d/%d\n", i,
>(int)state_buf.st_dev,
>(int)state_buf.st_ino);
> }
> }
>
> printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? 
> "y":"n");
>
> fcntl(2, F_SETFL, O_NONBLOCK);
>
> printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? 
> "y":"n");
> return 0;
> }
>
> I'm sshed into my server, and the output of the above is, for me:
>
> 0: 25/3
> 1: 25/3
> 2: 25/3
> NON BLOCKING: n
> NON BLOCKING: y
>
> My stdin, stdout, and stderr is the same dev/ino, and setting non-blocking
> mode one any one of them puts them all in non-blocking mode. I'm setting non-
> blocking on standard error, and my standard input also becomes non-blocking.
>
> After sifting through sources, and straces, this is what I see is happening.
>
> Starting with make, and how it sets up its job file descriptors:
>
> jobserver_parse_auth (const char *auth)
> {
>   /* Given the command-line parameter, parse it.  */
>   if (sscanf (auth, "%d,%d", &job_fds[0], &job_fds[1]) != 2)
>
>   // …
>
>   /* Make sure our pipeline is valid, and (possibly) create a duplicate pipe,
>  that will be closed in the SIGCHLD handler.  If this fails with EBADF,
>  the parent has closed the pipe on us because it didn't think we were a
>  submake.  If so, warn and default to -j1.  */
>
>   if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0)
> }
>
> The TLDR of the above: make reads the job server's file descriptors
> from the MAKEFLAGS environment variable, then checks here if they
> actually exist. If they don't exist, make will create the job server
> pipe. Important: by default they will be file descriptors 3 and 4.
> This becomes a key player in this mystery, a little bit later.
>
> When make spawns a child job (other than a recursive make) the job
> server file descriptors get closed (marked O_CLOEXEC before the actual
> execve), but MAKEFLAGS remains in the environment. They remain open
> for a recursive make call, and the recursive instance picks up the
> ball from here.
>
> The problem with this approach is that if, somehow, a child process
> runs make, and there happens to be a pair of file descriptors here,
> make will assume that they're the job server file descriptors.
>
> Another thing that make does is that it sets one of the file
> descriptors to non-blocking mode. This is in jobserver_setup:
>
>   /* When using pselect() we want the read to be non-blocking.  */
>   set_blocking (job_fds[0], 0);
>
> Now we get to ccache which, amongst things, does the following, in its
> ccache.cpp:
>
> // Make a copy of stderr that will not be cached, so things like distcc can
> // send networking errors to it.
> static void
> set_up_uncached_err()
> {
>   int uncached_fd =
> dup(STDERR_FILENO); // The file descriptor is intentionally leaked.
>   if (uncached_fd == -1) {
> log("dup(2) failed: {}", strerror(errno));
> throw Failure(Statistic::internal_error);
>   }
>
>   Util::setenv("UNCACHED_ERR_FD", fmt::format("{}", uncached_fd));
> }
>
> TLDR: it intentionally lea