Re: O_NONBLOCK: should I blame git or ssh?

2019-09-09 Thread Jeff King
On Sun, Sep 08, 2019 at 02:18:15PM +, Douglas Graham wrote:

> When I collected that strace output, I had stdout redirected to a pipe to my
> workaround program, but I did not redirect stderr.  So ssh made stdout 
> non-blocking,
> but since stderr was still connected to my terminal, it didn't touch that. 
> But when
> this build is run under Jenkins, both stdout and stderr are connected to a 
> pipe that
> Jenkins creates to collect output from the build. I assume that when git runs 
> ssh, it
> does not redirect ssh's stderr to its own pipe, it only redirects stdout. So 
> I think
> ssh will be messing with both pipes when this build is run under Jenkins.

OK, that makes sense.

> Now that I have a fairly good understanding of what's happening, I think I 
> can work
> around this occasional error by redirecting git's stderr to a file or 
> something like
> that, but it's taken us a long time to figure this out, so I wonder if a more 
> permanent
> fix shouldn't be implement, so others don't run into the same problem.  A 
> google for
> "make: write error" indicates that we're not the first to have this problem 
> with
> parallel builds, although in the other cases I've found, a specific version 
> of the
> Linux kernel was being blamed.  Maybe that was a different problem.
> 
> I guess git could workaround this by redirecting stderr, but the real problem 
> is probably
> with ssh, although it's not clear to me what it should do differently. It 
> does some
> somehow backwards to me that that it only makes a descriptor non-blocking if 
> it doesn't
> refer to a TTY, but it does the same thing in at least three different places 
> so I guess
> that's  not a mistake.

Where would git redirect the stderr to? We definitely want to make sure
it goes to our original stderr, since it can have useful content for the
user to see. We could make a new pipe and then pump the output back to
our original stderr. But besides being complex, that fools the
downstream programs about whether stderr is a tty (I don't know whether
ssh cares, but certainly git itself uses that to decide on some elements
of the output, mostly progress meters).

So I think it would make more sense to talk to ssh folks about why this
momentary O_NONBLOCK setting happens, and if it can be avoided.

-Peff


RE: O_NONBLOCK: should I blame git or ssh?

2019-09-09 Thread Douglas Graham
I think I've just figured out why my workaround works.  When ssh sets O_NONBLOCK
on stdout, it's setting it on the write side of the pipe that git created.  
Nothing
else writes to that pipe, so this does not cause a problem.  Presumably, ssh 
itself
is able to deal with writing to a non-blocking "file description". When ssh sets
O_NONBLOCK on stderr, it's setting it on the pipe created by Jenkins.  But 
that's
the same pipe that my workaround program does the select() on, so my workaround
can compensate. When I run "make | workaround", the write side of *that* pipe is
never made non-blocking, and that's the one that make itself writes to.

Ok, I think all mysteries are solved, but I'm still not sure what the moral of
the story is. This seems like it could lead to some really difficult to debug
problems, but it's not clear to me what the right long-term fix should be. For
now, we'll just leave our workaround in place and think about tweaking our
makefile to redirect git's stderr.



RE: O_NONBLOCK: should I blame git or ssh?

2019-09-08 Thread Douglas Graham
Thanks for giving this some thought.

> I don't think this can directly be the culprit, because that ssh's stdout 
> will be
> hooked to a pipe talking to Git, not to the original stdout of "git fetch". It
> should not have even received a descriptor that is a copy of the original
> stdout (nor stdin), since those would have been closed as part of the
> fork+exec.
> 
> The child ssh _does_ have access to the original stderr, which could plausibly
> be a dup of the original stdout. But your strace shows ssh setting the flag
> only for stdin/stdout.

I wondered about that too.  I also wondered why we only have this problem
when doing builds with Jenkins.  The same error has never happened when doing
builds manually as far as I know.  However, stracing the build while it is
running under Jenkins is difficult, so my strace output is from a manual run.
It turns out that ssh only sets non-blocking mode on a descriptor if that
descriptor does not refer to a TTY.  The code in function ssh_session2_open()
looks like:

if (stdin_null_flag) {
in = open(_PATH_DEVNULL, O_RDONLY);
} else {
in = dup(STDIN_FILENO);
}
out = dup(STDOUT_FILENO);
err = dup(STDERR_FILENO);

/* enable nonblocking unless tty */
if (!isatty(in))
set_nonblock(in);
if (!isatty(out))
set_nonblock(out);
if (!isatty(err))
set_nonblock(err);

When I collected that strace output, I had stdout redirected to a pipe to my
workaround program, but I did not redirect stderr.  So ssh made stdout 
non-blocking,
but since stderr was still connected to my terminal, it didn't touch that. But 
when
this build is run under Jenkins, both stdout and stderr are connected to a pipe 
that
Jenkins creates to collect output from the build. I assume that when git runs 
ssh, it
does not redirect ssh's stderr to its own pipe, it only redirects stdout. So I 
think
ssh will be messing with both pipes when this build is run under Jenkins.

Now that I have a fairly good understanding of what's happening, I think I can 
work
around this occasional error by redirecting git's stderr to a file or something 
like
that, but it's taken us a long time to figure this out, so I wonder if a more 
permanent
fix shouldn't be implement, so others don't run into the same problem.  A 
google for
"make: write error" indicates that we're not the first to have this problem with
parallel builds, although in the other cases I've found, a specific version of 
the
Linux kernel was being blamed.  Maybe that was a different problem.

I guess git could workaround this by redirecting stderr, but the real problem 
is probably
with ssh, although it's not clear to me what it should do differently. It does 
some
somehow backwards to me that that it only makes a descriptor non-blocking if it 
doesn't
refer to a TTY, but it does the same thing in at least three different places 
so I guess
that's  not a mistake.


Re: O_NONBLOCK: should I blame git or ssh?

2019-09-08 Thread Jeff King
On Sat, Sep 07, 2019 at 11:50:41PM +, Douglas Graham wrote:

> We have a parallel build that occasionally fails with the error message
> "make: write error".  Make prints that error message as it is exiting when
> it detects that it has seen errors while writing to stdout.  The error it
> is enountering is an EAGAIN error, which implies that something has made
> its stdout non-blocking.  As far as I've been able to tell so far, this is
> occurring while make is running the command "git fetch --quiet --tags".
> Once that command finishes, stdout goes back to being blocking but since
> this is a parallel build, make is doing other work while this git command
> is running, and may attempt to write to stdout during that time.

As I understand it, the mechanism you describe here for your symptoms is
plausible, because O_NONBLOCK is not attached to a particular
descriptor, but a POSIX "file description". So a sub-process setting
flags on its stdout can impact the parent process.

But...

> By stracing this git command, I can see it running subcommand
> 
> ssh -p 29418 user@gerrit.domain "git-upload-pack '/repo'"

I don't think this can directly be the culprit, because that ssh's
stdout will be hooked to a pipe talking to Git, not to the original
stdout of "git fetch". It should not have even received a descriptor
that is a copy of the original stdout (nor stdin), since those would
have been closed as part of the fork+exec.

The child ssh _does_ have access to the original stderr, which could
plausibly be a dup of the original stdout. But your strace shows ssh
setting the flag only for stdin/stdout.

> Should this be considered a git bug or an ssh bug or something else?

I'm not convinced that what you've found is the root cause of your
problem. But even if it is, I don't think you can fault Git. We never
touch the descriptor flags ourselves, and ssh is a black box to us.

> I thought I had finally figured out exactly what is happening but while 
> writing this,
> now I'm not sure why my workaround appears to be working.  My workaround is 
> to pipe
> make's stdout into a simple program that reads make's output and writes it to 
> where
> make uses to write to, except it does a select() on descriptor 1 before 
> writing, and
> it makes sure to handle short counts.  But now I'm thinking that if it's the 
> ssh started
> indirectly by make that is messing with O_NONBLOCK, presumably it would be 
> messing
> with O_NONBLOCK on the write side of the pipe that make writes to, so make 
> should
> still be encountering EAGAIN errors. And yet my workaround does seem to work.

It's possible that it helps because your pipe-reading program is
(almost) always trying to read, and thus "make" on the writing end of
the pipe never sees EAGAIN. As long as your program reads the output
before the pipe buffer fills up, you'd be fine. Where did the output go
in your original? If to a regular file, I'd expect O_NONBLOCK to have no
effect anyway.

-Peff