On Mon, Sep 10, 2007 at 03:16:53PM +0100, Darren J Moffat wrote:
> James Carlson wrote:
> > I'm finding that ssh-agent (launched from my .login) just disappears
> > from my system (Solaris Nevada build 71 on amd64) without apparent
> > provocation.
> > 
> > There are no core files generated (I have coreadm set up to put all
> > core files in a central directory, and I have global setid dumps
> > enabled).  There are no log messages generated.
> 
> So we need to "trap" its exit and find out what is causing it to 
> exit "apparently" cleanly.

Something like this D script:

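#pragma D option destructive    /* stop() and system() are destructive actions */

pid$1::exit:entry
{
        stop();                 /* hold the process so ustack() can resolve symbols */
        ustack();
        system("prun %d", pid); /* then resume only the process we stopped */
}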

> It exits on SIGTERM (which is what ssh-agent -k actually does).

And on SIGHUP.  And when the parent process is gone.

Actually, it's more than that.  ssh-agent is brittle in that it's too
willing to exit in the face of misbehaving clients, whereas it should
instead just slam the door^H^H^H^Hsocket on them.

See calls to fatal() in ssh-agent.c
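
To illustrate what "slamming the socket" might look like, here is a minimal
sketch in C.  It is not the code in ssh-agent.c: check_request_len() and
MAX_REQUEST_LEN are hypothetical names, and the 256 KB limit is an assumed
value.  The only thing taken from the agent protocol is the 4-byte big-endian
length prefix on each message.  A bogus length costs the client its
connection, not the agent its life.

```c
#include <stdint.h>
#include <unistd.h>

#define MAX_REQUEST_LEN (256 * 1024)    /* assumed limit, for illustration */

/*
 * Hypothetical helper: validate the 4-byte big-endian length prefix of a
 * client message.  Returns 0 if the client may continue, or -1 after
 * closing the client's socket.  It never exits the process.
 */
static int
check_request_len(int fd, const unsigned char hdr[4])
{
        uint32_t len = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
            ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];

        if (len == 0 || len > MAX_REQUEST_LEN) {
                (void) close(fd);       /* drop only this client */
                return (-1);
        }
        return (0);
}
```

The caller would then remove that client from its poll/select set and carry
on serving everyone else.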

> > It just plain disappears.  The only way I notice it is that subsequent
> > invocations of ssh require me to enter my passphrase to unlock my
> > local identity file.
> 
> Good for you then that you are doing it from .login rather than using 
> gdm because ssh-agent "disappearing" would cause the whole session to 
> logout (since gdm runs the session "under" ssh-agent).
> 
> > It doesn't happen every time -- I can go through a whole work day
> > without it exiting.  It *does* seem to be load-related.  (That is, if
> > I do something that really stresses the system, such as ::findleaks on
> > a full kernel dump, then ssh-agent is more likely to depart.)
> 
> That is strange.

Under load an allocation can fail with ENOMEM, and the buffer code in
$SRC/cmd/ssh/libssh/common/buf*.c responds to that by calling fatal().

So maybe it shouldn't be so strange.

$SRC/cmd/ssh/ssh-keysign/ssh-keysign.c has its own implementation of
fatal(), distinct from the one in libssh, that longjmp()s out of trouble
when it can.

I think we should consider the same approach for ssh-agent.
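
A rough sketch of that idea in C, for illustration only.  This is not the
ssh-keysign code; recover_env, recover_armed, process_request() and
handle_client() are all made-up names.  The point is just that fatal() can
longjmp() back to a per-request recovery point when one is armed, and only
exit when there is nowhere to recover to.

```c
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf recover_env;     /* hypothetical per-request recovery point */
static int recover_armed = 0;   /* nonzero while recover_env is valid */

/* Log the message, then jump back to the request loop if we can. */
static void
fatal(const char *fmt, ...)
{
        va_list ap;

        va_start(ap, fmt);
        (void) vfprintf(stderr, fmt, ap);
        va_end(ap);
        (void) fputc('\n', stderr);

        if (recover_armed) {
                recover_armed = 0;
                longjmp(recover_env, 1);
        }
        exit(255);              /* no recovery point: old behaviour */
}

/* Hypothetical stand-in for real request processing. */
static void
process_request(int req)
{
        if (req < 0)
                fatal("bad request %d", req);
}

/* Returns 0 if the request was handled, -1 if it was abandoned. */
static int
handle_client(int req)
{
        if (setjmp(recover_env) != 0)
                return (-1);    /* arrived here via longjmp() from fatal() */

        recover_armed = 1;
        process_request(req);
        recover_armed = 0;
        return (0);
}
```

With this shape, handle_client(-1) returns -1 and the next request is still
served.  Real code would also have to release whatever the abandoned request
had allocated, which is the hard part of this approach.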

The whole approach of having buffer functions that don't fail but exit
is a very good one for things like ssh(1) and sshd(1M), where exiting is
failing safe, but not so much for ssh-agent, where exiting is not so
safe.

> > Could this be CR 5004146?  I don't see others that are related, and
> > that seems like a stretch.  Is there anything I should be doing to
> > debug this?
> 
> If it is 5004146 you should be able to verify that using DTrace, by looking 
> to see if ssh-agent has died just after getting a request.

5004146 is almost certainly the main issue here.

Nico
-- 
