On Mon, Sep 10, 2007 at 03:16:53PM +0100, Darren J Moffat wrote:
> James Carlson wrote:
> > I'm finding that ssh-agent (launched from my .login) just disappears
> > from my system (Solaris Nevada build 71 on amd64) without apparent
> > provocation.
> >
> > There are no core files generated (I have coreadm set up to put all
> > core files in a central directory, and I have global setid dumps
> > enabled).  There are no log messages generated.
>
> So we need to "trap" its exit and find out what is causing it to
> exit "apparently" cleanly.

Something like this D script:

	/*
	 * Pass the ssh-agent pid as the first argument.  stop() and
	 * system() are destructive actions, so dtrace needs -w.
	 */
	pid$1::exit:entry
	{
		stop();
		ustack();
		system("prun `pgrep ssh-agent`");
	}

> It exits on SIGTERM (which is what ssh-agent -k actually does).

And on SIGHUP.  And when the parent process is gone.

Actually, it's more than that.  ssh-agent is brittle in that it's too
willing to exit in the face of misbehaving clients, whereas it should
instead just slam the door^H^H^H^Hsocket on them.  See the calls to
fatal() in ssh-agent.c.

> > It just plain disappears.  The only way I notice it is that subsequent
> > invocations of ssh require me to enter my passphrase to unlock my
> > local identity file.
>
> Good for you then that you are doing it from .login rather than using
> gdm because ssh-agent "disappearing" would cause the whole session to
> logout (since gdm runs the session "under" ssh-agent).
>
> > It doesn't happen every time -- I can go through a whole work day
> > without it exiting.  It *does* seem to be load-related.  (That is, if
> > I do something that really stresses the system, such as ::findleaks on
> > a full kernel dump, then ssh-agent is more likely to depart.)
>
> That is strange.

ENOMEM -> fatal() in $SRC/cmd/ssh/libssh/common/buf*.c.  So maybe it
shouldn't be so strange.

$SRC/cmd/ssh/ssh-keysign/ssh-keysign.c has its own implementation of
fatal(), distinct from the one in libssh, that longjmp()s out of trouble
when it can.  I think we should consider the same approach for
ssh-agent.

The whole approach of having buffer functions that exit rather than
return failure is a very good one for things like ssh(1) and sshd(1M),
where exiting is failing safe, but not so much for ssh-agent, where
exiting is not so safe.

> > Could this be CR 5004146?  I don't see others that are related, and
> > that seems like a stretch.  Is there anything I should be doing to
> > debug this?
>
> If it is 5004146 you should be able to verify, using DTrace, and looking
> to see if ssh-agent has died just after getting a request.

5004146 is almost certainly the main issue here.

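To make the longjmp() idea above a bit more concrete, here is a rough
sketch of what I have in mind.  It is not taken from ssh-agent.c or
ssh-keysign.c: agent_fatal(), process_request(), handle_client() and
the length limit are all hypothetical names and values made up for
illustration.  The point is only that a fatal error raised while
servicing one client unwinds to a per-client recovery point, so the
caller can close that client's socket instead of the whole agent
exiting.

```c
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* All names here are hypothetical; none come from the ssh sources. */
static jmp_buf client_env;	/* recovery point for the current client */
static int client_env_valid;	/* nonzero while a request is in flight */

/*
 * Log the message, then unwind to the per-client recovery point if
 * one is armed.  Only exit when there is nowhere to unwind to.
 */
static void
agent_fatal(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	(void) vfprintf(stderr, fmt, ap);
	va_end(ap);
	(void) fputc('\n', stderr);

	if (client_env_valid)
		longjmp(client_env, 1);
	exit(255);
}

/* Stand-in for request parsing; rejects an implausible length. */
static void
process_request(int len)
{
	if (len < 0 || len > 1024)
		agent_fatal("bad request length %d", len);
}

/*
 * Returns 0 if the request was handled, -1 if the client misbehaved
 * and the caller should drop its socket.
 */
int
handle_client(int len)
{
	if (setjmp(client_env) != 0) {
		/* We got here via longjmp() from agent_fatal(). */
		client_env_valid = 0;
		return (-1);
	}
	client_env_valid = 1;
	process_request(len);
	client_env_valid = 0;
	return (0);
}
```

With this, handle_client(-1) logs the error and returns -1 while the
process keeps running, and a later well-formed request is still
served.  A real change would also have to release whatever the failed
request had allocated before the jump, which this sketch ignores.
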
Nico