I encountered something similar a while ago, though I believe it was in version 2.2.1. Basically, if vcld sent an ssh command at a particular moment while sshd was first starting up on the Windows VM, the command could hang and derail the entire workflow (image capture, image load, etc.). This hasn't happened in a while, and I believe it was fixed in version 2.3.
Well, at least there is code in version 2.3 and later that can kill any ssh command that exceeds a certain length of time (by using the 'timeout' option in the run_ssh_command() function call).

Are you able to figure out what the ssh command is? Does it vary? (Not all commands are sent with timeout values.) If you encounter a hung ssh command, you can usually find it by examining the process list on the management node; then make sure that the call was executed with a timeout value. You may also want to verify that the ssh option -o ConnectTimeout=X is part of the command passed to the VM.

Aaron

On Jan 28, 2014, at 4:25 PM, Cameron Mann wrote:

Hi Aaron,

I haven't seen a case of one becoming unresponsive after running for a while; it's always been from the moment they come online. We're running VCL 2.3.

Cameron

On Tue, Jan 28, 2014 at 12:37 PM, Aaron Coburn wrote:

Cameron,

When this issue emerges, is it with VMs that have been running for a while and then become unresponsive, or are they unresponsive from the moment they come online? Also, which version of VCL are you using?

Aaron

--
Aaron Coburn
System Administrator / Programmer
Web Services, Amherst College

On Jan 28, 2014, at 1:30 PM, Cameron Mann wrote:

Hi all,

We've been running into an issue intermittently with sshd on some of our Windows images where it appears to be running but stops responding.
Symptoms:
- VM is pingable
- ssh attempts hang with no error message
- packet capture on the VM shows SYN from the client, SYN-ACK from sshd, ACK from the client, then nothing
- sshd.log appears normal
- the sshd process does not respond to stop/restart and must be killed manually, but it starts accepting connections after being started again (a full reboot also works)

There's no apparent pattern between the failures that I've been able to find; even with the same image, the failure doesn't happen reliably. I also haven't been able to isolate the problem to a specific subset of our images, so I haven't been able to compare a broken installation with a working one. I've also tried updating all the Cygwin packages and re-running the cygwin-sshd-config.sh script, which made no difference.

Has anyone run into something similar?

Thanks,
Cameron
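
For anyone following this thread, here is a rough Python sketch of the two checks discussed above: running an ssh command with both -o ConnectTimeout and an overall time limit, and probing for the reported symptom (the TCP handshake completes but sshd never sends its "SSH-" identification line). This is not VCL code; VCL's run_ssh_command() is Perl, and the function names, defaults, and the extra BatchMode option below are assumptions made for illustration only.

```python
import socket
import subprocess


def build_ssh_command(host, remote_command, connect_timeout=10):
    """Return an ssh argv list that includes -o ConnectTimeout (hypothetical helper)."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", "BatchMode=yes",  # assumption: fail rather than prompt for a password
        host,
        remote_command,
    ]


def run_with_timeout(argv, timeout_seconds):
    """Run argv and return its exit status, or None if it exceeded the limit.

    subprocess.run() kills the child process before raising TimeoutExpired,
    so a hung command does not linger in the process list.
    """
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


def ssh_banner(host, port=22, timeout_seconds=5.0):
    """Return sshd's identification line, or None if none arrives in time.

    A None result after a successful connect matches the symptom in the
    packet capture: the connection opens, then the server stays silent.
    """
    with socket.create_connection((host, port), timeout=timeout_seconds) as sock:
        try:
            data = sock.recv(256)
        except socket.timeout:
            return None
    line = data.decode("ascii", errors="replace").strip()
    return line if line.startswith("SSH-") else None
```

A management-node script could call ssh_banner() against a freshly booted VM before sending real commands, and treat None as "sshd is up but wedged". Connection failures such as a refused connection still raise OSError, which keeps "port closed" distinguishable from "port open but silent".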
