I encountered something similar a while ago, though I believe it was in 
version 2.2.1. If vcld sent an ssh command at the particular moment when 
sshd was first starting up on the Windows VM, the command could hang and derail 
the entire workflow (image capture, image load, etc.). This hasn’t happened in a 
while, and I believe it was fixed in version 2.3.

Well, at least, there is code in version 2.3 and later that can kill any ssh 
command that runs longer than a set amount of time (by using the ‘timeout’ option 
in the run_ssh_command() function call).
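
To illustrate the general idea of a timeout that kills a hung command, here is a 
minimal Python sketch. It is not the actual run_ssh_command() code from VCL; the 
function name run_with_timeout and its return convention are made up for this 
example.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds):
    """Run a command and kill it if it runs longer than timeout_seconds.

    Returns (exit_status, output). If the command had to be killed,
    returns (None, None) so the caller can tell a hang from a failure.
    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            timeout=timeout_seconds,   # subprocess.run kills the child on expiry
            text=True,
        )
    except subprocess.TimeoutExpired:
        return None, None
    return result.returncode, result.stdout
```

A call such as run_with_timeout(["ssh", "-o", "ConnectTimeout=10", "root@some-vm", 
"hostname"], 60) (with a placeholder host name) would then give up after 60 
seconds instead of blocking the rest of the workflow.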

Are you able to figure out what the ssh command is? Does it vary? (Not all 
commands are sent with timeout values.) If you encounter a hung ssh command, 
you can usually find it by examining the process list on the management node; 
then check whether that call was executed with a timeout value. You may also 
want to verify that the ssh option -o ConnectTimeout=X is part of the command 
passed to the VM.
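
One way to do the process list check is sketched below in Python. It assumes a 
ps that supports the POSIX -o pid=, etime= and args= columns; the function names 
and the age threshold are made up for this example, and it only looks at the 
command line, so it cannot tell whether vcld itself applied a timeout.

```python
import subprocess


def etime_to_seconds(etime):
    """Convert the ps 'etime' format ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:          # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_long_running_ssh(ps_output, min_seconds):
    """Return (pid, elapsed, has_connect_timeout, command) for old ssh clients."""
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)   # pid, etime, full command line
        if len(fields) < 3:
            continue
        pid, etime, args = fields
        program = args.split()[0].rsplit("/", 1)[-1]
        if program != "ssh":           # skip sshd and everything else
            continue
        elapsed = etime_to_seconds(etime)
        if elapsed >= min_seconds:
            hits.append((int(pid), elapsed, "ConnectTimeout=" in args, args))
    return hits


def list_processes():
    """Capture the process list; empty headers suppress the header line."""
    return subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "args="],
        stdout=subprocess.PIPE, text=True, check=True,
    ).stdout
```

Running find_long_running_ssh(list_processes(), 300) on the management node would 
list ssh clients older than five minutes, along with whether each command line 
includes ConnectTimeout.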



Aaron


On Jan 28, 2014, at 4:25 PM, Cameron Mann <[email protected]> wrote:

Hi Aaron,

I haven't seen a case of one becoming unresponsive after running for a while; 
it's always been from the moment they come online.

We're running VCL 2.3.

Cameron


On Tue, Jan 28, 2014 at 12:37 PM, Aaron Coburn <[email protected]> wrote:
Cameron,

When this issue emerges, is it with VMs that have been running for a while and 
then become unresponsive, or are they unresponsive from the moment they come 
online?

Also, which version of the VCL are you using?

Aaron




--
Aaron Coburn
System Administrator / Programmer
Web Services, Amherst College




On Jan 28, 2014, at 1:30 PM, Cameron Mann <[email protected]> wrote:

Hi all,

We've been running into an issue intermittently with sshd on some of our 
Windows images where it appears to be running but stops responding.

Symptoms:
- VM is pingable
- ssh attempts hang, no error message
- packet capture on the VM shows SYN from client, SYN-ACK from sshd, ACK from 
client, then nothing
- sshd.log appears normal
- sshd process does not respond to stop/restart and must be killed manually, 
but starts accepting connections after being started again (full reboot also 
works)
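
The "handshake completes, then nothing" symptom can be checked from the client 
side with a small probe. The Python sketch below relies on the fact that an SSH 
server normally sends an identification line starting with "SSH-" right after 
the TCP connection is established; the function name, the result labels and the 
timeout are made up for this example.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=10.0):
    """Return ('ok', banner), ('no_banner', None) or ('unreachable', None).

    'no_banner' means the TCP connection was accepted but no SSH
    identification line arrived before the timeout (or the peer sent
    something else), which matches the hang described above.
    """
    try:
        # The timeout also applies to the recv() call below.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable", None
    with sock:
        try:
            data = sock.recv(256)
        except socket.timeout:
            return "no_banner", None
        except OSError:
            return "unreachable", None
    # A server may send other lines first, so scan what was received.
    for line in data.splitlines():
        if line.startswith(b"SSH-"):
            return "ok", line.decode("ascii", "replace")
    return "no_banner", None
```

A VM that answers ping but returns 'no_banner' would be in the state described 
here, as opposed to one where sshd is simply not listening.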

There's no apparent pattern between the failures that I've been able to find; 
even with the same image, the failure doesn't happen reliably. I also haven't 
been able to isolate the problem to a specific subset of our images, so I 
haven't been able to compare a broken installation with a working one. I've 
also tried updating all the Cygwin packages and re-running the 
cygwin-sshd-config.sh script, which made no difference.

Has anyone run into something similar?

Thanks,
Cameron


