[While describing the problem below I concluded it is a bug, because it results in being charged for apparently dead-on-arrival nodes; it looks like WHIRR-378, and if it is the same problem, then WHIRR-378 needs a workaround. The following was added as a comment to WHIRR-378.]

I frequently see net.schmizz.sshj.userauth.UserAuthException when using Whirr 0.6.0 on EC2. There are many possible causes for an SSH error, so I dug into one run and found that 2 out of 5 nodes were seemingly dead on arrival. That is a high rate, which makes me wonder whether these are false-positive DOAs (in another run I saw 2 out of 7). The trimmed whirr.log excerpt below shows (nodes identified by the last 3 digits of their i-number):

1. 3 instances/nodes started (fbe, fc0, fc2) at 3:37:19
2. problem with a node (fc2) at 3:38:46, or 87 sec. after instance start
3. a new instance/node (01c) started at 3:40:14
4. problem with another node (01c) at 3:41:19, or 65 sec. after instance start
5. a new instance/node (040) started at 3:41:22
6. failed nodes (01c, fc2) deleted at 3:44:34
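The 87- and 65-second gaps above can be checked directly against the log timestamps; a quick sketch (the helper and hard-coded times are mine, taken from the excerpt below):

```python
from datetime import datetime

def delta_seconds(start, error):
    """Seconds between two whirr.log timestamps (HH:MM:SS)."""
    fmt = "%H:%M:%S"
    return int((datetime.strptime(error, fmt)
                - datetime.strptime(start, fmt)).total_seconds())

# fc2: instances started 03:37:19, node error 03:38:46
print(delta_seconds("03:37:19", "03:38:46"))  # 87
# 01c: instance started 03:40:14, node error 03:41:19
print(delta_seconds("03:40:14", "03:41:19"))  # 65
```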

The most common underlying (caused-by) SSH error is "net.schmizz.sshj.userauth.UserAuthException: publickey auth failed". The overall error, "problem applying options to node", occurs about 10 seconds after the socket is opened, so the node is alive to some extent and this does not appear to be an SSH timeout. That it happens about one minute after instance start makes me think there could be an implicit timer awaiting boot-up. (These instances all use the same private instance-store AMI, with no EBS volumes.)

The failed nodes appear to be deleted only after enough replacement nodes have started, not at the moment they are determined to have failed. Looking at my billing records, I noticed that I am being charged for these failed nodes, so I think this is an important bug.


-----whirr.log excerpt-------
03:37:19,043 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-f9914fbe])
03:37:19,133 DEBUG [jclouds.compute] << present instances([region=us-west-1, name=i-f9914fbe])
03:37:19,332 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-87914fc0],[region=us-west-1, name=i-85914fc2])
03:37:19,495 DEBUG [jclouds.compute] << present instances([region=us-west-1, name=i-87914fc0],[region=us-west-1, name=i-85914fc2])

03:38:46,153 ERROR [jclouds.compute] << problem applying options to node(us-west-1/i-85914fc2)

03:40:14,460 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-5b8e501c])
03:40:14,547 DEBUG [jclouds.compute] << present instances([region=us-west-1, name=i-5b8e501c])

03:41:19,691 ERROR [jclouds.compute] << problem applying options to node(us-west-1/i-5b8e501c)

03:41:22,738 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-078e5040])
03:41:22,831 DEBUG [jclouds.compute] << present instances([region=us-west-1, name=i-078e5040])
03:44:34,257 INFO [org.apache.whirr.actions.BootstrapClusterAction] Deleting failed node node us-west-1/i-5b8e501c
03:44:34,259 INFO [org.apache.whirr.actions.BootstrapClusterAction] Deleting failed node node us-west-1/i-85914fc2
03:46:27,948 INFO [org.apache.whirr.service.FileClusterStateStore] (main) Wrote instances file instances

The instances file ends up containing: i-f9914fbe, i-87914fc0, i-078e5040
And not containing: i-5b8e501c, i-85914fc2
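That surviving set can be derived mechanically from the excerpt: collect the instance IDs from the "started instances" lines and subtract those in the "Deleting failed node" lines. A sketch (the regexes are mine, not Whirr code; log is abridged):

```python
import re

log = """
03:37:19,043 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-f9914fbe])
03:37:19,332 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-87914fc0],[region=us-west-1, name=i-85914fc2])
03:40:14,460 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-5b8e501c])
03:41:22,738 DEBUG [jclouds.compute] << started instances([region=us-west-1, name=i-078e5040])
03:44:34,257 INFO [org.apache.whirr.actions.BootstrapClusterAction] Deleting failed node node us-west-1/i-5b8e501c
03:44:34,259 INFO [org.apache.whirr.actions.BootstrapClusterAction] Deleting failed node node us-west-1/i-85914fc2
"""

started, deleted = set(), set()
for line in log.splitlines():
    if "started instances" in line:
        started.update(re.findall(r"name=(i-[0-9a-f]+)", line))
    elif "Deleting failed node" in line:
        deleted.update(re.findall(r"(i-[0-9a-f]+)", line))

survivors = started - deleted
print(sorted(survivors))  # ['i-078e5040', 'i-87914fc0', 'i-f9914fbe']
```

This matches the instances file contents, confirming that the deleted (but billed) nodes are exactly i-5b8e501c and i-85914fc2.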
