[While describing the problem below I concluded it is a bug, because it
results in being charged for apparently dead-on-arrival nodes. It looks
like WHIRR-378, and if it is the same problem, then WHIRR-378 needs a
workaround. The following was added as a comment to WHIRR-378.]
I frequently see net.schmizz.sshj.userauth.UserAuthException when using
Whirr 0.6.0 on EC2. There are many possible causes for an SSH error, so
I dug into one run and found that 2 out of 5 nodes were seemingly dead
on arrival. That is a high rate, so I wonder whether these were
false-positive DOAs, i.e. live nodes wrongly declared dead (in another
run I saw 2 out of 7). The trimmed whirr.log excerpt below shows
(nodes identified by the last 3 digits of their i-number):
1. starting 3 instances/nodes (fbe, fc0, fc2) at 3:37:19
2. problem with a node (fc2) at 3:38:46, or 87 sec. after node start
3. starting a new instance/node (01c) at 3:40:14
4. problem with another node (01c) at 3:41:19, or 65 sec. after node start
5. starting a new instance/node (040) at 3:41:22
6. deleting the failed nodes (01c, fc2) at 3:44:34
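The 87-second and 65-second gaps can be recomputed directly from the
log timestamps; a minimal sketch (times and node IDs copied from the
whirr.log excerpt below):

```python
from datetime import datetime

def ts(s):
    # whirr.log timestamps are HH:MM:SS,mmm (comma before milliseconds)
    return datetime.strptime(s, "%H:%M:%S,%f")

# (start, failure) times per failed node, copied from the log excerpt
events = {
    "i-85914fc2": ("03:37:19,332", "03:38:46,153"),
    "i-5b8e501c": ("03:40:14,460", "03:41:19,691"),
}
for node, (start, fail) in events.items():
    delta = (ts(fail) - ts(start)).total_seconds()
    print(f"{node}: failed {delta:.0f} s after start")
```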
The most common underlying (caused-by) SSH error is
"net.schmizz.sshj.userauth.UserAuthException: publickey auth failed".
The overall error "problem applying options to node" occurs about 10
seconds after the socket is opened, so the node is alive to some extent
and this does not appear to be an SSH connection timeout. That the
failure comes about one minute after instance start makes me suspect an
implicit timer awaiting boot-up. (These instances all use the same
private AMI, instance-store backed, with no EBS volumes.)
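To illustrate the distinction (this is not Whirr's or jclouds' actual
code, just a hypothetical sketch): "the port is open" and "publickey
auth can succeed" are separate milestones on a booting node, and a
probe that only checks the first could produce exactly this pattern of
an alive-but-unauthenticatable node.

```python
import socket
import time

def wait_for_port(host, port=22, timeout=120.0, interval=5.0):
    """Poll until a TCP connection to host:port succeeds or the deadline passes.

    Note: success only proves the port is open; on a booting EC2 node,
    sshd may accept connections before the public key has been installed,
    so authentication can still fail with "publickey auth failed".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```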
The failed nodes appear to be deleted only after enough replacement
nodes have started, not at the moment they are determined to have
failed. Looking at billing records, I noticed that I am being charged
for these failed nodes, so I think this is an important bug.
-----whirr.log excerpt-------
03:37:19,043 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-f9914fbe])
03:37:19,133 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-f9914fbe])
03:37:19,332 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-87914fc0],[region=us-west-1,
name=i-85914fc2])
03:37:19,495 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-87914fc0],[region=us-west-1,
name=i-85914fc2])
03:38:46,153 ERROR [jclouds.compute] << problem applying options to
node(us-west-1/i-85914fc2)
03:40:14,460 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-5b8e501c])
03:40:14,547 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-5b8e501c])
03:41:19,691 ERROR [jclouds.compute] << problem applying options to
node(us-west-1/i-5b8e501c)
03:41:22,738 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-078e5040])
03:41:22,831 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-078e5040])
03:44:34,257 INFO [org.apache.whirr.actions.BootstrapClusterAction]
Deleting failed node node us-west-1/i-5b8e501c
03:44:34,259 INFO [org.apache.whirr.actions.BootstrapClusterAction]
Deleting failed node node us-west-1/i-85914fc2
03:46:27,948 INFO [org.apache.whirr.service.FileClusterStateStore]
(main) Wrote instances file instances
The instances file ends up containing: i-f9914fbe i-87914fc0 i-078e5040
and not containing: i-5b8e501c i-85914fc2
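As a sanity check, a set difference over the IDs above (copied from the
log excerpt) reproduces the instances-file contents, and makes the
billing concern concrete: five instances were started (and billed), but
only three survive in the file.

```python
# Instance IDs copied from the whirr.log excerpt above
started = {"i-f9914fbe", "i-87914fc0", "i-85914fc2", "i-5b8e501c", "i-078e5040"}
deleted = {"i-5b8e501c", "i-85914fc2"}  # failed nodes removed at 03:44:34
surviving = started - deleted
print(sorted(surviving))  # → ['i-078e5040', 'i-87914fc0', 'i-f9914fbe']
```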