[While describing the problem below I concluded it is a bug, because it
results in being charged for apparently dead-on-arrival nodes. It looks
like WHIRR-378, and if it is the same problem, then WHIRR-378 needs a
workaround. The following was added as a comment to WHIRR-378.]
I frequently see net.schmizz.sshj.userauth.UserAuthException when using
Whirr 0.6.0 on EC2. There are many possible causes for an SSH error, so
I dug into one run and found that 2 out of 5 nodes were seemingly dead
on arrival. That is a high rate, so I wonder whether these were
false-positive DOAs, i.e. live nodes wrongly declared dead (in another
run I saw 2 out of 7). The trimmed whirr.log excerpt below shows
(nodes identified by the last 3 digits of their i-number):
1. starting 3 instances/nodes (fbe, fc0, fc2) at 3:37:19
2. problem with a node (fc2) at 3:38:46, or 87 sec. after node start
3. starting a new instance/node (01c) at 3:40:14
4. problem with another node (01c) at 3:41:19, or 65 sec. after node start
5. starting a new instance/node (040) at 3:41:22
6. deleting the failed nodes (01c, fc2) at 3:44:34
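The 87-second and 65-second gaps can be recomputed directly from the
log timestamps; a minimal sketch (times and node IDs copied from the
whirr.log excerpt below):

```python
from datetime import datetime

def ts(s):
    # whirr.log timestamps are HH:MM:SS,mmm (comma before milliseconds)
    return datetime.strptime(s, "%H:%M:%S,%f")

# (start, failure) times per failed node, copied from the log excerpt
events = {
    "i-85914fc2": ("03:37:19,332", "03:38:46,153"),
    "i-5b8e501c": ("03:40:14,460", "03:41:19,691"),
}
for node, (start, fail) in events.items():
    delta = (ts(fail) - ts(start)).total_seconds()
    print(f"{node}: failed {delta:.0f} s after start")
```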
The most common underlying (caused-by) SSH error is
"net.schmizz.sshj.userauth.UserAuthException: publickey auth failed".
The overall error "problem applying options to node" occurs about 10
seconds after the socket is opened, so the node is alive to some extent
and this does not appear to be an SSH connection timeout. That the
failure comes about one minute after instance start makes me suspect an
implicit timer awaiting boot-up. (These instances all use the same
private AMI, instance-store backed, with no EBS volumes.)
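To illustrate the distinction (this is not Whirr's or jclouds' actual
code, just a hypothetical sketch): "the port is open" and "publickey
auth can succeed" are separate milestones on a booting node, and a
probe that only checks the first could produce exactly this pattern of
an alive-but-unauthenticatable node.

```python
import socket
import time

def wait_for_port(host, port=22, timeout=120.0, interval=5.0):
    """Poll until a TCP connection to host:port succeeds or the deadline passes.

    Note: success only proves the port is open; on a booting EC2 node,
    sshd may accept connections before the public key has been installed,
    so authentication can still fail with "publickey auth failed".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```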
The failed nodes appear to be deleted only after enough replacement
nodes have started, not at the moment they are determined to have
failed. Looking at billing records, I noticed that I am being charged
for these failed nodes, so I think this is an important bug.
-----whirr.log excerpt-------
03:37:19,043 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-f9914fbe])
03:37:19,133 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-f9914fbe])
03:37:19,332 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-87914fc0],[region=us-west-1,
name=i-85914fc2])
03:37:19,495 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-87914fc0],[region=us-west-1,
name=i-85914fc2])
03:38:46,153 ERROR [jclouds.compute] << problem applying options to
node(us-west-1/i-85914fc2)
03:40:14,460 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-5b8e501c])
03:40:14,547 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-5b8e501c])
03:41:19,691 ERROR [jclouds.compute] << problem applying options to
node(us-west-1/i-5b8e501c)
03:41:22,738 DEBUG [jclouds.compute] << started
instances([region=us-west-1, name=i-078e5040])
03:41:22,831 DEBUG [jclouds.compute] << present
instances([region=us-west-1, name=i-078e5040])
03:44:34,257 INFO [org.apache.whirr.actions.BootstrapClusterAction]
Deleting failed node node us-west-1/i-5b8e501c
03:44:34,259 INFO [org.apache.whirr.actions.BootstrapClusterAction]
Deleting failed node node us-west-1/i-85914fc2
03:46:27,948 INFO [org.apache.whirr.service.FileClusterStateStore]
(main) Wrote instances file instances
The instances file ends up containing: i-f9914fbe i-87914fc0 i-078e5040
and not containing: i-5b8e501c i-85914fc2
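As a sanity check, a set difference over the IDs above (copied from the
log excerpt) reproduces the instances-file contents, and makes the
billing concern concrete: five instances were started (and billed), but
only three survive in the file.

```python
# Instance IDs copied from the whirr.log excerpt above
started = {"i-f9914fbe", "i-87914fc0", "i-85914fc2", "i-5b8e501c", "i-078e5040"}
deleted = {"i-5b8e501c", "i-85914fc2"}  # failed nodes removed at 03:44:34
surviving = started - deleted
print(sorted(surviving))  # → ['i-078e5040', 'i-87914fc0', 'i-f9914fbe']
```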