[
https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991171#comment-12991171
]
Andrei Savu commented on WHIRR-167:
-----------------------------------
Looks great and it's almost ready. I have noticed some minor issues:
- {{MAX_STARTUP_RETRIES}} has a fixed value. Should we make this a config
option?
- The integration tests depend on {{~/.ssh/id_rsa}}. You could use the
{{ClusterSpec.withTemporaryKeys}} factory method to avoid this.
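
To illustrate the first point, here is a minimal sketch of reading the retry limit from configuration instead of hard-coding it. The class {{StartupRetryConfig}}, the property key, the default value, and the {{startWithRetries}} helper are all assumptions made for illustration; none of them is taken from the patch.

```java
import java.util.Properties;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of the patch: makes the retry limit configurable.
final class StartupRetryConfig {
    // Assumed property key and default value, chosen for illustration only.
    static final String MAX_STARTUP_RETRIES_KEY = "whirr.max-startup-retries";
    static final int DEFAULT_MAX_STARTUP_RETRIES = 1;

    // Returns the configured retry limit, or the default when the key is absent or blank.
    static int maxStartupRetries(Properties config) {
        String raw = config.getProperty(MAX_STARTUP_RETRIES_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MAX_STARTUP_RETRIES;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(
                MAX_STARTUP_RETRIES_KEY + " must not be negative: " + value);
        }
        return value;
    }

    // Runs one initial attempt plus at most maxRetries retries.
    // The predicate receives the zero-based attempt index and returns true on success.
    // Returns the number of attempts used, or -1 if every attempt failed.
    static int startWithRetries(int maxRetries, IntPredicate attempt) {
        for (int i = 0; i <= maxRetries; i++) {
            if (attempt.test(i)) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

With this shape, a missing key keeps today's fixed behaviour, while users who see frequent EC2 startup failures can raise the limit without a code change.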
Also, thanks for fixing the short role names I missed in WHIRR-199. I haven't
run the integration tests yet; I'm waiting for WHIRR-227.
> Improve bootstrapping and configuration to be able to isolate and repair or
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
> Key: WHIRR-167
> URL: https://issues.apache.org/jira/browse/WHIRR-167
> Project: Whirr
> Issue Type: Improvement
> Affects Versions: 0.4.0
> Environment: Amazon EC2
> Reporter: Tibor Kiss
> Assignee: Tibor Kiss
> Fix For: 0.4.0
>
> Attachments: whirr-167-1.patch, whirr-167-2.patch, whirr-167-3.patch,
> whirr-167-4.patch, whirr-167-5.patch, whirr-integrationtest.tar.gz, whirr.log
>
>
> The cluster startup process on Amazon EC2 instances is very unstable. As the
> number of nodes to start increases, the startup process fails more often, but
> sometimes even a 2-3 node startup fails. We don't know how many instance
> startups are in progress on the Amazon side at the same time when a startup
> fails or succeeds. The only thing I see is that when I start around 10 nodes,
> the proportion of failing nodes is higher than with a smaller number of nodes,
> and it is not directly proportional to the number of nodes; the probability
> that some nodes fail appears to grow exponentially.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for
> RunNodesException and don't bail out if only a few " which indicates that the
> current startup process is unreliable. So we should improve it.
> We could add a "max percent failure" property (per instance template), so
> that if the number of failures exceeds this value, the whole cluster fails to
> launch and is shut down. For the master node the value would be 100%, but for
> datanodes it would be more like 75%. (Tom White also mentioned this in an email.)
> Let's discuss whether there are any other requirements for this improvement.
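
The "max percent failure" check proposed in the description could be sketched roughly as follows. This follows the literal wording (the threshold is the largest tolerated share of failed nodes in one instance template); the class and method names are hypothetical, and the thresholds in the usage note are illustrative values rather than the figures quoted above.

```java
// Hypothetical helper, not taken from the patch.
final class FailureThreshold {
    // Returns true when the failed nodes of one instance template exceed
    // the tolerated percentage of the nodes requested for that template.
    static boolean exceeds(int requested, int failed, int maxPercentFailure) {
        if (requested <= 0 || failed < 0 || failed > requested) {
            throw new IllegalArgumentException("invalid node counts");
        }
        if (maxPercentFailure < 0 || maxPercentFailure > 100) {
            throw new IllegalArgumentException("percentage must be within 0..100");
        }
        // Compare failed/requested with maxPercentFailure/100 using integer
        // arithmetic, which avoids floating-point rounding at the boundary.
        return failed * 100L > (long) maxPercentFailure * requested;
    }
}
```

For example, with an assumed threshold of 25, a template of 10 nodes tolerates 2 failures ({{exceeds(10, 2, 25)}} is false) but not 3 ({{exceeds(10, 3, 25)}} is true). A threshold of 0 means any single failure aborts the launch, which is the behaviour one would expect for a lone master node.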