[ https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tibor Kiss updated WHIRR-167:
-----------------------------

    Attachment: whirr-167-1.patch

I attached whirr-167-1.patch. I'm sure it is not the final version, but I 
would like to hear your opinions too.

I changed ClusterSpec and InstanceTemplate so that a minimum percentage of 
successfully started nodes can be specified per template.
If nothing is specified, it means 100%, so a value of
whirr.instance-templates=1 jt+nn,4 dn+tt%60
means that the "jt+nn" template passes only when 100% of its nodes start 
successfully, while the "dn+tt" template passes when at least 60% of its 
nodes start successfully.

If any template fails to meet its minimum requirement, a retry phase is 
initiated in which the failed nodes of each template are replaced with new 
ones. That means even a namenode startup problem no longer results in a 
completely lost cluster.
Without retries, a namenode failure would break an entire cluster even with 
many dn+tt nodes successfully started. I think it is worth minimizing the 
chance of failing this way, so I introduced a retry cycle.
If there are failures only in dn+tt and the minimum limit is still met, the 
cluster starts up with just that number of nodes, without any retry.
A retry cycle gives both templates a chance to grow the number of nodes back 
toward the requested maximum.
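Roughly, the retry cycle I have in mind looks like the following sketch; 
startNodes() and countStarted() are hypothetical stand-ins for the real 
provisioning calls, not actual Whirr methods:

import java.io.IOException;
import java.util.List;

// Illustrative sketch of the single retry cycle.
void launchWithOneRetry(List<InstanceTemplate> templates) throws IOException {
  for (InstanceTemplate t : templates) {
    startNodes(t, t.getNumberOfInstances());  // first round at full size
  }
  boolean retryNeeded = false;
  for (InstanceTemplate t : templates) {
    if (!t.isSatisfied(countStarted(t))) {
      retryNeeded = true;  // at least one template missed its minimum
    }
  }
  if (retryNeeded) {
    // One retry: every template gets the chance to replace its failed
    // nodes and grow back toward its requested size.
    for (InstanceTemplate t : templates) {
      int missing = t.getNumberOfInstances() - countStarted(t);
      if (missing > 0) {
        startNodes(t, missing);
      }
    }
  }
  // After the single retry, any template still below its minimum fails the launch.
  for (InstanceTemplate t : templates) {
    if (!t.isSatisfied(countStarted(t))) {
      throw new IOException("Insufficient nodes for roles " + t);
    }
  }
}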

At this moment I don't think more than one retry is worthwhile! The target is 
just to recover from a few sporadic service problems.
My question is: should we keep a single retry on insufficient nodes as the 
default, or default to no retry and add an extra parameter to enable it? 
Initially I don't like the idea of adding more parameters.

About failing nodes... there are two different cases:
1. If the minimum number of required nodes cannot be satisfied even by a 
retry cycle, all of the lost nodes are left as they are. A full cluster 
destroy will remove them.
2. If the number of nodes is satisfied, from the first round or after a 
retry, all the failed nodes (from the first round and from the retry cycle) 
are destroyed automatically at the end of BootstrapClusterAction.doAction, as 
in the sketch below.
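For case 2, the cleanup boils down to something like this; only 
destroyNode(String) is a real jclouds call here, the helper itself is 
hypothetical:

import java.util.Set;
import org.jclouds.compute.ComputeService;
import org.jclouds.compute.domain.NodeMetadata;

// Illustrative sketch of the automatic cleanup at the end of
// BootstrapClusterAction.doAction.
void destroyFailedNodes(ComputeService computeService,
    Set<NodeMetadata> failedNodes) {
  for (NodeMetadata node : failedNodes) {
    // destroyNode terminates a single node by id and leaves the shared
    // security group and placement group alone.
    computeService.destroyNode(node.getId());
  }
}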

I experienced some difficulties in destroying the nodes. Initially I used the 
destroyNodesMatching(Predicate<NodeMetadata> filter) method, which terminates 
all the enumerated nodes in parallel. But this method also wants to delete 
the security group and placement group. So I had to use the simple 
destroyNode(String id), which destroys the nodes sequentially, and I cannot 
control the KeyPair deletion. In my opinion the jclouds library is missing 
some convenient methods to revoke a subset of nodes without optionally 
propagating the KeyPair, SecurityGroup and PlacementGroup cleanup. 
Effectively I got stuck here, and I couldn't find an elegant solution that 
avoids this cleanup propagation.
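For reference, this is the pattern I started with; the id-matching predicate 
is just an illustrative example:

import java.util.Set;
import com.google.common.base.Predicate;
import org.jclouds.compute.ComputeService;
import org.jclouds.compute.domain.NodeMetadata;

// What I tried first: destroy the failed nodes in parallel. The problem is
// that destroyNodesMatching also cleans up the security group and placement
// group, which the surviving nodes still need.
void destroyInParallel(ComputeService computeService,
    final Set<String> failedIds) {
  computeService.destroyNodesMatching(new Predicate<NodeMetadata>() {
    @Override
    public boolean apply(NodeMetadata node) {
      return failedIds.contains(node.getId());
    }
  });
}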

> Improve bootstrapping and configuration to be able to isolate and repair or 
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
>                 Key: WHIRR-167
>                 URL: https://issues.apache.org/jira/browse/WHIRR-167
>             Project: Whirr
>          Issue Type: Improvement
>         Environment: Amazon EC2
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>         Attachments: whirr-167-1.patch, whirr.log
>
>
> Currently the cluster startup process on Amazon EC2 instances is very 
> unstable. As the number of nodes to start increases, the startup process 
> fails more often, but sometimes even a 2-3 node startup fails. We don't 
> know how many instance startups are in progress on the Amazon side at the 
> same time when it fails or when it starts up successfully. The only thing 
> I see is that when I start around 10 nodes, the proportion of failing nodes 
> is higher than with a smaller number of nodes, and it is not directly 
> proportional to the number of nodes; the probability of some nodes failing 
> looks exponentially higher.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for 
> RunNodesException and don't bail out if only a few " which indicates the 
> currently unreliable startup process. So we should improve it.
> We could add a "max percent failure" property (per instance template), so 
> that if the number of failures exceeds this value the whole cluster fails to 
> launch and is shut down. For the master node the value would be 100%, but 
> for datanodes it would be more like 75%. (Tom White also mentioned this in 
> an email.)
> Let's discuss whether there are any other requirements for this improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
