[jira] Commented: (WHIRR-167) Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2

Tibor Kiss (JIRA) Tue, 08 Feb 2011 22:09:23 -0800

    [ 
https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992334#comment-12992334
 ]


Tibor Kiss commented on WHIRR-167:
----------------------------------

Today I managed to rebuild the patch again.
Regarding MAX_STARTUP_RETRIES, a value higher than 1 is normally not useful 
in practice: if the lost instances cannot be covered by one retry, the 
problem can be considered severe enough to need manual intervention. If 
somebody really wants to set 
{{{
whirr.instance-templates-max-percent-failures=100% dn+tt
}}}
and is really willing to spend more than one retry, it can be an option, but 
definitely not a recommended one. Instead, a new feature for adding extra 
nodes afterwards could be more effective at compensating for lost nodes. In 
my experience, a failure to start an instance is usually a temporary 
condition: even when one retry is not enough, the instance reservation and 
startup problems probably disappear a few seconds or minutes later. 

A value of 0 also makes sense, to switch off the retry mechanism completely. 
So I will make a small change and add a new parameter:
{{{
whirr.max-startup-retries=1
}}}
with a default value of 1.
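
To illustrate how such a parameter is meant to behave, here is a minimal 
sketch in Python. It is not the Whirr Java code from the patch: 
start_with_retries, launch, wanted and required are hypothetical names, and 
retrying only until the required node count is reached is an assumption.

```python
def start_with_retries(launch, wanted, required, max_startup_retries=1):
    """Start `wanted` nodes, retrying the missing ones a bounded number of times.

    launch(n) is a hypothetical callback that tries to start n nodes and
    returns how many of them actually came up.
    """
    started = launch(wanted)
    retries = 0
    # A value of 0 for max_startup_retries disables the retry loop entirely.
    while started < required and retries < max_startup_retries:
        started += launch(wanted - started)
        retries += 1
    return started, retries
```

With the default of 1, a single extra attempt is made for the missing nodes; 
if that is still not enough, the caller sees fewer started nodes than 
required and can fail the launch.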

I also check the temporary keys.


> Improve bootstrapping and configuration to be able to isolate and repair or 
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
>                 Key: WHIRR-167
>                 URL: https://issues.apache.org/jira/browse/WHIRR-167
>             Project: Whirr
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>         Environment: Amazon EC2
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>             Fix For: 0.4.0
>
>         Attachments: whirr-167-1.patch, whirr-167-2.patch, whirr-167-3.patch, 
> whirr-167-4.patch, whirr-167-5.patch, whirr-integrationtest.tar.gz, whirr.log
>
>
> The cluster startup process on Amazon EC2 instances is currently very 
> unstable. As the number of nodes to be started increases, the startup 
> process fails more often, but sometimes even a 2-3 node startup fails. We 
> don't know how many instance startups are in progress on the Amazon side 
> at the moment it fails or succeeds. The only thing I see is that when I 
> start around 10 nodes, the rate of failing nodes is higher than with a 
> smaller number of nodes, and it is not directly proportional to the number 
> of nodes: the probability that some nodes fail looks exponentially higher.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for 
> RunNodesException and don't bail out if only a few " which indicates that 
> the current startup process is unreliable. So we should improve it.
> We could add a "max percent failure" property (per instance template), so 
> that if the number of failures exceeds this value the whole cluster fails to 
> launch and is shut down. For the master node the value would be 100%, but 
> for datanodes it would be more like 75%. (Tom White also mentioned this in 
> an email.)
> Let's discuss whether there are any other requirements for this improvement.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
