[jira] Updated: (WHIRR-167) Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2

Tibor Kiss (JIRA) Tue, 18 Jan 2011 13:24:05 -0800

     [ 
https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tibor Kiss updated WHIRR-167:
-----------------------------

    Attachment: whirr-167-4.patch

* The code is swallowing NumberFormatExceptions in ClusterSpec, which is the 
right thing to do, but should have a comment, or a log statement.
** We don't need to catch NumberFormatException there. I covered that case in 
the unit tests, so I removed the catch from the main code.
* The cases for NumberFormatException would benefit from unit tests.
** The unit test now verifies that the user is informed correctly when he/she 
supplies a badly formatted value string.
* Is ClusterSpec#parse(String... strings) still used? If not then we should 
remove it.
** There is no ClusterSpec#parse; we only have these:
{code}
org.apache.whirr.service.ClusterSpec.InstanceTemplate.parse(String...)
org.apache.whirr.service.ClusterSpec.InstanceTemplate.parse(CompositeConfiguration)
{code}
 called from
{code}
setInstanceTemplates(InstanceTemplate.parse(c));
{code}
* Please add some documentation for the new properties to 
src/site/confluence/configuration-guide.confluence.
** Added.
* Running "mvn checkstyle:checkstyle apache-rat:check" seems to produce some 
warnings. Can you fix these please.
** I checked, and for me it does not report anything wrong. If you still see 
warnings, could you point them out to me?
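
To illustrate the kind of behaviour the unit test checks (a badly formatted value produces a clear message for the user), here is a minimal sketch. It is not the code from the patch: the class name PercentFailuresParserSketch, the method parse, and the choice to rethrow as IllegalArgumentException are all assumptions made for illustration, and the accepted format simply mirrors the "60% dn+tt" values used in this message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Whirr: parses values such as
// "90% dn+tt" or "100% jt+nn,60% dn+tt" into role group -> percentage.
public final class PercentFailuresParserSketch {

    public static Map<String, Integer> parse(String value) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String rawEntry : value.split(",")) {
            String entry = rawEntry.trim();
            // Split into "<percent>" and "<roles>" on the first run of whitespace.
            String[] parts = entry.split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException(
                    "Expected '<percent> <roles>' but got: '" + entry + "'");
            }
            // The trailing '%' is optional in this sketch.
            String number = parts[0].endsWith("%")
                ? parts[0].substring(0, parts[0].length() - 1)
                : parts[0];
            int percent;
            try {
                percent = Integer.parseInt(number);
            } catch (NumberFormatException e) {
                // Do not swallow the error: tell the user which value is malformed.
                throw new IllegalArgumentException(
                    "Invalid percentage '" + parts[0] + "' in: '" + entry + "'", e);
            }
            if (percent < 0 || percent > 100) {
                throw new IllegalArgumentException(
                    "Percentage out of range 0..100 in: '" + entry + "'");
            }
            result.put(parts[1], percent);
        }
        return result;
    }
}
```

With this sketch, parse("60% dn+tt") maps "dn+tt" to 60, while parse("sixty% dn+tt") fails with a message that names the offending value.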

Today I also started using the patched version to create some clusters for 
running real jobs, and I observed two cases. 
Initially I configured:
{code}
whirr.instance-templates=1 jt+nn,8 dn+tt
whirr.instance-templates-max-percent-failures=90% dn+tt
{code}
Three of the started nodes were not reachable over ssh (not even with manual 
intervention). This triggered the retry mechanism, which exceeded the 
20-instance limit, so the cluster could not start.
Later I reduced the percentage:
{code}
whirr.instance-templates=1 jt+nn,8 dn+tt
whirr.instance-templates-max-percent-failures=60% dn+tt
{code}
This time the 20-instance limit was exceeded again, but only by one instance, 
so the cluster came online with 7 dn+tt without any retry.
This is exactly the behaviour I want. (Never mind the 20-instance limit; 
normally we have it lifted, but I deliberately kept it on this account for 
these tests.)
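
The two runs above can be reproduced with a small calculation. This is only a sketch under assumptions: it treats the configured percentage as the minimum share of requested nodes that must start successfully (which is how both runs behaved), and it rounds the required count up. The class StartupThresholdSketch and its methods are hypothetical names, not Whirr code.

```java
// Hypothetical illustration, not part of Whirr.
public final class StartupThresholdSketch {

    /** Smallest node count that satisfies the percentage (rounded up; an assumption). */
    public static int minimumRequired(int requested, int percent) {
        return (requested * percent + 99) / 100;
    }

    /** True when enough nodes started for the cluster to go online without a retry. */
    public static boolean canProceed(int requested, int started, int percent) {
        return started >= minimumRequired(requested, percent);
    }

    public static void main(String[] args) {
        // First run: 8 dn+tt requested, 3 unreachable, 90% configured.
        // 5 started < 8 required, so a retry is triggered.
        System.out.println(canProceed(8, 5, 90)); // prints "false"

        // Second run: 8 dn+tt requested, 1 not started, 60% configured.
        // 7 started >= 5 required, so the cluster goes online.
        System.out.println(canProceed(8, 7, 60)); // prints "true"
    }
}
```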

> Improve bootstrapping and configuration to be able to isolate and repair or 
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
>                 Key: WHIRR-167
>                 URL: https://issues.apache.org/jira/browse/WHIRR-167
>             Project: Whirr
>          Issue Type: Improvement
>         Environment: Amazon EC2
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>         Attachments: whirr-167-1.patch, whirr-167-2.patch, whirr-167-3.patch, 
> whirr-167-4.patch, whirr-integrationtest.tar.gz, whirr.log
>
>
> The cluster startup process on Amazon EC2 instances is currently very 
> unstable. As the number of nodes to start increases, the startup process 
> fails more often, but sometimes even a 2-3 node startup fails. We don't know 
> how many instance startups are in progress on the Amazon side at the same 
> time when it fails or when it succeeds. 
> The only thing I see is that when I start around 10 nodes, the failure rate 
> is higher than with a smaller number of nodes, and it is not directly 
> proportional to the number of nodes; it looks like the probability that some 
> nodes fail grows exponentially.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for 
> RunNodesException and don't bail out if only a few " which indicates the 
> current unreliable startup process. So we should improve it.
> We could add a "max percent failure" property (per instance template), so 
> that if the number of failures exceeds this value the whole cluster fails to 
> launch and is shut down. For the master node the value would be 100%, but for 
> datanodes it would be more like 75%. (Tom White also mentioned this in an 
> email.)
> Let's discuss whether there are any other requirements for this improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
