[ 
https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979423#action_12979423
 ] 

Tibor Kiss edited comment on WHIRR-167 at 1/9/11 5:54 PM:
----------------------------------------------------------

Here it is. I repeated the simulation of 1+2 failing nodes on a 1+2 cluster, 
and I destroy the bad nodes with
computeService.destroyNodesMatching(withIds(badIds))
where Predicate<NodeMetadata> withIds(String... ids) is the filter.
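
To make the filter concrete, here is a minimal sketch of what such a withIds predicate could look like. It is not the jclouds implementation: Node is a hypothetical stand-in for NodeMetadata that models only the id, java.util.function.Predicate is used in place of whatever predicate type the library expects, and the two i-0000000x ids are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for jclouds' NodeMetadata; only the id is modelled.
final class Node {
    private final String id;

    Node(String id) {
        this.id = id;
    }

    String getId() {
        return id;
    }
}

final class NodeFilters {
    private NodeFilters() {
    }

    // Returns a predicate that matches nodes whose id is one of the given ids.
    static Predicate<Node> withIds(String... ids) {
        Set<String> wanted = new HashSet<>(Arrays.asList(ids));
        return node -> wanted.contains(node.getId());
    }

    // Lists the ids a destroy call would target, without destroying anything.
    static List<String> idsMatching(List<Node> nodes, Predicate<Node> filter) {
        return nodes.stream()
                .filter(filter)
                .map(Node::getId)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
                new Node("us-east-1/i-698f4a05"),   // the bad node from the log
                new Node("us-east-1/i-00000001"),   // made-up healthy node
                new Node("us-east-1/i-00000002"));  // made-up healthy node
        System.out.println(idsMatching(cluster, withIds("us-east-1/i-698f4a05")));
    }
}
```

Listing the matched ids first is a cheap way to check that the filter selects only the bad nodes before handing it to a destructive call.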

At the first failing-node deletion, it breaks my integration test:

org.apache.whirr.service.hadoop.integration.HadoopServiceTest  Time elapsed: 0 sec  <<< ERROR!
org.jclouds.aws.AWSResponseException: request POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 failed with code 400, error: AWSError{requestId='3fa71f7d-508f-4a95-aa9c-af0c3e060035', requestToken='null', code='InvalidGroup.InUse', message='There are active instances using security group 'jclouds#hadoopclustertest#us-east-1'', context='{Response=, Errors=}'}
        at org.jclouds.aws.handlers.ParseAWSErrorFromXmlContent.handleError(ParseAWSErrorFromXmlContent.java:80)
        at org.jclouds.http.handlers.DelegatingErrorHandler.handleError(DelegatingErrorHandler.java:70)
        at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.shouldContinue(BaseHttpCommandExecutorService.java:193)
        at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:163)
        at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:132)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

In jclouds-compute.log I have:
2011-01-09 23:40:45,414 DEBUG [jclouds.compute] (main) >> destroying nodes matching(withIds([us-east-1/i-698f4a05]))
2011-01-09 23:40:53,202 DEBUG [jclouds.compute] (user thread 13) >> destroying node(us-east-1/i-698f4a05)
2011-01-09 23:41:24,186 DEBUG [jclouds.compute] (user thread 13) << destroyed node(us-east-1/i-698f4a05) success(false)
2011-01-09 23:41:24,186 DEBUG [jclouds.compute] (main) << destroyed(1)
2011-01-09 23:41:24,380 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#40)
2011-01-09 23:41:24,547 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#40)
2011-01-09 23:41:24,547 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#67)
2011-01-09 23:41:24,716 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#67)
2011-01-09 23:41:24,716 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#63)
2011-01-09 23:41:24,882 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#63)
2011-01-09 23:41:24,883 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#2)
2011-01-09 23:41:25,055 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#2)
2011-01-09 23:41:25,240 DEBUG [jclouds.compute] (main) >> deleting securityGroup(jclouds#hadoopclustertest#us-east-1)

And it removes all 4 of my keypairs! I have 4 keypairs because I had to retry 
both roles.
By the way, I hope there is no problem with having a different keypair per role 
(the second keypair is created because of the retry).

> Improve bootstrapping and configuration to be able to isolate and repair or 
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
>                 Key: WHIRR-167
>                 URL: https://issues.apache.org/jira/browse/WHIRR-167
>             Project: Whirr
>          Issue Type: Improvement
>         Environment: Amazon EC2
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>         Attachments: whirr-167-1.patch, whirr.log
>
>
> The cluster startup process on Amazon EC2 instances is currently very 
> unstable. As the number of nodes to be started increases, the startup 
> process fails more often, but sometimes even a 2-3 node startup fails. We 
> don't know how many instance startups are in progress on the Amazon side at 
> the same time when it fails or when it succeeds. The only thing I see is 
> that when I start around 10 nodes, the rate of failing nodes is higher than 
> with a smaller number of nodes, and it is not directly proportional to the 
> number of nodes; it looks like the probability that some nodes fail grows 
> exponentially.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for 
> RunNodesException and don't bail out if only a few " which indicates that the 
> current startup process is unreliable. So we should improve it.
> We could add a "max percent failure" property (per instance template), so 
> that if the number of failures exceeds this value the whole cluster fails to 
> launch and is shut down. For the master node the value would be 100%, but for 
> datanodes it would be more like 75%. (Tom White also mentioned this in an email.)
> Let's discuss whether there are any other requirements for this improvement.
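
The "max percent failure" check proposed above could be sketched roughly as follows. This is only an illustration of the arithmetic, not code from Whirr: the class name FailureThreshold, the method exceeds, and its parameters are hypothetical, and the threshold and node counts in main are illustrative values rather than the ones discussed in the issue.

```java
// Hypothetical helper; none of these names exist in Whirr.
final class FailureThreshold {
    private FailureThreshold() {
    }

    // Returns true when the share of failed nodes for one instance template
    // is strictly greater than the allowed percentage (0..100).
    static boolean exceeds(int requested, int failed, double maxPercentFailure) {
        if (requested <= 0) {
            throw new IllegalArgumentException("requested must be positive");
        }
        if (failed < 0 || failed > requested) {
            throw new IllegalArgumentException("failed must be between 0 and requested");
        }
        double percentFailed = 100.0 * failed / requested;
        return percentFailed > maxPercentFailure;
    }

    public static void main(String[] args) {
        // Illustrative numbers: 10 nodes requested, 3 failed, 25% allowed.
        boolean abort = exceeds(10, 3, 25.0);
        System.out.println(abort ? "abort and shut down the cluster" : "keep going");
    }
}
```

Evaluating the threshold per instance template keeps the decision for master nodes independent from the one for datanodes.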

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
