Hi Arun,

Thanks for the prompt reply. We need to test it for our school project, which is scheduled to end in early December, so we still need to continue.

The YARN-128 discussion (https://issues.apache.org/jira/browse/YARN-128) mentions that Devaraj has successfully tested RM resurrection. How was it tested in that case? Do you kill and resurrect the RM at random times?

We are doing the resurrection using the following steps:

1. Run an example MR job (such as the Pi computation).
2. After the map and reduce phases have started, kill the RM using the Linux kill command.
3. Wait for 3 seconds, then resurrect the RM.
4. We noticed that the map phase is able to continue, but the job gets stuck once the map phase reaches 100%. At that point the reduce phase is still at 0%.

We also modified TestMRJobs.java to use ZKStore, and we use ResourceManagerWrapper to start and stop the ResourceManager.

regards,

Arinto Murdopo
European Master in Distributed Computing (EMDC)
Universitat Politècnica de Catalunya · BarcelonaTech, Barcelona, Spain
KTH Royal Institute of Technology, Stockholm, Sweden
Phone: +46 725 548 759

On Sat, Nov 3, 2012 at 7:04 PM, Arun C Murthy wrote:

> Arinto,
>
> Unfortunately, it's too early to try it yet, I'd wait for a little longer
> to for it to stabilize - should be soon.
>
> Thanks for trying it and the feedback though! Much appreciated.
>
> Arun
>
> On Nov 3, 2012, at 6:55 AM, Arinto Murdopo wrote:
>
> > Hi all,
> >
> > We have this exception when we tried to resurrect ResourceManager using
> > ZKStore. We are using Hadoop version 2.0.2 Alpha RC2, with patch from
> > #YARN-128 issue (https://issues.apache.org/jira/browse/YARN-128).
> >
> > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at RECOVERING
> >     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
> >     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> >     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> >     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:510)
> >     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:83)
> >     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:442)
> >     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:423)
> >     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
> >     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> >     at java.lang.Thread.run(Thread.java:662)
> >
> > Inspecting RMAppAttemptImpl, we noticed that the state transition doesn't
> > handle CONTAINER_FINISHED event when it is in the RECOVERING state. So in
> > this case, what is the correct transition to handle CONTAINER_FINISHED
> > event when we are in RECOVERING state?
> >
> > regards,
> >
> > Arinto Murdopo
> > European Master in Distributed Computing (EMDC)
> > Universitat Politècnica de Catalunya · BarcelonaTech, Barcelona, Spain
> > KTH Royal Institute of Technology, Stockholm, Sweden
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
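
To make the question about the missing transition concrete, here is a toy, self-contained sketch of a table-driven state machine. It is not the Hadoop StateMachineFactory code. The class ToyStateMachine, the RECOVERED event, the RUNNING and FINISHED states, and the RECOVERING -> RUNNING target used in main are all hypothetical; only the names RECOVERING and CONTAINER_FINISHED come from the trace. It only shows how an event with no registered transition from the current state gets rejected, and how registering one makes it accepted. Which target state is actually right for RMAppAttemptImpl is exactly what we are asking.

```java
import java.util.EnumMap;
import java.util.Map;

public class ToyStateMachine {
    // Hypothetical states and events; only RECOVERING and
    // CONTAINER_FINISHED are taken from the stack trace.
    enum State { RECOVERING, RUNNING, FINISHED }
    enum Event { CONTAINER_FINISHED, RECOVERED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
    private State current;

    ToyStateMachine(State initial) {
        this.current = initial;
    }

    // Register a transition; returns this so calls can be chained.
    ToyStateMachine addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
             .put(on, to);
        return this;
    }

    // Apply an event; reject it if no transition is registered
    // for the (current state, event) pair.
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException(
                    "Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    State getCurrent() {
        return current;
    }

    public static void main(String[] args) {
        // No transition registered for CONTAINER_FINISHED at RECOVERING.
        ToyStateMachine sm = new ToyStateMachine(State.RECOVERING)
                .addTransition(State.RECOVERING, Event.RECOVERED, State.RUNNING)
                .addTransition(State.RUNNING, Event.CONTAINER_FINISHED, State.FINISHED);
        try {
            sm.handle(Event.CONTAINER_FINISHED);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Assumption for illustration only: stay in RECOVERING when a
        // container finishes during recovery.
        sm.addTransition(State.RECOVERING, Event.CONTAINER_FINISHED, State.RECOVERING);
        System.out.println(sm.handle(Event.CONTAINER_FINISHED));
    }
}
```

In this toy version the rejected event leaves the current state unchanged, so the machine is still usable after the exception.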
