[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

Omkar Vinit Joshi (JIRA) Mon, 11 Nov 2013 12:47:59 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819358#comment-13819358
 ]


Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------

thanks [~jianhe]

bq. Please revert all files with import-only changes, CotainerLauncher etc.
bq. revert NodeManager, TestNodeManagerResync changes.
done..

bq. rename RegisterNodeManagerRequest.addAllContainerStatuses to 
setContainerStatuses to conform with convention namings.
bq. Bug in RegisterNodeManagerRequestPBImpl.addAllContainerStatuses(), you are 
appending the containers to the existing container list instead of set. also no 
need to call initFinishedContainers() inside. Can you see what 
NodeStatusPBImpl.(get/set)ContainerStatuses() is doing?
I am following NodeHeartbeatResponse convention.

bq. RMAppAttemptImpl.getRecoveredFinalState not used, removed.
removed..

bq. wrong comment, am2 attempt should be at Launched state. There's one more 
such same wrong comment at line 537
fixed.

bq. waiting for 20 secs is too long for a unit test, can you pick a value as 
small as possible ?
I am making it 10 secs..

bq. no need to create new method waitForAppAttemptToExpire(), we can just use 
MockRM.waitForState()
bq.similarly for newly created methods, waitForRMToProcessAllEvents, 
waitForRMAppAttempts can be achieved by just waiting for the specific app or 
attempt state. That achieves the same result as waiting for all events get 
processed.
fixed..

bq. For the 3rd case, shouldn't we test that as you said in the comment "all 
the stored attempts had finished then new attempt should be started immediately"
last part of test case is actually testing that only..

bq. we also need one more test case that, if RM crashes before attempt initial 
state info is saved in RMStateStore. App will be recovered with no attempt 
associated with it. For that we have no chance to replay the AttemptRecovered 
logic to start a new attempt, App itself should be able to start a new attempt.
added one.


> During RM restart, RM should start a new attempt only when previous attempt 
> exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
> YARN-1210.4.patch, YARN-1210.4.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then 
> kill them forcefully before even starting a new AM. Worst case, RM will start 
> a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
> we'll minimize multiple AMs racing with each other. This can help issues with 
> downstream components like Pig, Hive and Oozie during RM restart.
> In the mean while, new apps will proceed as usual as existing apps wait for 
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs 
> which can properly sync back up with RM can continue to run and those that 
> don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

Reply via email to