[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121030#comment-16121030 ]

Yuqi Wang commented on YARN-6959:
---------------------------------


{code:java}
2017-07-31 21:29:34,047 INFO [Container Monitor] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Memory usage of ProcessTree container_e71_1500967702061_2512_01_000001 for 
container-id container_e71_1500967702061_2512_01_000001: 7.1 GB of 20 GB 
physical memory used; 8.5 GB of 30 GB virtual memory used
2017-07-31 21:29:37,423 INFO [Container Monitor] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Memory usage of ProcessTree container_e71_1500967702061_2512_01_000001 for 
container-id container_e71_1500967702061_2512_01_000001: 7.1 GB of 20 GB 
physical memory used; 8.5 GB of 30 GB virtual memory used
2017-07-31 21:29:38,239 WARN [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from container container_e71_1500967702061_2512_01_000001 is : 15
2017-07-31 21:29:38,239 WARN [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
from container-launch with container ID: 
container_e71_1500967702061_2512_01_000001 and exit code: 15
ExitCodeException exitCode=15: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:579)
        at org.apache.hadoop.util.Shell.run(Shell.java:490)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:756)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:329)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:86)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-07-31 21:29:38,239 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
container-launch.
Container id: container_e71_1500967702061_2512_01_000001
Exit code: 15
Stack trace: ExitCodeException exitCode=15: 
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
org.apache.hadoop.util.Shell.runCommand(Shell.java:579)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
org.apache.hadoop.util.Shell.run(Shell.java:490)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:756)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:329)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:86)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:      at 
java.lang.Thread.run(Thread.java:745)
2017-07-31 21:29:38,240 INFO [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: 

2017-07-31 21:29:38,241 WARN [ContainersLauncher #60] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Container exited with a non-zero exit code 15
2017-07-31 21:29:38,241 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_e71_1500967702061_2512_01_000001 transitioned from RUNNING 
to EXITED_WITH_FAILURE
2017-07-31 21:29:38,241 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_e71_1500967702061_2512_01_000001
2017-07-31 21:29:38,331 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Cleaning Yarn container: 
container_id=container_e71_1500967702061_2512_01_000001
2017-07-31 21:29:38,332 WARN [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop       
OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  
DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE    
APPID=application_1500967702061_2512    
CONTAINERID=container_e71_1500967702061_2512_01_000001
2017-07-31 21:29:38,333 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_e71_1500967702061_2512_01_000001 transitioned from 
EXITED_WITH_FAILURE to DONE
2017-07-31 21:29:38,333 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Removing container_e71_1500967702061_2512_01_000001 from application 
application_1500967702061_2512
2017-07-31 21:29:38,333 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
 Considering container container_e71_1500967702061_2512_01_000001 for 
log-aggregation
2017-07-31 21:29:38,333 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
event CONTAINER_STOP for appId application_1500967702061_2512
2017-07-31 21:29:38,333 INFO [AsyncDispatcher event handler] 
org.apache.spark.network.yarn.YarnShuffleService: Stopping container 
container_e71_1500967702061_2512_01_000001
{code}



> RM may allocate wrong AM Container for new attempt
> --------------------------------------------------
>
>                 Key: YARN-6959
>                 URL: https://issues.apache.org/jira/browse/YARN-6959
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, fairscheduler, scheduler
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>              Labels: patch
>             Fix For: 2.7.1, 3.0.0-alpha4
>
>         Attachments: YARN-6959.001.patch, YARN-6959.002.patch, 
> YARN-6959.003.patch, YARN-6959.004.patch, YARN-6959.005.patch, 
> YARN-6959-branch-2.7.001.patch, YARN-6959.yarn_nm.log.zip, 
> YARN-6959.yarn_rm.log.zip
>
>
> *Issue Summary:*
> A previous attempt's ResourceRequests may be recorded into the current 
> attempt's ResourceRequests. These mis-recorded ResourceRequests may confuse 
> the AM Container request and allocation for the current attempt.
> *Issue Pipeline:*
> {code:java}
> // Executing precondition check for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
> // The precondition check above may be outdated here, i.e. the
> // currentAttempt may not be the attempt corresponding to the attemptId;
> // for example, the attempt id may correspond to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
> // The previous attempt's ResourceRequests may be recorded into the
> // current attempt's ResourceRequests.
> currentAttempt.updateResourceRequests(ask) ->
> // RM may allocate the wrong AM Container for the current attempt, because
> // its ResourceRequests may come from the previous attempt (i.e. any
> // ResourceRequests the previous AM asked for), and there is no matching
> // logic between the original AM Container ResourceRequest and the
> // returned amContainerAllocation below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
> *Patch Correctness:*
> After this patch, RM will record ResourceRequests from different attempts 
> into different SchedulerApplicationAttempt.AppSchedulingInfo objects. So, 
> even if RM still records ResourceRequests from an old attempt at any time, 
> those requests land in the old AppSchedulingInfo object and will not impact 
> the current attempt's resource requests and allocation.
> *Concerns:*
> The getApplicationAttempt function in AbstractYarnScheduler is confusing: it 
> actually returns the current attempt, so it would be better renamed to 
> getCurrentApplicationAttempt. We should also reconsider whether there are 
> any other bugs related to getApplicationAttempt.
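The isolation argument in the description can be illustrated with a minimal, hypothetical sketch. The class and method names below are illustrative only (they are not YARN's actual classes); the point is the data-structure choice the patch relies on: keying request state by attempt id, so a late allocate() call carrying a stale attempt id cannot pollute the current attempt's requests.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of per-attempt request isolation. Mirrors the idea of
// one AppSchedulingInfo object per SchedulerApplicationAttempt: requests
// recorded under an old attempt id stay in the old attempt's store.
public class PerAttemptRequests {
    // One request store per attempt id.
    private final Map<Integer, Map<String, Integer>> requestsByAttempt =
        new HashMap<>();

    void updateResourceRequests(int attemptId, String resource, int count) {
        requestsByAttempt
            .computeIfAbsent(attemptId, id -> new HashMap<>())
            .put(resource, count);
    }

    Map<String, Integer> requestsFor(int attemptId) {
        return requestsByAttempt.getOrDefault(attemptId, new HashMap<>());
    }

    public static void main(String[] args) {
        PerAttemptRequests store = new PerAttemptRequests();
        // Current attempt (id 2) asks for its AM container.
        store.updateResourceRequests(2, "am-container", 1);
        // A late allocate() from the previous attempt (id 1) arrives; it
        // lands in attempt 1's store and cannot overwrite attempt 2's.
        store.updateResourceRequests(1, "stale-task", 8);
        System.out.println(store.requestsFor(2).containsKey("stale-task")); // false
        System.out.println(store.requestsFor(2).get("am-container")); // 1
    }
}
```

With a single shared store (the pre-patch situation), the second update would have been merged into the same object the current attempt reads from, which is exactly the mis-recording described above.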



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
