[ https://issues.apache.org/jira/browse/YARN-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chandni Singh resolved YARN-8231.
---------------------------------
    Resolution: Invalid

# The distributed shell application does not re-launch containers when it receives a container-completed event from the Node Manager.
# To have the NM retry failed containers, additional configs must be provided, for example {{container_retry_policy}} and {{container_max_retries}}.
# Force-killing a container (exit code 137) does not trigger a retry:
{code}
  @Override
  public boolean shouldRetry(int errorCode) {
    if (errorCode == ExitCode.SUCCESS.getExitCode()
        || errorCode == ExitCode.FORCE_KILLED.getExitCode()
        || errorCode == ExitCode.TERMINATED.getExitCode()) {
      return false;
    }
    return retryPolicy.shouldRetry(windowRetryContext, errorCode);
  }
{code}

> Dshell application fails when one of the docker container gets killed
> ---------------------------------------------------------------------
>
>                 Key: YARN-8231
>                 URL: https://issues.apache.org/jira/browse/YARN-8231
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>            Reporter: Yesha Vora
>            Priority: Critical
>
> 1) Launch dshell application
> {code}
> yarn jar hadoop-yarn-applications-distributedshell-*.jar -shell_command
> "sleep 300" -num_containers 2 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker
> -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos/httpd-24-centos7:latest
> -keep_containers_across_application_attempts -jar
> /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell-*.jar{code}
> 2) Kill container_1524681858728_0012_01_000002
> Expected behavior:
> Application should start new instance and finish successfully
> Actual behavior:
> Application Failed as soon as container was killed
> {code:title=AM log}
> 18/04/27 23:05:12 INFO distributedshell.ApplicationMaster: Got response from
> RM for container ask, completedCnt=1
> 18/04/27 23:05:12 INFO distributedshell.ApplicationMaster:
> appattempt_1524681858728_0012_000001 got container status for
> containerID=container_1524681858728_0012_01_000002, state=COMPLETE,
> exitStatus=137, diagnostics=[2018-04-27 23:05:09.310]Container killed on
> request. Exit code is 137
> [2018-04-27 23:05:09.331]Container exited with a non-zero exit code 137.
> [2018-04-27 23:05:09.332]Killed by external signal
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Got response from
> RM for container ask, completedCnt=1
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster:
> appattempt_1524681858728_0012_000001 got container status for
> containerID=container_1524681858728_0012_01_000003, state=COMPLETE,
> exitStatus=0, diagnostics=
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Container
> completed successfully., containerId=container_1524681858728_0012_01_000003
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Application
> completed. Stopping running containers
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Application
> completed. Signalling finish to RM
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Diagnostics.,
> total=2, completed=2, allocated=2, failed=1
> 18/04/27 23:08:46 INFO impl.AMRMClientImpl: Waiting for application to be
> successfully unregistered.{code}
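
For readers who want to see why the kill in step 2 was never retried, the rule quoted in the resolution note can be approximated by a standalone sketch. This is not Hadoop code: the class name, the {{remainingRetries}} parameter, and the simple "retry while attempts remain" check are hypothetical stand-ins for {{retryPolicy.shouldRetry(...)}}, and the numeric value 143 for the terminated case is an assumption based on the usual 128 + SIGTERM convention (137 is the value reported in the AM log).

```java
/** Standalone illustration only; all names here are hypothetical, not Hadoop APIs. */
public class RetryDecisionSketch {
    // Assumed to mirror ExitCode.SUCCESS, FORCE_KILLED and TERMINATED.
    static final int SUCCESS = 0;
    static final int FORCE_KILLED = 137; // 128 + SIGKILL (9), as seen in the AM log
    static final int TERMINATED = 143;   // 128 + SIGTERM (15), assumed value

    /**
     * Returns true only when the exit code is eligible for a retry and the
     * (hypothetical) retry budget is not exhausted.
     */
    static boolean shouldRetry(int exitCode, int remainingRetries) {
        // Clean exits and deliberate kills are never retried.
        if (exitCode == SUCCESS || exitCode == FORCE_KILLED || exitCode == TERMINATED) {
            return false;
        }
        // Stand-in for the configured retry policy.
        return remainingRetries > 0;
    }

    public static void main(String[] args) {
        System.out.println("exit 137, 3 retries left: " + shouldRetry(137, 3));
        System.out.println("exit 1,   3 retries left: " + shouldRetry(1, 3));
        System.out.println("exit 1,   0 retries left: " + shouldRetry(1, 0));
    }
}
```

Under these assumptions a container killed with exit code 137 is rejected before the retry budget is even consulted, which matches the behavior described in the resolution.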