[jira] [Commented] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.
[ https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836116#comment-13836116 ] Omkar Vinit Joshi commented on YARN-1417: - cool...will update the patch with tests.. RM may issue expired container tokens to AM while issuing new containers. - Key: YARN-1417 URL: https://issues.apache.org/jira/browse/YARN-1417 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Blocker Attachments: YARN-1417.2.patch Today we create new container token when we create container in RM as a part of schedule cycle. However that container may get reserved or assigned. If the container gets reserved and remains like that (in reserved state) for more than container token expiry interval then RM will end up issuing container with expired token. -- This message was sent by Atlassian JIRA (v6.1#6144)
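To make the fix direction discussed on this ticket concrete, below is a minimal sketch of the idea (class and method names are made up for illustration and are not the actual RM or scheduler types): the container token is minted only when the AM actually pulls the allocated container, so time spent in the RESERVED state can never outlive the token expiry interval.
{code}
import java.util.concurrent.TimeUnit;

// Illustrative only: stand-in types, not the actual RM classes.
public class LazyContainerTokenSketch {

  static final long TOKEN_EXPIRY_MS = TimeUnit.MINUTES.toMillis(10);

  static class ContainerToken {
    final long expirationTime;
    ContainerToken(long expirationTime) { this.expirationTime = expirationTime; }
    boolean isExpired(long now) { return now > expirationTime; }
  }

  /**
   * Called from the AM-facing pull path, not from the scheduling cycle, so a
   * container that sat reserved for a long time still gets a fresh token.
   */
  ContainerToken createTokenOnPull() {
    return new ContainerToken(System.currentTimeMillis() + TOKEN_EXPIRY_MS);
  }

  public static void main(String[] args) {
    ContainerToken token = new LazyContainerTokenSketch().createTokenOnPull();
    System.out.println("expired? " + token.isExpired(System.currentTimeMillis()));
  }
}
{code}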
[jira] [Commented] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines
[ https://issues.apache.org/jira/browse/YARN-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13829400#comment-13829400 ] Omkar Vinit Joshi commented on YARN-1430: - I think for now we should add assert statements so that in test environment it will always fail making sure we are not missing some invalid transitions? YARN-1416 is one of those examples. I agree with [~vinodkv] and [~jlowe]. Probably we should be consistent everywhere and should show somewhere these system critical errors without actually crashing daemons. InvalidStateTransition exceptions are ignored in state machines --- Key: YARN-1430 URL: https://issues.apache.org/jira/browse/YARN-1430 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi We have all state machines ignoring InvalidStateTransitions. These exceptions will get logged but will not crash the RM / NM. We definitely should crash it as they move the system into some invalid / unacceptable state. * Places where we hide this exception :- ** JobImpl ** TaskAttemptImpl ** TaskImpl ** NMClientAsyncImpl ** ApplicationImpl ** ContainerImpl ** LocalizedResource ** RMAppAttemptImpl ** RMAppImpl ** RMContainerImpl ** RMNodeImpl thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1436) ZKRMStateStore should have separate configuration for retry period.
Omkar Vinit Joshi created YARN-1436: --- Summary: ZKRMStateStore should have separate configuration for retry period. Key: YARN-1436 URL: https://issues.apache.org/jira/browse/YARN-1436 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Jian He Problem :- Today we have zkSessionTimeout period which is getting used for zookeeper session timeout and for ZKRMStateStore based retry policy. Proposed suggestion :- Ideally we should have different configuration knobs for this. Ideal values for zkSessionTimeout should be :- number of zookeeper instances participating in quorum * per zookeeper session timeout. see {code} org.apache.zookeeper.ClientCnxn.ClientCnxn().. connectTimeout = sessionTimeout / hostProvider.size(); {code} retry policy... (may be retry time period or count) -- This message was sent by Atlassian JIRA (v6.1#6144)
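A rough sketch of the proposed split, assuming hypothetical property names (the real keys would live in YarnConfiguration): the ZooKeeper session timeout and the state-store retry policy become independent knobs, and the quorum size matters because ZooKeeper derives the per-host connect timeout from the session timeout.
{code}
import java.util.Properties;

// Property names here are hypothetical placeholders, not the real YarnConfiguration keys.
public class ZkStoreConfigSketch {
  static final String ZK_SESSION_TIMEOUT_MS = "yarn.resourcemanager.zk-timeout-ms";
  static final String ZK_RETRY_INTERVAL_MS  = "yarn.resourcemanager.zk-retry-interval-ms";
  static final String ZK_NUM_RETRIES        = "yarn.resourcemanager.zk-num-retries";

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(ZK_SESSION_TIMEOUT_MS, "30000"); // e.g. 3 servers * 10s each, per the comment
    conf.setProperty(ZK_RETRY_INTERVAL_MS, "2000");   // independent retry knob
    conf.setProperty(ZK_NUM_RETRIES, "500");

    long sessionTimeoutMs = Long.parseLong(conf.getProperty(ZK_SESSION_TIMEOUT_MS));
    long retryIntervalMs = Long.parseLong(conf.getProperty(ZK_RETRY_INTERVAL_MS));
    int numRetries = Integer.parseInt(conf.getProperty(ZK_NUM_RETRIES));

    // ZooKeeper's ClientCnxn divides the session timeout by the number of servers
    // to get the per-host connect timeout, which is why quorum size matters here.
    int quorumSize = 3;
    long perHostConnectTimeoutMs = sessionTimeoutMs / quorumSize;

    System.out.println("session=" + sessionTimeoutMs + "ms, perHostConnect="
        + perHostConnectTimeoutMs + "ms, retryEvery=" + retryIntervalMs
        + "ms, maxRetries=" + numRetries);
  }
}
{code}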
[jira] [Updated] (YARN-1436) ZKRMStateStore should have separate configuration for retry period.
[ https://issues.apache.org/jira/browse/YARN-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1436: Component/s: resourcemanager ZKRMStateStore should have separate configuration for retry period. --- Key: YARN-1436 URL: https://issues.apache.org/jira/browse/YARN-1436 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.1 Reporter: Omkar Vinit Joshi Assignee: Jian He Problem :- Today we have zkSessionTimeout period which is getting used for zookeeper session timeout and for ZKRMStateStore based retry policy. Proposed suggestion :- Ideally we should have different configuration knobs for this. Ideal values for zkSessionTimeout should be :- number of zookeeper instances participating in quorum * per zookeeper session timeout. see {code} org.apache.zookeeper.ClientCnxn.ClientCnxn().. connectTimeout = sessionTimeout / hostProvider.size(); {code} retry policy... (may be retry time period or count) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1436) ZKRMStateStore should have separate configuration for retry period.
[ https://issues.apache.org/jira/browse/YARN-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1436: Affects Version/s: 2.2.1 ZKRMStateStore should have separate configuration for retry period. --- Key: YARN-1436 URL: https://issues.apache.org/jira/browse/YARN-1436 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.1 Reporter: Omkar Vinit Joshi Assignee: Jian He Problem :- Today we have zkSessionTimeout period which is getting used for zookeeper session timeout and for ZKRMStateStore based retry policy. Proposed suggestion :- Ideally we should have different configuration knobs for this. Ideal values for zkSessionTimeout should be :- number of zookeeper instances participating in quorum * per zookeeper session timeout. see {code} org.apache.zookeeper.ClientCnxn.ClientCnxn().. connectTimeout = sessionTimeout / hostProvider.size(); {code} retry policy... (may be retry time period or count) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1425) TestRMRestart is failing on trunk
[ https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828041#comment-13828041 ] Omkar Vinit Joshi commented on YARN-1425: - yes yarn-tests are passing locally. TestRMRestart is failing on trunk - Key: YARN-1425 URL: https://issues.apache.org/jira/browse/YARN-1425 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1425.1.patch, error.log TestRMRestart is failing on trunk. Fixing it. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
[ https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1053: Attachment: YARN-1053.1.patch Thanks [~bikassaha] .. Added a null check and also updated test case which verifies both diagnostic message and exitCode. Diagnostic message from ContainerExitEvent is ignored in ContainerImpl -- Key: YARN-1053 URL: https://issues.apache.org/jira/browse/YARN-1053 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0, 2.2.1 Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Blocker Labels: newbie Fix For: 2.3.0, 2.2.1 Attachments: YARN-1053.1.patch, YARN-1053.20130809.patch If the container launch fails then we send ContainerExitEvent. This event contains exitCode and diagnostic message. Today we are ignoring diagnostic message while handling this event inside ContainerImpl. Fixing it as it is useful in diagnosing the failure. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable
[ https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828155#comment-13828155 ] Omkar Vinit Joshi commented on YARN-713: Also fixed YARN-1417 as a part of this. It is straight forward. ResourceManager can exit unexpectedly if DNS is unavailable --- Key: YARN-713 URL: https://issues.apache.org/jira/browse/YARN-713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Critical Fix For: 2.3.0 Attachments: YARN-713.09052013.1.patch, YARN-713.09062013.1.patch, YARN-713.1.patch, YARN-713.2.patch, YARN-713.20130910.1.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and that ultimately would cause the RM to exit. The RM should not exit during DNS hiccups. -- This message was sent by Atlassian JIRA (v6.1#6144)
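As an illustration of the "don't exit on DNS hiccups" behaviour being fixed here (simplified stand-in code, not the actual AsyncDispatcher or token-generation code), the sketch below confines the DNS lookup and turns an UnknownHostException into a handled per-container failure instead of letting it escape the event-handling thread:
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative sketch only; the real fix lives in the RM's token/allocation path.
public class DnsSafeTokenCreationSketch {

  static byte[] createContainerToken(String nodeHost) throws UnknownHostException {
    // Any DNS lookup happens here, inside the guarded section.
    InetAddress addr = InetAddress.getByName(nodeHost);
    return addr.getAddress(); // stand-in for real token material
  }

  static void handleAllocationEvent(String nodeHost) {
    try {
      byte[] token = createContainerToken(nodeHost);
      System.out.println("token bytes: " + token.length);
    } catch (UnknownHostException e) {
      // DNS hiccup: fail just this allocation (e.g. mark the container unusable)
      // rather than letting the exception escape and take the RM down.
      System.err.println("Could not resolve " + nodeHost + ", skipping: " + e);
    }
  }

  public static void main(String[] args) {
    handleAllocationEvent("definitely-not-a-real-host.invalid");
  }
}
{code}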
[jira] [Commented] (YARN-1416) InvalidStateTransitions getting reported in multiple test cases even though they pass
[ https://issues.apache.org/jira/browse/YARN-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828287#comment-13828287 ] Omkar Vinit Joshi commented on YARN-1416: - Thanks [~jianhe] I have a basic question.. the RM should have crashed, right? We can't just ignore such invalid state transitions, can we? I see that someone has modified it to log the exception but ignore it inside RMAppImpl.java.
{code}
try {
  /* keep the master in sync with the state machine */
  this.stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
  LOG.error("Can't handle this event at current state", e);
  /* TODO fail the application on the failed transition */
}
{code}
I see that in other places too we are ignoring this after logging it. I am not sure this is right, because we may silently move the system into a corrupted state without crashing/stopping it. At the very least we should add assert statements to all the state machines to make sure that such transitions don't go unnoticed. I applied the patch and tested locally.. one more test needs to be fixed..
{code}
2013-11-20 15:23:52,127 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(645)) - appattempt_1384989831257_0042_01 State change from NEW to SUBMITTED
2013-11-20 15:23:52,129 ERROR [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(593)) - Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APP_ACCEPTED at RUNNING
  at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
  at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
  at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:591)
  at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:77)
  at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions$TestApplicationEventDispatcher.handle(TestRMAppTransitions.java:139)
  at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions$TestApplicationEventDispatcher.handle(TestRMAppTransitions.java:125)
  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:159)
  at org.apache.hadoop.yarn.event.DrainDispatcher$1.run(DrainDispatcher.java:65)
  at java.lang.Thread.run(Thread.java:680)
{code}
InvalidStateTransitions getting reported in multiple test cases even though they pass - Key: YARN-1416 URL: https://issues.apache.org/jira/browse/YARN-1416 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Jian He Attachments: YARN-1416.1.patch, YARN-1416.1.patch It might be worth checking why they are reporting this. Testcase : TestRMAppTransitions, TestRM there are a large number of such errors. can't handle RMAppEventType.APP_UPDATE_SAVED at RMAppState.FAILED -- This message was sent by Atlassian JIRA (v6.1#6144)
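A small stand-alone sketch of the "assert in the test environment" suggestion (simplified stand-in types, not the real dispatcher or the actual InvalidStateTransitonException): invalid transitions are still swallowed in production, but they are recorded and assert loudly when tests run with -ea, so they cannot pass silently.
{code}
import java.util.ArrayList;
import java.util.List;

// Stand-in types; the real dispatcher and exception live in org.apache.hadoop.yarn.
public class InvalidTransitionGuardSketch {

  static class InvalidStateTransitionException extends RuntimeException {
    InvalidStateTransitionException(String msg) { super(msg); }
  }

  private final List<String> invalidTransitions = new ArrayList<>();

  void handle(String currentState, String eventType) {
    try {
      doTransition(currentState, eventType);
    } catch (InvalidStateTransitionException e) {
      // Production behaviour: log and keep the daemon alive.
      // Test behaviour: record it and assert, so the failure is visible.
      invalidTransitions.add(e.getMessage());
      assert false : e.getMessage(); // trips when tests run with -ea
    }
  }

  // Toy transition table that rejects APP_ACCEPTED while RUNNING, as in the log above.
  private void doTransition(String currentState, String eventType) {
    if ("RUNNING".equals(currentState) && "APP_ACCEPTED".equals(eventType)) {
      throw new InvalidStateTransitionException(
          "Invalid event: " + eventType + " at " + currentState);
    }
  }

  List<String> getInvalidTransitions() { return invalidTransitions; }
}
{code}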
[jira] [Created] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines
Omkar Vinit Joshi created YARN-1430: --- Summary: InvalidStateTransition exceptions are ignored in state machines Key: YARN-1430 URL: https://issues.apache.org/jira/browse/YARN-1430 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi We have all state machines ignoring InvalidStateTransitions. These exceptions will get logged but will not crash the RM / NM. We definitely should crash it as they move the system into some invalid / unacceptable state. *Places where we hide this exception :- ** JobImpl ** TaskAttemptImpl ** TaskImpl ** NMClientAsyncImpl ** ApplicationImpl ** ContainerImpl ** LocalizedResource ** RMAppAttemptImpl ** RMAppImpl ** RMContainerImpl ** RMNodeImpl thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.
[ https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1417: Attachment: YARN-1417.2.patch RM may issue expired container tokens to AM while issuing new containers. - Key: YARN-1417 URL: https://issues.apache.org/jira/browse/YARN-1417 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1417.2.patch Today we create new container token when we create container in RM as a part of schedule cycle. However that container may get reserved or assigned. If the container gets reserved and remains like that (in reserved state) for more than container token expiry interval then RM will end up issuing container with expired token. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.
[ https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828435#comment-13828435 ] Omkar Vinit Joshi commented on YARN-1417: - updating basic patch here.. without test cases .. if this approach looks ok then we can easily fix YARN-713.. RM may issue expired container tokens to AM while issuing new containers. - Key: YARN-1417 URL: https://issues.apache.org/jira/browse/YARN-1417 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1417.2.patch Today we create new container token when we create container in RM as a part of schedule cycle. However that container may get reserved or assigned. If the container gets reserved and remains like that (in reserved state) for more than container token expiry interval then RM will end up issuing container with expired token. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1431) TestWebAppProxyServlet is failing on trunk
Omkar Vinit Joshi created YARN-1431: --- Summary: TestWebAppProxyServlet is failing on trunk Key: YARN-1431 URL: https://issues.apache.org/jira/browse/YARN-1431 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.1 Reporter: Omkar Vinit Joshi Priority: Blocker Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.609 sec FAILURE! - in org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet testWebAppProxyServerMainMethod(org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet) Time elapsed: 5.006 sec ERROR! java.lang.Exception: test timed out after 5000 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:881) at java.net.InetAddress.getHostFromNameService(InetAddress.java:560) at java.net.InetAddress.getCanonicalHostName(InetAddress.java:531) at org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:227) at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:247) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer.doSecureLogin(WebAppProxyServer.java:72) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer.serviceInit(WebAppProxyServer.java:57) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer.startServer(WebAppProxyServer.java:99) at org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet.testWebAppProxyServerMainMethod(TestWebAppProxyServlet.java:187) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1431) TestWebAppProxyServlet is failing on trunk
[ https://issues.apache.org/jira/browse/YARN-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13828444#comment-13828444 ] Omkar Vinit Joshi commented on YARN-1431: - extract from surefire logs {code} 2013-11-20 18:59:47,514 INFO [Thread-4] mortbay.log (Slf4jLog.java:info(67)) - Extract jar:file:/Users/ojoshi/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/hadoop-yarn-common-3.0.0-SNAPSHOT.jar!/webapps/proxy to /var/folders/h8/dlw3rlfx0_b5zjrw7kn1752mgn/T/Jetty_localhost_57922_proxyrtadom/webapp 2013-11-20 18:59:47,678 INFO [Thread-4] mortbay.log (Slf4jLog.java:info(67)) - Started SelectChannelConnector@localhost:57922 Proxy server is started at port 57922 2013-11-20 18:59:47,797 ERROR [1023736867@qtp-568432173-0] mortbay.log (Slf4jLog.java:warn(87)) - /proxy/app org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Error parsing application ID: app at org.apache.hadoop.yarn.util.Apps.throwParseException(Apps.java:69) at org.apache.hadoop.yarn.util.Apps.toAppID(Apps.java:54) at org.apache.hadoop.yarn.util.Apps.toAppID(Apps.java:49) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:252) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1220) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) 2013-11-20 18:59:47,854 INFO [1023736867@qtp-568432173-0] webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(322)) - dr.who is accessing unchecked http://localhost:57919/foo/bar/ which is the app master GUI of application_00_0 owned by dr.who 2013-11-20 18:59:47,925 WARN [1023736867@qtp-568432173-0] webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(278)) - dr.who 
Attempting to access application_0_ that was not found {code} TestWebAppProxyServlet is failing on trunk -- Key: YARN-1431 URL: https://issues.apache.org/jira/browse/YARN-1431 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.1 Reporter: Omkar Vinit Joshi Priority: Blocker Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.609 sec FAILURE! - in org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet testWebAppProxyServerMainMethod(org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet) Time elapsed: 5.006 sec ERROR! java.lang.Exception: test timed out after 5000 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:881) at java.net.InetAddress.getHostFromNameService(InetAddress.java:560) at
[jira] [Commented] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.
[ https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826974#comment-13826974 ] Omkar Vinit Joshi commented on YARN-744: Thanks [~bikassaha] addressed your comments. Attaching a new patch. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. - Key: YARN-744 URL: https://issues.apache.org/jira/browse/YARN-744 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Priority: Minor Attachments: MAPREDUCE-3899-branch-0.23.patch, YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.2.patch, YARN-744.patch Looks like the lock taken in this is broken. It takes a lock on lastResponse object and then puts a new lastResponse object into the map. At this point a new thread entering this function will get a new lastResponse object and will be able to take its lock and enter the critical section. Presumably we want to limit one response per app attempt. So the lock could be taken on the ApplicationAttemptId key of the response map object. -- This message was sent by Atlassian JIRA (v6.1#6144)
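For reference, a simplified sketch of the locking idea described in this ticket (stand-in types, not the actual ApplicationMasterService code): the monitor is a stable per-attempt lock object that stays in the map, so replacing the lastResponse no longer hands a second thread a different lock.
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch only; types are simplified stand-ins.
public class PerAttemptLockSketch {

  static class ResponseLock {
    private Object lastResponse = new Object();
    Object getLastResponse() { return lastResponse; }
    void setLastResponse(Object r) { lastResponse = r; }
  }

  private final ConcurrentMap<String, ResponseLock> responseMap = new ConcurrentHashMap<>();

  void registerAttempt(String appAttemptId) {
    responseMap.putIfAbsent(appAttemptId, new ResponseLock());
  }

  Object allocate(String appAttemptId) {
    ResponseLock lock = responseMap.get(appAttemptId);
    if (lock == null) {
      return null; // unknown / unregistered attempt
    }
    synchronized (lock) {                 // stable per-attempt monitor
      Object newResponse = new Object();  // build the real AllocateResponse here
      lock.setLastResponse(newResponse);  // replacing the response keeps the same lock
      return newResponse;
    }
  }
}
{code}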
[jira] [Updated] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.
[ https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-744: --- Attachment: YARN-744.2.patch Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. - Key: YARN-744 URL: https://issues.apache.org/jira/browse/YARN-744 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Priority: Minor Attachments: MAPREDUCE-3899-branch-0.23.patch, YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.2.patch, YARN-744.patch Looks like the lock taken in this is broken. It takes a lock on lastResponse object and then puts a new lastResponse object into the map. At this point a new thread entering this function will get a new lastResponse object and will be able to take its lock and enter the critical section. Presumably we want to limit one response per app attempt. So the lock could be taken on the ApplicationAttemptId key of the response map object. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
[ https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1053: Affects Version/s: 2.2.1 2.2.0 Diagnostic message from ContainerExitEvent is ignored in ContainerImpl -- Key: YARN-1053 URL: https://issues.apache.org/jira/browse/YARN-1053 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0, 2.2.1 Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Blocker Labels: newbie Fix For: 2.3.0, 2.2.1 Attachments: YARN-1053.20130809.patch If the container launch fails then we send ContainerExitEvent. This event contains exitCode and diagnostic message. Today we are ignoring diagnostic message while handling this event inside ContainerImpl. Fixing it as it is useful in diagnosing the failure. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1425) TestRMRestart is failing on trunk
Omkar Vinit Joshi created YARN-1425: --- Summary: TestRMRestart is failing on trunk Key: YARN-1425 URL: https://issues.apache.org/jira/browse/YARN-1425 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi TestRMRestart is failing on trunk. Fixing it. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1425) TestRMRestart is failing on trunk
[ https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1425: Attachment: error.log [issue was seen|https://builds.apache.org/job/PreCommit-YARN-Build/2486//testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testRMRestartWaitForPreviousAMToFinish/] TestRMRestart is failing on trunk - Key: YARN-1425 URL: https://issues.apache.org/jira/browse/YARN-1425 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: error.log TestRMRestart is failing on trunk. Fixing it. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.
[ https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13827011#comment-13827011 ] Omkar Vinit Joshi commented on YARN-744: Test failure is not related to this. Opened ticket YARN-1425 to track this. Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. - Key: YARN-744 URL: https://issues.apache.org/jira/browse/YARN-744 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Priority: Minor Attachments: MAPREDUCE-3899-branch-0.23.patch, YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.2.patch, YARN-744.patch Looks like the lock taken in this is broken. It takes a lock on lastResponse object and then puts a new lastResponse object into the map. At this point a new thread entering this function will get a new lastResponse object and will be able to take its lock and enter the critical section. Presumably we want to limit one response per app attempt. So the lock could be taken on the ApplicationAttemptId key of the response map object. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1425) TestRMRestart is failing on trunk
[ https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13827075#comment-13827075 ] Omkar Vinit Joshi commented on YARN-1425: - just discovered that MockRM.waitForState(appAttempt, RMAppAttemptState) simply ignores the passed-in application attempt and always considers the current application attempt. Fixing it. *RMAppAttempt attempt = app.getCurrentAppAttempt();*
{code}
public void waitForState(ApplicationAttemptId attemptId,
    RMAppAttemptState finalState) throws Exception {
  RMApp app = getRMContext().getRMApps().get(attemptId.getApplicationId());
  Assert.assertNotNull("app shouldn't be null", app);
  RMAppAttempt attempt = app.getCurrentAppAttempt();
  int timeoutSecs = 0;
  while (!finalState.equals(attempt.getAppAttemptState()) && timeoutSecs++ < 40) {
    System.out.println("AppAttempt : " + attemptId + " State is : "
        + attempt.getAppAttemptState() + " Waiting for state : " + finalState);
    Thread.sleep(1000);
  }
  System.out.println("Attempt State is : " + attempt.getAppAttemptState());
  Assert.assertEquals("Attempt state is not correct (timedout)", finalState,
      attempt.getAppAttemptState());
}
{code}
TestRMRestart is failing on trunk - Key: YARN-1425 URL: https://issues.apache.org/jira/browse/YARN-1425 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: error.log TestRMRestart is failing on trunk. Fixing it. -- This message was sent by Atlassian JIRA (v6.1#6144)
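A sketch of the fix being described, assuming RMApp exposes a lookup of an attempt by its id (as the comment implies): resolve the attempt from the attemptId argument instead of always taking the current attempt.
{code}
public void waitForState(ApplicationAttemptId attemptId,
    RMAppAttemptState finalState) throws Exception {
  RMApp app = getRMContext().getRMApps().get(attemptId.getApplicationId());
  Assert.assertNotNull("app shouldn't be null", app);
  // Look up the attempt that was actually asked about instead of the current one.
  RMAppAttempt attempt = app.getRMAppAttempt(attemptId);
  int timeoutSecs = 0;
  while (!finalState.equals(attempt.getAppAttemptState()) && timeoutSecs++ < 40) {
    System.out.println("AppAttempt : " + attemptId + " State is : "
        + attempt.getAppAttemptState() + " Waiting for state : " + finalState);
    Thread.sleep(1000);
  }
  Assert.assertEquals("Attempt state is not correct (timedout)", finalState,
      attempt.getAppAttemptState());
}
{code}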
[jira] [Updated] (YARN-1425) TestRMRestart is failing on trunk
[ https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1425: Attachment: YARN-1425.1.patch TestRMRestart is failing on trunk - Key: YARN-1425 URL: https://issues.apache.org/jira/browse/YARN-1425 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1425.1.patch, error.log TestRMRestart is failing on trunk. Fixing it. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking
[ https://issues.apache.org/jira/browse/YARN-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1363: Attachment: YARN-1363.1.patch Work in progress patch.. YARN-1363.1.patch Get / Cancel / Renew delegation token api should be non blocking Key: YARN-1363 URL: https://issues.apache.org/jira/browse/YARN-1363 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: YARN-1363.1.patch Today GetDelgationToken, CancelDelegationToken and RenewDelegationToken are all blocking apis. * As a part of these calls we try to update RMStateStore and that may slow it down. * Now as we have limited number of client request handlers we may fill up client handlers quickly. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1421) Node managers will not receive application finish event where containers ran before RM restart
Omkar Vinit Joshi created YARN-1421: --- Summary: Node managers will not receive application finish event where containers ran before RM restart Key: YARN-1421 URL: https://issues.apache.org/jira/browse/YARN-1421 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Critical Problem :- Today for every application we track the node managers where container ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Propose Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. when node manager heartbeats saying it is running application (app1, app2) then we should those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1421) Node managers will not receive application finish event where containers ran before RM restart
[ https://issues.apache.org/jira/browse/YARN-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1421: Description: Problem :- Today for every application we track the node managers where containers ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Proposed Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. when node manager heartbeats saying it is running application (app1, app2) then we should check those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. By doing this we are reducing the state which we need to store at RM after restart. was: Problem :- Today for every application we track the node managers where container ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Propose Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. when node manager heartbeats saying it is running application (app1, app2) then we should those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. Node managers will not receive application finish event where containers ran before RM restart -- Key: YARN-1421 URL: https://issues.apache.org/jira/browse/YARN-1421 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Critical Problem :- Today for every application we track the node managers where containers ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Proposed Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. when node manager heartbeats saying it is running application (app1, app2) then we should check those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. By doing this we are reducing the state which we need to store at RM after restart. -- This message was sent by Atlassian JIRA (v6.1#6144)
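The proposed heartbeat-driven check could look roughly like the sketch below (stand-in types rather than the real RMContext/RMApp classes): for every application the node manager reports as running, the RM tells it to clean up if the application is unknown or already in a final state, so no per-application NM locations need to survive an RM restart.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified illustration; not the actual ResourceTrackerService code.
public class NodeHeartbeatAppCheckSketch {

  enum AppState { NEW, RUNNING, FINISHED, FAILED, KILLED }

  static boolean isFinalState(AppState s) {
    return s == AppState.FINISHED || s == AppState.FAILED || s == AppState.KILLED;
  }

  /** Returns the applications the heartbeating NM should clean up. */
  static List<String> appsToCleanup(List<String> runningAppsOnNode,
      Map<String, AppState> rmApps) {
    List<String> toCleanup = new ArrayList<>();
    for (String appId : runningAppsOnNode) {
      AppState state = rmApps.get(appId);
      if (state == null || isFinalState(state)) {
        // Unknown (very old) application or already completed: notify the NM now.
        toCleanup.add(appId);
      }
    }
    return toCleanup;
  }
}
{code}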
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.8.patch Thanks [~vinodkv] for pointing it out..didn't understand earlier. Adding synchronized block to service state change. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.7.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch, YARN-1210.7.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.9.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1422) RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing
[ https://issues.apache.org/jira/browse/YARN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825988#comment-13825988 ] Omkar Vinit Joshi commented on YARN-1422: - Yes, this looks to be a problem. Check this [synchronization locking problem | https://issues.apache.org/jira/browse/YARN-897?focusedCommentId=13706284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13706284] The ordering should always be from the root to the leaf queue. I think there can be other places too where this ordering is mixed. RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing Key: YARN-1422 URL: https://issues.apache.org/jira/browse/YARN-1422 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.2.0 Reporter: Adam Kawa Priority: Critical If getQueueUserAclInfo() on a parent/root queue (e.g. via CapacityScheduler.getQueueUserAclInfo) is called, and a container is completing, then the ResourceManager can deadlock. It is similar to https://issues.apache.org/jira/browse/YARN-325. *More details:* * Thread A 1) In a synchronized block of code (a lockid 0xc18d8870=LeafQueue.class), LeafQueue.completedContainer wants to inform the parent queue that a container is being completed and invokes the ParentQueue.completedContainer method. 3) ParentQueue.completedContainer waits to acquire a lock on itself (a lockid 0xc1846350=ParentQueue.class) to enter its synchronized block of code. It cannot acquire this lock, because Thread B already holds it. * Thread B 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This method invokes a synchronized method on ParentQueue.class, i.e. ParentQueue.getQueueUserAclInfo (a lockid 0xc1846350=ParentQueue.class), and acquires the lock that Thread A will be waiting for. 2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over the child queue ACLs and wants to run a synchronized method, LeafQueue.getQueueUserAclInfo, but it does not have a lock on LeafQueue.class (a lockid 0xc18d8870=LeafQueue.class). This lock is already held by LeafQueue.completedContainer in Thread A. The order that causes the deadlock: B0 - A1 - B2 - A3.
*Java Stacktrace* {code} Found one Java-level deadlock: = 1956747953@qtp-109760451-1959: waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue), which is held by IPC Server handler 39 on 8032 IPC Server handler 39 on 8032: waiting to lock monitor 0x422bbc58 (object 0xc18d8870, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue), which is held by ResourceManager Event Processor ResourceManager Event Processor: waiting to lock monitor 0x434e10c8 (object 0xc1846350, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue), which is held by IPC Server handler 39 on 8032 Java stack information for the threads listed above: === 1956747953@qtp-109760451-1959: at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276) - waiting to lock 0xc1846350 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.init(CapacitySchedulerInfo.java:49) at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at
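To illustrate the root-to-leaf lock ordering rule mentioned in the comment (simplified classes, not the real CapacityScheduler queues), both code paths in the sketch below take the parent queue's monitor before the leaf queue's, which removes the B0 - A1 - B2 - A3 interleaving:
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; stand-in types, not the actual scheduler queues.
public class QueueLockOrderingSketch {

  static class ParentQueue {
    private final List<LeafQueue> children = new ArrayList<>();

    void addChild(LeafQueue leaf) { children.add(leaf); }

    // Path B: already parent-then-leaf.
    synchronized void getQueueUserAclInfo() {
      for (LeafQueue leaf : children) {
        leaf.getQueueUserAclInfo();
      }
    }

    synchronized void completedContainer() {
      // parent bookkeeping
    }
  }

  static class LeafQueue {
    private final ParentQueue parent;
    LeafQueue(ParentQueue parent) { this.parent = parent; }

    synchronized void getQueueUserAclInfo() {
      // read leaf ACLs
    }

    // Path A: take the parent's monitor first, then this leaf's, matching path B.
    void completedContainer() {
      synchronized (parent) {
        synchronized (this) {
          // update leaf bookkeeping
        }
        parent.completedContainer(); // parent monitor is re-entrant here
      }
    }
  }

  public static void main(String[] args) {
    ParentQueue root = new ParentQueue();
    LeafQueue leaf = new LeafQueue(root);
    root.addChild(leaf);
    leaf.completedContainer();
    root.getQueueUserAclInfo();
  }
}
{code}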
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826055#comment-13826055 ] Omkar Vinit Joshi commented on YARN-674: [~bikassaha] I completely missed your comment. What you are saying will not occur. {code} pool.allowCoreThreadTimeOut(true); {code} this should time out core threads if there are any lying around. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826066#comment-13826066 ] Omkar Vinit Joshi commented on YARN-674: I think we should just ignore the find bug warning.. it is never going to occur...plus TestRMRestart is passing locally... there must be some race condition here not related to this patch. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1416) InvalidStateTransitions getting reported in multiple test cases even though they pass
Omkar Vinit Joshi created YARN-1416: --- Summary: InvalidStateTransitions getting reported in multiple test cases even though they pass Key: YARN-1416 URL: https://issues.apache.org/jira/browse/YARN-1416 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Jian He It might be worth checking why they are reporting this. Testcase : TestRMAppTransitions, TestRM there are large number of such errors. can't handle RMAppEventType.APP_UPDATE_SAVED at RMAppState.FAILED -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.
[ https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824175#comment-13824175 ] Omkar Vinit Joshi commented on YARN-1417: - Fixing this as a part of YARN-713 where I am restructuring the token generation logic. RM may issue expired container tokens to AM while issuing new containers. - Key: YARN-1417 URL: https://issues.apache.org/jira/browse/YARN-1417 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Today we create new container token when we create container in RM as a part of schedule cycle. However that container may get reserved or assigned. If the container gets reserved and remains like that (in reserved state) for more than container token expiry interval then RM will end up issuing container with expired token. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.
Omkar Vinit Joshi created YARN-1417: --- Summary: RM may issue expired container tokens to AM while issuing new containers. Key: YARN-1417 URL: https://issues.apache.org/jira/browse/YARN-1417 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Today we create new container token when we create container in RM as a part of schedule cycle. However that container may get reserved or assigned. If the container gets reserved and remains like that (in reserved state) for more than container token expiry interval then RM will end up issuing container with expired token. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable
[ https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-713: --- Attachment: YARN-713.2.patch ResourceManager can exit unexpectedly if DNS is unavailable --- Key: YARN-713 URL: https://issues.apache.org/jira/browse/YARN-713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Critical Fix For: 2.3.0 Attachments: YARN-713.09052013.1.patch, YARN-713.09062013.1.patch, YARN-713.1.patch, YARN-713.2.patch, YARN-713.20130910.1.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and that ultimately would cause the RM to exit. The RM should not exit during DNS hiccups. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13823100#comment-13823100 ] Omkar Vinit Joshi commented on YARN-1210: - Thanks [~vinodkv] bq. cleanupContainersOnNMResync: We are no longer making the call to getNodeStatusAndUpdateContainersInContext, can you please put a comment as to why - I believe this is so that NodeStatusUpdater can eventually take these statuses up when it reregisters. yes they are used when NM re-registers with RM. added comment.. bq. use getContainerState instead of cloneAndGetContainerStatus? They are different. bq. Use RegisterNodeManagerRequest.newInstance() in registerWithRM? bq. Similarly NodeStatus.newInstance, NodeHealthStatus.newInstance? they were missing added them and fixed NodeStatusUpdater. bq. As of now because we kill all containers it's fine, but it's better to explicitly check for master-container's state during registration and then only send the event. bq. Also put a comment as to why we are directly faking RMAppAttemptContainerFinishedEvent instead of informing RMContainerImpl. But we don't know about the container today..right? bq. Instead of sending and ignoring ATTEMPT_FAILED at FAILED state, we can skip sending this event by RMAppAttempt if the app was already in a final state? Ok.. should I also remove the similar transition from FINISHED / KILLED? address all other comments. During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.6.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822026#comment-13822026 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~vinodkv] bq. RMAppManager.submitApplication: Put a comment where you move apps to finish state saying we are doing this before token-renewal so that we don't renew tokens for finished apps. Added a comment. bq. isServiceStarted needs to be volatile? No.. it is updated only once just when service starts. bq. handleDTRenewerEvent - handleDTRenewerAppSubmitEvent done.. bq. Add a comment in handleDTRenewerEvent to indicate why DTRenewer is starting the app as opposed to RMAppManager. added one.. bq. Instead of putting renewerCount in the main code path, you can access the thread count from ThreadPoolExecutor.getPoolSize() in the tests directly ? moved this to test code. bq. DelegationTokenRenewerAppSubmitEvent can be nested class inside DelegationTokenRenewer? This is not an event from outside the renewer. Similarly DelegationTokenRenewerEventType. Either nest them in, or create a separate package. moved the events and eventType inside DTTokenRenewer. bq. testInvalidDelegationTokenApplicationSubmit, testInvalidDTWithAddApplication: Seem similar but test different things. May be rename one or both? renamed both.. bq. The other point is the default number of threads in the renewer. 5 is too small, may be bump it up to existing number of RPC threads - 50 or something in that range? using thread pool with core pool size = 5 and max pool size = 50 (configurable). Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
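For illustration, a sketch of a renewer pool configured along the lines described above (core size 5, configurable max of 50, idle core threads allowed to time out); the constants and queue choice are placeholders rather than the values the patch actually uses:
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RenewerPoolSketch {
  public static void main(String[] args) {
    int corePoolSize = 5;     // always-available renewer threads
    int maxPoolSize = 50;     // would come from configuration
    long keepAliveSecs = 3;

    // A bounded queue lets the pool actually grow toward maxPoolSize under load;
    // with an unbounded queue the pool would never exceed the core size.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(corePoolSize, maxPoolSize,
        keepAliveSecs, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>(100));
    // Idle core threads are allowed to die, so no renewer threads linger around
    // after a burst of application submissions.
    pool.allowCoreThreadTimeOut(true);

    pool.execute(() -> System.out.println("renewing delegation tokens..."));
    pool.shutdown();
  }
}
{code}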
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.7.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, YARN-674.7.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820378#comment-13820378 ] Omkar Vinit Joshi commented on YARN-1338: - Thanks [~jlowe] bq. I would rather not tie a checksum to this. Corruption of the file isn't related to whether the NM is restarting, and it seems odd to only check for corruption on restart rather than every time the resource is requested. IMHO we should treat checksums for localized resources as an orthogonal feature request to this. (It would also significantly slow down the recovery time if the NM had to checksum-compare everything in the distcache on startup.) Yes I completely agree..checksum should be an additional feature rather than done as a part of this. bq. So if we persist the LocalResourceRequest to LocalizedResource map then we can tell after a recovery whether we already have the requested resource or not when a new request arrives. Agreed. This way we will have all the information we need to reconstruct the cache. bq. We have a very rough start on persisting the local cache state, and I plan on working on this in earnest in the next few weeks. good ... any thoughts on how and when we are planning to store the container's resource request and newly downloaded resource request to persistent store? * clearly for resource request it should be quite clear. When download finishes and resource is marked as LOCALIZED..we should save the info...(the way RMRestart is doing today for RMAppImpl...NEW...to...NEW_SAVING...to...SUBMITTED) * But for container request it will become little bit tricky... ** When we initially get resource request for all the required resources during container start? ** or when individual resource request gets satisfied (as they are added to ref of LocalizedResource) ** or when for container all the resources are downloaded / localized? 3rd scenario looks good to me because * by then we will have information about all the localized resources. If downloading failed for any of them then we frankly don't care about storing partial success so we can avoid this write. * Also when container finishes / fails we can simply remove the entry Any thoughts whether we want to avoid container start before we process all the writes to store or can we start in parallel? Clearly parallel writes don't look good to me because if any of the write events are in flight and nm restarts then after restart we won't know about those changes..but at the same time if we wait for all the writes to go through then we are delaying container start by that duration. Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.1#6144)
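A very rough sketch of the recovery bookkeeping being discussed (stand-in types, not the real NM state store): a resource is recorded only once it reaches LOCALIZED, so partial downloads are never persisted and, after a restart, anything in the store maps a request to a path already on local disk.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; the real store would persist to disk, not to a map.
public class LocalResourceRecoverySketch {

  private final Map<String, String> store = new ConcurrentHashMap<>(); // request -> local path

  // Called on the resource's LOCALIZED transition; partial downloads are never stored.
  void onResourceLocalized(String resourceKey, String localPath) {
    store.put(resourceKey, localPath);
  }

  void onResourceReleased(String resourceKey) {
    store.remove(resourceKey);
  }

  /** After restart: anything present in the store is already on local disk. */
  boolean isAlreadyLocalized(String resourceKey) {
    return store.containsKey(resourceKey);
  }
}
{code}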
[jira] [Updated] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable
[ https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-713: --- Attachment: YARN-713.1.patch ResourceManager can exit unexpectedly if DNS is unavailable --- Key: YARN-713 URL: https://issues.apache.org/jira/browse/YARN-713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Priority: Critical Fix For: 2.3.0 Attachments: YARN-713.09052013.1.patch, YARN-713.09062013.1.patch, YARN-713.1.patch, YARN-713.20130910.1.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and that ultimately would cause the RM to exit. The RM should not exit during DNS hiccups. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.
[ https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-744: --- Attachment: YARN-744.1.patch Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated. - Key: YARN-744 URL: https://issues.apache.org/jira/browse/YARN-744 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Priority: Minor Attachments: MAPREDUCE-3899-branch-0.23.patch, YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.patch Looks like the lock taken in this is broken. It takes a lock on lastResponse object and then puts a new lastResponse object into the map. At this point a new thread entering this function will get a new lastResponse object and will be able to take its lock and enter the critical section. Presumably we want to limit one response per app attempt. So the lock could be taken on the ApplicationAttemptId key of the response map object. -- This message was sent by Atlassian JIRA (v6.1#6144)
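Editor's note: one way to close the race described in YARN-744 is to synchronize on a stable per-attempt lock object held in the map, instead of on the lastResponse instance that gets swapped out. The sketch below uses simplified types (String instead of ApplicationAttemptId, Object instead of AllocateResponse) and is not necessarily the committed fix.
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class AttemptLock {
  Object lastResponse;  // guarded by the AttemptLock monitor
}

class AllocateHandlerSketch {
  private final ConcurrentMap<String, AttemptLock> responseMap = new ConcurrentHashMap<>();

  Object allocate(String appAttemptId, Object request) {
    AttemptLock lock = responseMap.get(appAttemptId);
    if (lock == null) {
      throw new IllegalStateException("Unknown attempt " + appAttemptId);
    }
    // The lock object never changes for the lifetime of the attempt, so a second
    // thread entering for the same attempt blocks here instead of acquiring the
    // monitor of a freshly inserted lastResponse object.
    synchronized (lock) {
      Object response = buildResponse(request, lock.lastResponse);
      lock.lastResponse = response;  // swap the payload, not the lock
      return response;
    }
  }

  private Object buildResponse(Object request, Object lastResponse) {
    return new Object();  // placeholder for the real scheduling work
  }
}
{code}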
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13819358#comment-13819358 ] Omkar Vinit Joshi commented on YARN-1210: - thanks [~jianhe] bq. Please revert all files with import-only changes, CotainerLauncher etc. bq. revert NodeManager, TestNodeManagerResync changes. done.. bq. rename RegisterNodeManagerRequest.addAllContainerStatuses to setContainerStatuses to conform with convention namings. bq. Bug in RegisterNodeManagerRequestPBImpl.addAllContainerStatuses(), you are appending the containers to the existing container list instead of set. also no need to call initFinishedContainers() inside. Can you see what NodeStatusPBImpl.(get/set)ContainerStatuses() is doing? I am following NodeHeartbeatResponse convention. bq. RMAppAttemptImpl.getRecoveredFinalState not used, removed. removed.. bq. wrong comment, am2 attempt should be at Launched state. There's one more such same wrong comment at line 537 fixed. bq. waiting for 20 secs is too long for a unit test, can you pick a value as small as possible ? I am making it 10 secs.. bq. no need to create new method waitForAppAttemptToExpire(), we can just use MockRM.waitForState() bq.similarly for newly created methods, waitForRMToProcessAllEvents, waitForRMAppAttempts can be achieved by just waiting for the specific app or attempt state. That achieves the same result as waiting for all events get processed. fixed.. bq. For the 3rd case, shouldn't we test that as you said in the comment all the stored attempts had finished then new attempt should be started immediately last part of test case is actually testing that only.. bq. we also need one more test case that, if RM crashes before attempt initial state info is saved in RMStateStore. App will be recovered with no attempt associated with it. For that we have no chance to replay the AttemptRecovered logic to start a new attempt, App itself should be able to start a new attempt. added one. During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1338: Description: Today when node manager restarts we clean up all the Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Ravi Prakash Today when node manager restarts we clean up all the -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13819678#comment-13819678 ] Omkar Vinit Joshi commented on YARN-1338: - Here are certain things which we may want to track as part of this. * Info from LocalizedResource ** Local Disk Path ** timestamp ** RemoteUrl (Here do we need to trust that the old and new url are identical..not changed)? ** we store the resources inside the distributed cache in an hierarchical manner (to avoid unix directory limit)... we may need to recover that too). ** checksum? * We will also need to track containers which are using this resource. It would be better if we isolate this from the place where we are storing LocalizedResource thereby changes to this will be minimal. ** Do we need to store the symlink we are creating? anyone working on this actively? Recover localized resource cache state upon nodemanager restart --- Key: YARN-1338 URL: https://issues.apache.org/jira/browse/YARN-1338 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Ravi Prakash Today when node manager restarts we clean up all the distributed cache files from disk. This is definitely not ideal from 2 aspects. * For work preserving restart we definitely want them as running containers are using them * For even non work preserving restart this will be useful in the sense that we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.1#6144)
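Editor's note: to make the list above concrete, a hypothetical persisted record could look roughly like the sketch below. All field and class names are illustrative; whether a checksum belongs here, and whether the remote URL can be trusted across restarts, are exactly the open questions raised in the comment.
{code}
import java.util.Set;

// Per-resource record (what the LocalizedResource knows about itself).
class PersistedLocalResource {
  String remoteUrl;       // open question: do we trust that old and new URLs are identical?
  long remoteTimestamp;
  String localDiskPath;   // includes the hierarchical sub-directory the NM chose
  long size;
  // checksum deliberately omitted; treated as an orthogonal feature per the earlier discussion
}

// Kept separate from the resource record, as suggested above, so changes stay minimal.
class PersistedResourceRefs {
  String resourceKey;                // e.g. remoteUrl + timestamp
  Set<String> referencingContainers;
}
{code}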
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.5.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817941#comment-13817941 ] Omkar Vinit Joshi commented on YARN-1210: - Attaching rebased patch. I slightly modified the logic of the RMRestart app recovery code. * If the application doesn't have any attempt then a new attempt is started when we do submitApplication as a part of recovery. * If the application has one or more application attempts then the attempt recovery takes place in 2 steps (a sketch follows this message). ** All the application attempts except the last attempt are recovered first. ** When we do submitApplication as a part of application recovery we replay the last attempt. *** If the last attempt doesn't have any finalRecoveredState stored then it is treated as one for which the AM may or may not have started/finished. So we move this application attempt into the LAUNCHED state, add it to the AMLivenessMonitor and move the application to the RUNNING state. *** If the last attempt was in either FAILED/KILLED/FINISHED state then we replay that attempt's BaseFinalTransition by recovering the attempt synchronously here. Adding tests to cover the scenarios below: * A new application attempt is not started until the previous AM container finish event is reported back to the RM as a part of NM registration. * If the previous AM container finish event is never reported back (i.e. the node manager on which this AM container was running also went down), the AMLivenessMonitor should time out the previous attempt and start a new attempt. * If all the stored attempts had finished then a new attempt should be started immediately. During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
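Editor's note: a rough sketch of the last-attempt replay logic described above. The StoredAttempt type and the method names are simplifications, not the actual RMAppImpl/RMAppAttemptImpl code.
{code}
import java.util.List;

class AttemptRecoverySketch {

  enum FinalState { FAILED, KILLED, FINISHED }

  // Illustrative stand-in for the stored attempt data.
  static class StoredAttempt {
    FinalState finalRecoveredState;  // null if no final state was ever saved
  }

  void recoverApplication(List<StoredAttempt> attempts) {
    // 1. Recover everything except the last attempt first.
    for (int i = 0; i < attempts.size() - 1; i++) {
      recoverFinishedAttempt(attempts.get(i));
    }
    // 2. Replay the last attempt as part of submitApplication-based recovery.
    StoredAttempt last = attempts.isEmpty() ? null : attempts.get(attempts.size() - 1);
    if (last == null) {
      // No attempt was ever saved: start a fresh attempt right away.
      startNewAttempt();
    } else if (last.finalRecoveredState == null) {
      // AM may or may not still be running: move the attempt to LAUNCHED,
      // register it with the AMLivenessMonitor, move the app to RUNNING.
      moveAttemptToLaunchedAndMonitor(last);
    } else {
      // FAILED / KILLED / FINISHED: replay the final transition synchronously.
      replayFinalTransition(last);
    }
  }

  void recoverFinishedAttempt(StoredAttempt a) {}
  void startNewAttempt() {}
  void moveAttemptToLaunchedAndMonitor(StoredAttempt a) {}
  void replayFinalTransition(StoredAttempt a) {}
}
{code}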
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.4.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.4.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, YARN-1210.4.patch, YARN-1210.4.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814291#comment-13814291 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~bikassaha] for the review bq. We were intentionally going through the same submitApplication() method to make sure that all the initialization and setup code paths are consistently followed in both cases by keeping the code path identical as much as possible. The RM would submit a recovered application, in essence proxying a user submitting the application. Its a general pattern followed through the recovery logic - to be minimally invasive to the mainline code path so that we can avoid functional bugs as much as possible. Separating them into 2 methods has resulted in code duplication in both methods without any huge benefit that I can see. It also leave us susceptible to future code changes made in one code path and not the other. I agree with your suggestion... reverting the changes ..discussed with [~vinodkv] offline. bq. Why is isSecurityEnabled() being checked at this internal level. The code should not even reach this point if security is not enabled. you have a point ..fixing it.. bq. Also why is it calling rmContext.getDelegationTokenRenewer().addApplication(event) instead of DelegationTokenRenewer.this.addApplication(). Same for rmContext.getDelegationTokenRenewer().applicationFinished(evt); Makes sense...fixed it.. bq. Rename DelegationTokenRenewerThread to not have misleading Thread in the name ? fixed. bq. Can DelegationTokenRenewerAppSubmitEvent event objects have an event type different from VERIFY_AND_START_APPLICATION? If not, we dont need this check and we can change the constructor of DelegationTokenRenewerAppSubmitEvent to not expect an event type argument. It should set the VERIFY_AND_START_APPLICATION within the constructor. fixed.. bq. @Private + @VisibleForTesting??? fixed. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
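Editor's note: for the constructor comment above, the shape of the fix is roughly the following. The class and enum names mirror the ones mentioned in the review, but the surrounding details (constructor arguments, other event types) are assumptions.
{code}
// Sketch: the submit event pins its own type, so callers cannot pass anything
// other than VERIFY_AND_START_APPLICATION.
enum DelegationTokenRenewerEventType {
  VERIFY_AND_START_APPLICATION,
  FINISH_APPLICATION
}

class DelegationTokenRenewerEvent {
  private final DelegationTokenRenewerEventType type;

  DelegationTokenRenewerEvent(DelegationTokenRenewerEventType type) {
    this.type = type;
  }

  DelegationTokenRenewerEventType getType() {
    return type;
  }
}

class DelegationTokenRenewerAppSubmitEvent extends DelegationTokenRenewerEvent {
  DelegationTokenRenewerAppSubmitEvent(/* application-specific fields */) {
    super(DelegationTokenRenewerEventType.VERIFY_AND_START_APPLICATION);
  }
}
{code}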
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.5.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814496#comment-13814496 ] Omkar Vinit Joshi commented on YARN-1210: - Thanks [~jianhe] for reviewing it. {code} Instead of passing running containers as parameter in RegisterNodeManagerRequest, is it possible to just call heartBeat immediately after registerCall and then unBlockNewContainerRequests ? That way we can take advantage of the existing heartbeat logic, cover other things like keep app alive for log aggregation after AM container completes. Or at least we can send the list of ContainerStatus(including diagnostics) instead of just container Ids and also the list of keep-alive apps (separate jira)? {code} it makes sense replacing finishedContainers with containerStatuses. bq. Unnecessary import changes in DefaultContainerExecutor.java and LinuxContainerExecutor, ContainerLaunch, ContainersLauncher actually I wanted that earlier as I had created new ExitCode.java. I wanted to access it from ResourceTrackerService. Now since we are sending container status from node manager itself so no longer need that ..fixed it. bq. Finished containers may not necessary be killed. The containers can also normal finish and remain in the NM cache before NM resync. Updated the logic for cleanupContainers on node manager side. Now we should have all the finishedContainer statuses as it is. bq. wrong LOG class name. :) fixed it.. bq. LogFactory.getLog(RMAppImpl.class); removed. bq. Isn't always the case that after this patch only the last attempt can be running ? a new attempt will not be launched until the previous attempt reports back it really exits. If this is case, it can be a bug. We may only need to check that if the last attempt is finished or not. It is actually checking for any attempt to be in non running state. Do you want me to only check last attempt (by comparing application attempt ids)?. bq. should we return RUNNING or ACCEPTED for apps that are not in final state ? It's ok to return RUNNING in the scope of this patch because anyways we are launching a new attempt. Later on in working preserving restart, RM can crash before attempt register, attempt can register with RM after RM comes back in which case we can then move app from ACCEPTED to RUNNING? Yes right now I will keep it as RUNNING only. Today we don't have any information whether previous application master started and registered or not. Once we will have that information then probably we can do this. During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. 
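Editor's note: a sketch of the registration-payload change agreed on above, i.e. reporting full container statuses (exit code, diagnostics) instead of bare container ids, with set-rather-than-append semantics. The types and method names here are illustrative and not the actual RegisterNodeManagerRequest API.
{code}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a ContainerStatus.
class ContainerStatusSketch {
  String containerId;
  int exitStatus;
  String diagnostics;
}

class RegisterRequestSketch {
  private final List<ContainerStatusSketch> containerStatuses = new ArrayList<>();

  // setContainerStatuses rather than addAllContainerStatuses, per the naming comment;
  // note it replaces the existing list instead of appending to it.
  void setContainerStatuses(List<ContainerStatusSketch> statuses) {
    containerStatuses.clear();
    containerStatuses.addAll(statuses);
  }

  List<ContainerStatusSketch> getContainerStatuses() {
    return containerStatuses;
  }
}
{code}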
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.3.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814541#comment-13814541 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~jianhe], [~bikassaha] . bq. Saw this is changed back to asynchronous submission on recovery, the original intention was to prevent client from seeing the application as a new application. If asynchronously, the client can query the application before recover event gets processed, meaning before the application is fully recovered as some recover logic happens when app is processing the recover event(app.FinalTransition). fixed to make sure that it gets updated synchronously. bq. The assert doesnt make it to the production jar - so it wont catch anything on the cluster. Need to throw an exception here. If we dont want to crash the RM here then we can log and error. When the attempt state machine gets the event then it will crash on the async dispatcher thread if the event is not handled in the current state. discussed with bikas offline.. this is fine. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.2.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813109#comment-13813109 ] Omkar Vinit Joshi commented on YARN-1210: - Completely removed the RECOVERED state; the rest of the patch is the same. The only major differences are: * Before launching a new appAttempt the RM checks whether any of the application attempts were running before. If so, the RM waits instead of starting a new application attempt. If no application attempts are found to be running (i.e. in anything other than a final state), it launches a new application attempt. * When the node manager receives the resync signal it kills all the running containers and then reports the killed containers back to the RM during registration. On receiving the container information the RM checks whether any of the reported containers is an AM container. If so, it sends a container_failed event to the related app attempt, which eventually starts a new application attempt (a sketch follows this message). During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
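Editor's note: a sketch of the RM-side handling described in the second bullet above, using simplified types. isApplicationMasterContainer and the event-sending method are assumptions standing in for the real RMAppAttempt event plumbing.
{code}
import java.util.List;

class ResyncHandlerSketch {

  // Illustrative stand-in for a reported ContainerStatus.
  static class ReportedContainer {
    String containerId;
    String diagnostics;
  }

  // Called when a resyncing NM registers and reports the containers it killed.
  void onNodeManagerRegistered(List<ReportedContainer> killedContainers) {
    for (ReportedContainer c : killedContainers) {
      if (isApplicationMasterContainer(c.containerId)) {
        // The AM is really gone now, so the attempt can fail and the app can
        // start a new attempt without waiting for the AM liveness expiry.
        sendContainerFailedToAttempt(c.containerId, c.diagnostics);
      }
    }
  }

  boolean isApplicationMasterContainer(String containerId) {
    return false;  // placeholder: look up the attempt's master container id
  }

  void sendContainerFailedToAttempt(String containerId, String diagnostics) {}
}
{code}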
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813115#comment-13813115 ] Omkar Vinit Joshi commented on YARN-1210: - cancelled the patch as it is based on YARN-674 During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch, YARN-1210.2.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813560#comment-13813560 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~vinodkv] for review... bq. Does this patch also include YARN-1210? Seems like it, we should separate that code. No .. anything specific? YARN-1210 is more about waiting for older AM to finish before launching a new AM. bq. Depending on the final patch, I think we should split RMAppManager.submitApp into two, one for regular submit and one for submit after recovery. Splitting the method into 2. * submitApplication - normal application submission * submitRecoveredApplication - submitting recovered application bq. RMAppState.java change is unnecessary. fixed bq. ForwardingEventHandler is a bottleneck for renewals now - especially during submission. We need to have a thread pool. Create fixed thread pool service with thread count controllable via configuration (Not adding this to yarn-default). Keeping default thread count to be 5. fair enough? bq. Once we do the above, the old concurrency test should be added back. yeah..added that test back.. bq. We are undoing most of YARN-1107. Good that we laid the groundwork there. Let's make sure we remove all the dead code. One comment stands out Anything did I miss here? didn't understand. The comment I have not removed as it is still valid. bq. The newly added test can have race conditions? We may be lucky in the test, but in real life scenario, client has to submit app and poll for app failure due to invalid tokens I think it will not. For clients yes after they submit the application they will have to keep polling to know the status of the application (got accepted or failed due to token renewal). bq. Similarly we should add a test for successful submission after renewal. sure added one.. checking for RMAppEvent.START Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
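Editor's note: the renewer thread-pool change discussed above could look roughly like this. The configuration key and the class name are made up for illustration (the comment explicitly says the knob is not added to yarn-default); only the Configuration and Executors calls are standard APIs.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;

class RenewerPoolSketch {
  // Hypothetical knob; intentionally not published in yarn-default.xml.
  static final String RENEWER_THREAD_COUNT =
      "yarn.resourcemanager.delegation-token-renewer.thread-count";
  static final int DEFAULT_RENEWER_THREAD_COUNT = 5;

  ExecutorService createRenewerPool(Configuration conf) {
    int threads = conf.getInt(RENEWER_THREAD_COUNT, DEFAULT_RENEWER_THREAD_COUNT);
    // Fixed-size pool so slow NameNode renewals queue up here instead of
    // blocking RPC handler threads during application submission.
    return Executors.newFixedThreadPool(threads);
  }
}
{code}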
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.5.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch, YARN-674.5.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.4.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, YARN-674.4.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13810927#comment-13810927 ] Omkar Vinit Joshi commented on YARN-1210: - submitting patch on top of YARN-674. During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1359) AMRMToken should not be sent to Container other than AM.
[ https://issues.apache.org/jira/browse/YARN-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809610#comment-13809610 ] Omkar Vinit Joshi commented on YARN-1359: - Today the node manager doesn't do this filtering of tokens. Proposal :- Let the node manager filter out the AMRMToken from the tokens while launching any container other than the AM. Thereby only the AM container is (truly) allowed to talk to the RM over the AMRM protocol (a sketch follows this message). Enhancements :- today the node manager doesn't know which container is the AM container, and there are a lot of problems because of this. So we first need a way to inform the node manager that a container is the AM. As the node manager today learns everything about a new container from the container token, it would be better to add an isAM flag inside the token. Thoughts? (Note: we are anyway not encouraging users to talk to the RM from multiple containers sharing the same AMRMToken.) AMRMToken should not be sent to Container other than AM. Key: YARN-1359 URL: https://issues.apache.org/jira/browse/YARN-1359 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi -- This message was sent by Atlassian JIRA (v6.1#6144)
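Editor's note: the filtering step proposed above could be sketched as below, assuming the NM already knows (e.g. via an isAM flag in the container token) whether the container is the AM. The Credentials, Token and AMRMTokenIdentifier usage reflects real Hadoop APIs; the surrounding method is illustrative, not the actual ContainerManager code.
{code}
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.yarn.security.AMRMTokenIdentifier;

class TokenFilterSketch {

  // Build the credentials actually handed to the launched container.
  Credentials filterForContainer(Credentials submitted, boolean isAMContainer) {
    if (isAMContainer) {
      return submitted;  // the AM legitimately needs the AMRMToken
    }
    Credentials filtered = new Credentials();
    for (Token<? extends TokenIdentifier> token : submitted.getAllTokens()) {
      if (AMRMTokenIdentifier.KIND_NAME.equals(token.getKind())) {
        continue;  // drop the AMRMToken for ordinary containers
      }
      filtered.addToken(token.getService(), token);
    }
    // Secret keys are untouched here; copy them too if the container needs them.
    return filtered;
  }
}
{code}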
[jira] [Created] (YARN-1377) Log aggregation via node manager should expose a way to cancel aggregation at application or container level
Omkar Vinit Joshi created YARN-1377: --- Summary: Log aggregation via node manager should expose a way to cancel aggregation at application or container level Key: YARN-1377 URL: https://issues.apache.org/jira/browse/YARN-1377 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Today when an application finishes it starts aggregating all the logs, but that may slow down the whole process significantly... there can be situations where certain containers overwrote the logs .. say in multiple GBs... in these scenarios we need a way to cancel log aggregation for certain containers. This can be at the per-application level or at the per-container level. Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1377) Log aggregation via node manager should expose a way to cancel aggregation at application or container level
[ https://issues.apache.org/jira/browse/YARN-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1377: Assignee: Xuan Gong Log aggregation via node manager should expose a way to cancel aggregation at application or container level --- Key: YARN-1377 URL: https://issues.apache.org/jira/browse/YARN-1377 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Xuan Gong Today when an application finishes it starts aggregating all the logs, but that may slow down the whole process significantly... there can be situations where certain containers overwrote the logs .. say in multiple GBs... in these scenarios we need a way to cancel log aggregation for certain containers. This can be at the per-application level or at the per-container level. Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1377) Log aggregation via node manager should expose a way to cancel log aggregation at application or container level
[ https://issues.apache.org/jira/browse/YARN-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1377: Summary: Log aggregation via node manager should expose a way to cancel log aggregation at application or container level (was: Log aggregation via node manager should expose a way to cancel aggregation at application or container level) Log aggregation via node manager should expose a way to cancel log aggregation at application or container level --- Key: YARN-1377 URL: https://issues.apache.org/jira/browse/YARN-1377 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Xuan Gong Today when an application finishes it starts aggregating all the logs, but that may slow down the whole process significantly... there can be situations where certain containers overwrote the logs .. say in multiple GBs... in these scenarios we need a way to cancel log aggregation for certain containers. This can be at the per-application level or at the per-container level. Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking
Omkar Vinit Joshi created YARN-1363: --- Summary: Get / Cancel / Renew delegation token api should be non blocking Key: YARN-1363 URL: https://issues.apache.org/jira/browse/YARN-1363 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Today GetDelegationToken, CancelDelegationToken and RenewDelegationToken are all blocking APIs. * As a part of these calls we try to update the RMStateStore, and that may slow them down. * Since we have a limited number of client request handlers, we may fill them up quickly. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking
[ https://issues.apache.org/jira/browse/YARN-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808620#comment-13808620 ] Omkar Vinit Joshi commented on YARN-1363: - Proposal :- * Retain the current (synchronous) behavior as the default. * Add a configuration knob to enable asynchronous behavior. ** If enabled, the Get / Renew / Cancel APIs work as they do today except that the updated DT is no longer saved to the RMStateStore synchronously. ** The client has to make an additional call to check the status of the operation, passing the DT and the OP [ Get/Renew/Cancel ]. *** ClientRMService remembers the status of the operation from the time the client requested it and the RMStateStore saved it until either the first client request arrives to check its status [ keyed by (token, op) ] or a timer (configurable, maybe 10 min) expires (a sketch follows this message). Get / Cancel / Renew delegation token api should be non blocking Key: YARN-1363 URL: https://issues.apache.org/jira/browse/YARN-1363 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Today GetDelegationToken, CancelDelegationToken and RenewDelegationToken are all blocking APIs. * As a part of these calls we try to update the RMStateStore, and that may slow them down. * Since we have a limited number of client request handlers, we may fill them up quickly. -- This message was sent by Atlassian JIRA (v6.1#6144)
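Editor's note: a sketch of the status-tracking piece of the proposal above. Everything here (the enum names, the composite key, the remove-on-first-poll behavior) is an assumption about a design that was only outlined, not implemented.
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class TokenOpStatusTrackerSketch {

  enum Op { GET, RENEW, CANCEL }
  enum Status { PENDING, SUCCEEDED, FAILED }

  private static class Entry {
    Status status;
  }

  private final ConcurrentMap<String, Entry> statuses = new ConcurrentHashMap<>();

  private String key(String tokenIdentifier, Op op) {
    return op + ":" + tokenIdentifier;
  }

  void markPending(String tokenIdentifier, Op op) {
    Entry e = new Entry();
    e.status = Status.PENDING;
    statuses.put(key(tokenIdentifier, op), e);
  }

  void markDone(String tokenIdentifier, Op op, boolean success) {
    Entry e = new Entry();
    e.status = success ? Status.SUCCEEDED : Status.FAILED;
    statuses.put(key(tokenIdentifier, op), e);
  }

  // The first client poll removes the entry, as proposed above; a timer elsewhere
  // would evict entries that are never polled (e.g. after ~10 minutes).
  Status checkAndForget(String tokenIdentifier, Op op) {
    Entry e = statuses.remove(key(tokenIdentifier, op));
    return e == null ? null : e.status;
  }
}
{code}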
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807046#comment-13807046 ] Omkar Vinit Joshi commented on YARN-674: bq. parseCredentials is not asynchronous, right? Therefore, is it better to fail the application submission immediately instead of forcing the client the come back to check the status? In fact, in submitApplication, there're already several points where the application submission can fail immediately, even though the application START is handled asynchronously. I am not sure about this. No problem but will add it. {code} Maybe you can do if (event instanceof DelegationTokenRenewerAppSubmitEvent) { ... } to avoid the findbug warning? {code} the whole point of adding ExceptionType is to avoid this. right [~zjshen]? I am wondering why at other places we are not getting similar casting exception. Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807503#comment-13807503 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~zjshen] and [~jianhe] for reviewing.. bq. DelegationTokenRenewer.applicationFinished() is anyways canceling the token on a separate thread, do we need to funnel through the dispatcher as well ? discussed with [~jianhe] offline.. We need this because user may just submit an application and kill it immediately after it. The earlier submitted application may still be in flight (in dispatcher queue) and we may try to process application finish event as a part of kill before it. To avoid this I am enqueuing it. bq. It would be to have an end-to-end test that application submitted with an Invalid token will be rejected and verify yarn client can get Failed application status using MockRM. Added the test Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.3.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1359) AMRMToken should not be sent to Container other than AM.
Omkar Vinit Joshi created YARN-1359: --- Summary: AMRMToken should not be sent to Container other than AM. Key: YARN-1359 URL: https://issues.apache.org/jira/browse/YARN-1359 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805697#comment-13805697 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~zjshen] for reviewing my patch bq. I think the exception needs to be thrown, which is missing in your patch. The exception will notice the client that the app submission fails; otherwise, the client will think the submission succeeds? Yes I have removed the error purposefully..here are the thoughts. * For client once he submits the application should check the app status and will come to know about the failing app from it. ** Either when parsing credentials fails. ** OR when initial token renewal fails. bq. Since DelegationTokenRenewer#addApplication becomes asynchronous, what will the impact of that the application is already accepted and starts its life cycle, while DelegationTokenRenewer is so slow to DelegationTokenRenewerAppSubmitEvent. Will the application fail somewhere else due to the fresh token unavailable? The logic here is modified a bit. If token renewal succeeds then only app is submitted to scheduler not before that. Today too it is the same case. Only problem is that we are holding client request while doing this. With the change this will become async. bq. I noticed testConncurrentAddApplication has been removed. Does the change affect the current app submission? No. Now there is no problem w.r.t. concurrent app submission as we are anyway funneling it through event handler. This test is no longer required so removed it completely. * Fixing findbug warnings... * fixing failed test case... Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.2.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch This was caused by YARN-280. A slow or a down NameNode for will make it look like RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1350) Should not add Lost Node by NodeManager reboot
[ https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805825#comment-13805825 ] Omkar Vinit Joshi commented on YARN-1350: - [~sinchii] I have a basic question: why is your nodeId changing every time? Have you configured your nodemanager with an ephemeral port (0)? What is NM_ADDRESS set to? The RM will consider this the same node only when the newly restarted node manager reports with the same node id, i.e. host-name:port. Should not add Lost Node by NodeManager reboot -- Key: YARN-1350 URL: https://issues.apache.org/jira/browse/YARN-1350 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Shinichi Yamashita Attachments: NodeState.txt In current trunk, when NodeManager reboots, the node information before the reboot is treated as LOST. This occurs to confirm only Inactive node information at the time of reboot. Therefore Lost Node will exist even if NodeManager works in all nodes. We should change it not to register Lost Node by the NodeManager reboot. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception
[ https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi reassigned YARN-1252: --- Assignee: Omkar Vinit Joshi Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception --- Key: YARN-1252 URL: https://issues.apache.org/jira/browse/YARN-1252 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.1-beta Reporter: Arpit Gupta Assignee: Omkar Vinit Joshi {code} 2013-09-26 08:15:20,507 INFO ipc.Server (Server.java:run(861)) - IPC Server Responder: starting 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException as:rm/host@realm (auth:KERBEROS) cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal request for unknown token at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception
[ https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805827#comment-13805827 ] Omkar Vinit Joshi commented on YARN-1252: - taking it over.. Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception --- Key: YARN-1252 URL: https://issues.apache.org/jira/browse/YARN-1252 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.1-beta Reporter: Arpit Gupta {code} 2013-09-26 08:15:20,507 INFO ipc.Server (Server.java:run(861)) - IPC Server Responder: starting 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException as:rm/host@realm (auth:KERBEROS) cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal request for unknown token at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception
[ https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805828#comment-13805828 ] Omkar Vinit Joshi commented on YARN-1252: - YARN-674 should solve this problem. Now as token renewal is asynchronous in nature so if the token in unknown or external system (token renewing system) is down then the application for which this token was submitted will be marked as failed without crashing RM. Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception --- Key: YARN-1252 URL: https://issues.apache.org/jira/browse/YARN-1252 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.1-beta Reporter: Arpit Gupta Assignee: Omkar Vinit Joshi {code} 2013-09-26 08:15:20,507 INFO ipc.Server (Server.java:run(861)) - IPC Server Responder: starting 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException as:rm/host@realm (auth:KERBEROS) cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal request for unknown token at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception
[ https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805830#comment-13805830 ] Omkar Vinit Joshi commented on YARN-1252: - [~vinodkv] [~jianhe] if you agree then we can close this. Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception --- Key: YARN-1252 URL: https://issues.apache.org/jira/browse/YARN-1252 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.1-beta Reporter: Arpit Gupta Assignee: Omkar Vinit Joshi {code} 2013-09-26 08:15:20,507 INFO ipc.Server (Server.java:run(861)) - IPC Server Responder: starting 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException as:rm/host@realm (auth:KERBEROS) cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal request for unknown token at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1350) Should not add Lost Node by NodeManager reboot
[ https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805867#comment-13805867 ] Omkar Vinit Joshi commented on YARN-1350: - That is mainly for a single-node cluster, to avoid port clashing. For a real cluster you should define a fixed port there. If you agree I will close this as invalid. Should not add Lost Node by NodeManager reboot -- Key: YARN-1350 URL: https://issues.apache.org/jira/browse/YARN-1350 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Shinichi Yamashita Attachments: NodeState.txt In current trunk, when NodeManager reboots, the node information before the reboot is treated as LOST. This occurs to confirm only Inactive node information at the time of reboot. Therefore Lost Node will exist even if NodeManager works in all nodes. We should change it not to register Lost Node by the NodeManager reboot. -- This message was sent by Atlassian JIRA (v6.1#6144)
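[Editor's note] A minimal sketch of what "define a fixed port" means for the NodeManager RPC address, assuming the standard yarn.nodemanager.address key; the port value 45454 is purely illustrative, not something mandated by the discussion above.
{code}
// Sketch: pin the NodeManager RPC port on a real cluster instead of using the
// ephemeral port 0 that is only intended for single-node test setups.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeManagerPortExample {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // yarn.nodemanager.address -- a fixed port keeps the node's identity stable
    // across NodeManager restarts; port 0 would pick a new ephemeral port each time.
    conf.set(YarnConfiguration.NM_ADDRESS, "0.0.0.0:45454");
    System.out.println(conf.get(YarnConfiguration.NM_ADDRESS));
  }
}
{code}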
[jira] [Resolved] (YARN-1350) Should not add Lost Node by NodeManager reboot
[ https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi resolved YARN-1350. - Resolution: Invalid Assignee: Omkar Vinit Joshi Should not add Lost Node by NodeManager reboot -- Key: YARN-1350 URL: https://issues.apache.org/jira/browse/YARN-1350 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Shinichi Yamashita Assignee: Omkar Vinit Joshi Attachments: NodeState.txt In current trunk, when NodeManager reboots, the node information before the reboot is treated as LOST. This occurs to confirm only Inactive node information at the time of reboot. Therefore Lost Node will exist even if NodeManager works in all nodes. We should change it not to register Lost Node by the NodeManager reboot. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805885#comment-13805885 ] Omkar Vinit Joshi commented on YARN-674:
* The recent test failure doesn't seem to be related to this code; the test passes locally. Should I open a separate ticket for it?
* I am not sure how to fix that findbugs warning. Should I add it to exclude-findbug.xml as well? I tried the following, and even Eclipse doesn't complain:
{code}
@Override
@SuppressWarnings("unchecked")
public void handle(DelegationTokenRenewerEvent event) {
  if (event.getType().equals(
      DelegationTokenRenewerEventType.VERIFY_AND_START_APPLICATION)) {
    DelegationTokenRenewerAppSubmitEvent appSubmitEvt =
        (DelegationTokenRenewerAppSubmitEvent) event;
    handleDTRenewerEvent(appSubmitEvt);
  } else if (event.getType().equals(
      DelegationTokenRenewerEventType.FINISH_APPLICATION)) {
    rmContext.getDelegationTokenRenewer().applicationFinished(event);
  }
}
{code}
Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch, YARN-674.2.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1350) Should not add Lost Node by NodeManager reboot
[ https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805908#comment-13805908 ] Omkar Vinit Joshi commented on YARN-1350: - You should check out MiniYARNCluster. Should not add Lost Node by NodeManager reboot -- Key: YARN-1350 URL: https://issues.apache.org/jira/browse/YARN-1350 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Shinichi Yamashita Assignee: Omkar Vinit Joshi Attachments: NodeState.txt In current trunk, when NodeManager reboots, the node information before the reboot is treated as LOST. This occurs to confirm only Inactive node information at the time of reboot. Therefore Lost Node will exist even if NodeManager works in all nodes. We should change it not to register Lost Node by the NodeManager reboot. -- This message was sent by Atlassian JIRA (v6.1#6144)
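[Editor's note] A minimal sketch of spinning up MiniYARNCluster from hadoop-yarn-server-tests, which is the class the comment points at for reproducing node-state behaviour in a test; the constructor arguments and the "lost-node-test" name are illustrative assumptions, not taken from this issue.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class MiniClusterExample {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // testName, number of NodeManagers, local dirs per NM, log dirs per NM
    MiniYARNCluster cluster = new MiniYARNCluster("lost-node-test", 1, 1, 1);
    cluster.init(conf);
    cluster.start();
    try {
      // exercise RM/NM behaviour here, e.g. restart an NM and inspect node state
    } finally {
      cluster.stop();
    }
  }
}
{code}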
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.1.patch Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804834#comment-13804834 ] Omkar Vinit Joshi commented on YARN-674: Updating the patch. Modifying the event flow:
* In the unsecured case nothing will change.
* In the secured case:
** For recovery:
*** ApplicationEvents will be enqueued while we are recovering. When the service starts they will get processed.
** For normal app submission:
*** The event will be enqueued into the separate DelegationTokenRenewer dispatcher queue and the client's request will be returned immediately.
*** If the token renewal is successful then the renewer will send the START/RECOVER event, else it will fail the app.
* Testing:
** Updated the unit test to cover the updated behavior.
** Manually tested it on a secured cluster:
*** Tested app submission with the default HDFS token and it works.
*** Tested the same while restarting the RM in between.
Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-674.1.patch This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
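[Editor's note] A simplified sketch of the asynchronous flow described in the comment above: the submission path enqueues an event and returns immediately, and a renewal failure fails only that application. The event-type names mirror those quoted from the patch earlier in this thread; the queue, method names, and class structure here are illustrative, not the actual patch code.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class DelegationTokenRenewerSketch {
  private final BlockingQueue<Runnable> renewerQueue = new LinkedBlockingQueue<>();

  // Called from the client-facing submission path: enqueue and return
  // immediately so a slow or down NameNode cannot block the RPC handlers.
  void verifyAndStartApplication(final String appId) {
    renewerQueue.add(new Runnable() {
      @Override
      public void run() {
        try {
          renewTokens(appId);           // may talk to a remote NameNode
          sendStartOrRecoverEvent(appId);
        } catch (Exception e) {
          failApplication(appId, e);    // the app fails; the RM itself stays up
        }
      }
    });
  }

  private void renewTokens(String appId) throws Exception { /* ... */ }
  private void sendStartOrRecoverEvent(String appId) { /* ... */ }
  private void failApplication(String appId, Exception cause) { /* ... */ }
}
{code}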
[jira] [Assigned] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi reassigned YARN-674: -- Assignee: Omkar Vinit Joshi (was: Vinod Kumar Vavilapalli) Slow or failing DelegationToken renewals on submission itself make RM unavailable - Key: YARN-674 URL: https://issues.apache.org/jira/browse/YARN-674 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi This was caused by YARN-280. A slow or down NameNode will make it look like the RM is unavailable as it may run out of RPC handlers due to blocked client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1303) Allow multiple commands separating with ; in distributed-shell
[ https://issues.apache.org/jira/browse/YARN-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13802146#comment-13802146 ] Omkar Vinit Joshi commented on YARN-1303: - {code} +For multiple shell scripts, combine them + +into one shell script); {code} please remove, I think it is intuitive. Similarly the exception code message:
{code}
+    if (shellCommand.contains(";") || shellCommand.contains("|")) {
+      throw new IllegalArgumentException(
+          "DistributedShell does not support multiple commands"
+          + " or command pipeline. Please create a shell script for"
+          + " them and use --shell_script option");
+    }
+    if (shellCommand.contains()) {
+      throw new IllegalArgumentException(
+          "Please create a shell script for redirected output"
+          + " and use --shell_script option");
+    }
+
{code}
This will not work on Windows. [link| http://superuser.com/questions/62850/execute-multiple-commands-with-1-line-in-windows-commandline] Think about it and make it not OS-specific. By the way, do we really need to parse the command and tell the user that they are indeed using multiple commands instead of the single allowed one? Now that we are putting this in the help message, I think these validation checks will complicate things. Thoughts? Allow multiple commands separating with ; in distributed-shell Key: YARN-1303 URL: https://issues.apache.org/jira/browse/YARN-1303 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1303.1.patch, YARN-1303.2.patch, YARN-1303.3.patch, YARN-1303.3.patch, YARN-1303.4.patch, YARN-1303.4.patch, YARN-1303.5.patch, YARN-1303.6.patch In shell, we can do ls; ls to run 2 commands at once. In distributed shell, this is not working. We should improve to allow this to occur. There are practical use cases that I know of to run multiple commands or to set environment variables before a command. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1314) Cannot pass more than 1 argument to shell command
[ https://issues.apache.org/jira/browse/YARN-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13802223#comment-13802223 ] Omkar Vinit Joshi commented on YARN-1314: - Thanks [~xgong], the latest patch looks good to me. Can you please comment on how you tested this manually? Cannot pass more than 1 argument to shell command - Key: YARN-1314 URL: https://issues.apache.org/jira/browse/YARN-1314 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.2.1 Attachments: YARN-1314.1.patch, YARN-1314.1.patch, YARN-1314.2.patch Distributed shell cannot accept more than 1 parameter in argument parts. All of these commands are treated as 1 parameter: /usr/bin/yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar distributed shell jar -shell_command echo -shell_args 'My name is Teddy' /usr/bin/yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar distributed shell jar -shell_command echo -shell_args ''My name' 'is Teddy'' /usr/bin/yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar distributed shell jar -shell_command echo -shell_args 'My name' 'is Teddy' -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1321) NMTokenCache is a singleton, preventing multiple AMs running in a single JVM from working correctly
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13801095#comment-13801095 ] Omkar Vinit Joshi commented on YARN-1321: - bq. Change containsNMToken to containsToken and removeNMToken to removeToken for consistency with getToken and setToken? Also, should setToken not be putToken? Can you please make it consistent across all APIs as xxxNMToken()? Can you please add a test case for the multi-AM use case? NMTokenCache is a singleton, preventing multiple AMs running in a single JVM from working correctly -- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 Attachments: YARN-1321.patch, YARN-1321.patch, YARN-1321.patch NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatched tokens. The error observed on the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
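[Editor's note] A hedged sketch of the per-instance direction this patch takes. It assumes the post-patch shape of the client API (a public NMTokenCache constructor plus setNMTokenCache on AMRMClient and NMClient); those names are assumptions drawn from the discussion, not verified against the released signatures.
{code}
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.client.api.NMTokenCache;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PerAmTokenCacheExample {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // One cache per AM instance instead of the process-wide singleton, so two
    // AMs in the same JVM no longer overwrite each other's NMTokens.
    NMTokenCache tokenCacheForAm1 = new NMTokenCache();

    AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
    amrmClient.setNMTokenCache(tokenCacheForAm1);   // assumed post-patch API
    amrmClient.init(conf);

    NMClient nmClient = NMClient.createNMClient();
    nmClient.setNMTokenCache(tokenCacheForAm1);     // same cache for this AM's NMClient
    nmClient.init(conf);
  }
}
{code}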
[jira] [Updated] (YARN-895) RM crashes if it restarts while NameNode is in safe mode
[ https://issues.apache.org/jira/browse/YARN-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-895: --- Summary: RM crashes if it restarts while NameNode is in safe mode (was: If NameNode is in safemode when RM restarts, RM should wait instead of crashing.) RM crashes if it restarts while NameNode is in safe mode Key: YARN-895 URL: https://issues.apache.org/jira/browse/YARN-895 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-895.1.patch, YARN-895.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-895) RM crashes if it restarts while NameNode is in safe mode
[ https://issues.apache.org/jira/browse/YARN-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-895: --- Description: (was: Today if RM restarts while name node is in safe mode then RM crashes. During safe mode modifications are not allowed ) RM crashes if it restarts while NameNode is in safe mode Key: YARN-895 URL: https://issues.apache.org/jira/browse/YARN-895 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-895.1.patch, YARN-895.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-895) RM crashes if it restarts while NameNode is in safe mode
[ https://issues.apache.org/jira/browse/YARN-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-895: --- Description: Today if RM restarts while name node is in safe mode then RM crashes. During safe mode modifications are not allowed RM crashes if it restarts while NameNode is in safe mode Key: YARN-895 URL: https://issues.apache.org/jira/browse/YARN-895 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-895.1.patch, YARN-895.patch Today if RM restarts while name node is in safe mode then RM crashes. During safe mode modifications are not allowed -- This message was sent by Atlassian JIRA (v6.1#6144)
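[Editor's note] A minimal sketch of the "wait instead of crashing" behaviour this issue asks for: a bounded retry loop around an HDFS mutation while the NameNode is still in safe mode. This is a generic illustration, not the actual patch; the path, retry count, and sleep interval are illustrative assumptions.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SafeModeRetryExample {
  // Retry an HDFS mutation until the NameNode leaves safe mode, with a cap.
  static void mkdirsWithRetry(FileSystem fs, Path dir) throws Exception {
    final int maxAttempts = 30;
    for (int attempt = 1; ; attempt++) {
      try {
        fs.mkdirs(dir);
        return;
      } catch (IOException e) {
        if (attempt >= maxAttempts) {
          throw e;               // give up eventually instead of looping forever
        }
        Thread.sleep(10_000L);   // wait and retry while safe mode clears
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    mkdirsWithRetry(fs, new Path("/rmstore"));
  }
}
{code}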
[jira] [Assigned] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi reassigned YARN-1121: --- Assignee: Omkar Vinit Joshi RMStateStore should flush all pending store events before closing - Key: YARN-1121 URL: https://issues.apache.org/jira/browse/YARN-1121 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Fix For: 2.2.1 on serviceStop it should wait for all internal pending events to drain before stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13801409#comment-13801409 ] Omkar Vinit Joshi commented on YARN-1121: - [~bikassaha] we have a single dispatcher queue, so should we ignore other events when the RM is going down and selectively process only the RM state store write events? In any case we will have a very short time before we actually get kill -9. RMStateStore should flush all pending store events before closing - Key: YARN-1121 URL: https://issues.apache.org/jira/browse/YARN-1121 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Omkar Vinit Joshi Fix For: 2.2.1 on serviceStop it should wait for all internal pending events to drain before stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
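[Editor's note] A minimal sketch of the drain-on-stop behaviour the issue description asks for: stop accepting new store events, then flush whatever is already queued before shutting down. The queue and method names here are illustrative, not the actual RMStateStore internals.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class DrainOnStopSketch {
  private final BlockingQueue<Runnable> pendingStoreEvents = new LinkedBlockingQueue<>();
  private volatile boolean stopped = false;

  // serviceStop-style shutdown: reject new events, then process what is already
  // queued so no state-store write is lost on a clean shutdown.
  void serviceStop() {
    stopped = true;
    Runnable event;
    while ((event = pendingStoreEvents.poll()) != null) {
      event.run();   // process remaining state-store writes synchronously
    }
  }

  void enqueue(Runnable storeEvent) {
    if (!stopped) {
      pendingStoreEvents.add(storeEvent);
    }
  }
}
{code}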
[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
[ https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1053: Priority: Blocker (was: Major) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl -- Key: YARN-1053 URL: https://issues.apache.org/jira/browse/YARN-1053 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Blocker Labels: newbie Fix For: 2.3.0, 2.2.1 Attachments: YARN-1053.20130809.patch If the container launch fails then we send ContainerExitEvent. This event contains exitCode and diagnostic message. Today we are ignoring diagnostic message while handling this event inside ContainerImpl. Fixing it as it is useful in diagnosing the failure. -- This message was sent by Atlassian JIRA (v6.1#6144)
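[Editor's note] A self-contained sketch of what "don't drop the diagnostic message" could look like when handling a container exit event. The getDiagnosticInfo()/onContainerExited() names and the local event class are assumptions for illustration only; they are not the actual NodeManager classes or the patch attached here.
{code}
// Illustrative sketch only: accessor names are assumed, not verified.
class ContainerExitHandlingSketch {
  static class ContainerExitEvent {
    private final int exitCode;
    private final String diagnosticInfo;
    ContainerExitEvent(int exitCode, String diagnosticInfo) {
      this.exitCode = exitCode;
      this.diagnosticInfo = diagnosticInfo;
    }
    int getExitCode() { return exitCode; }
    String getDiagnosticInfo() { return diagnosticInfo; }
  }

  private final StringBuilder diagnostics = new StringBuilder();

  // On a failed launch, keep the launcher's message instead of dropping it.
  void onContainerExited(ContainerExitEvent event) {
    if (event.getDiagnosticInfo() != null) {
      diagnostics.append(event.getDiagnosticInfo()).append('\n');
    }
    diagnostics.append("Container exited with code ").append(event.getExitCode());
  }
}
{code}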
[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1185: Attachment: YARN-1185.3.patch FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1185.1.patch, YARN-1185.2.patch, YARN-1185.3.patch FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message was sent by Atlassian JIRA (v6.1#6144)
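[Editor's note] A minimal sketch of the write-to-temp-then-rename pattern the description calls for, using the standard FileSystem API; the /rmstore path and payload are illustrative, and this is not the attached patch itself.
{code}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicStoreWriteExample {
  // Write to <file>.tmp first, then rename over the destination, so a crash
  // mid-write never leaves a partially written state file behind.
  static void writeFileAtomically(FileSystem fs, Path dest, byte[] data) throws Exception {
    Path tmp = new Path(dest.getParent(), dest.getName() + ".tmp");
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      out.write(data);
    }
    fs.delete(dest, false);      // remove any stale destination first
    if (!fs.rename(tmp, dest)) {
      throw new java.io.IOException("Failed to rename " + tmp + " to " + dest);
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    writeFileAtomically(fs, new Path("/rmstore/app_0001"),
        "serialized-app-state".getBytes(StandardCharsets.UTF_8));
  }
}
{code}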
[jira] [Commented] (YARN-1321) NMTokenCache should not be a singleton
[ https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799736#comment-13799736 ] Omkar Vinit Joshi commented on YARN-1321: - Why are you in fact running multiple AMs inside the same JVM? As per YARN we can never have multiple AMs per JVM / per process. Definitely not a blocker. Please explain the use case for running multiple AMs inside the same process. If you really want to run it that way, why not just update NMTokenCache but default to the single-AM case? Still, I don't see why you are doing this. NMTokenCache should not be a singleton -- Key: YARN-1321 URL: https://issues.apache.org/jira/browse/YARN-1321 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Fix For: 2.2.1 NMTokenCache is a singleton. Because of this, if running multiple AMs in a single JVM NMTokens for the same node from different AMs step on each other and starting containers fail due to mismatched tokens. The error observed on the client side is something like: {code} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. NMToken for application attempt : appattempt_1382038445650_0002_01 was used for starting container with container token issued for application attempt : appattempt_1382038445650_0001_01 {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798485#comment-13798485 ] Omkar Vinit Joshi commented on YARN-1210: - taking it over. During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Jian He When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Assigned] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi reassigned YARN-1210: --- Assignee: Omkar Vinit Joshi (was: Jian He) During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.1.patch During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798620#comment-13798620 ] Omkar Vinit Joshi commented on YARN-1210: - Summarizing the current patch:
* After RMAppAttempts are recovered, all of the attempts are moved into the LAUNCHED state. After YARN-891 we will know the state of the earlier finished application attempts, so based on that we can decide where the current app attempt should transition to. On the RECOVER event:
** It will move to the LAUNCHED state if it was the last running app attempt.
** It will move to FAILED / KILLED / ...other terminal application attempt states otherwise.
* When the NM RESYNCs, containers will be killed and then the NM will re-register with the RM, passing the already running containers. On the RM side, if any of the containers turns out to be an earlier AM container then we will fail that app attempt and immediately start a new app attempt. However, if we don't get the AM's finished containerId during a future NM register, then after some time the AMLivelinessMonitor will expire and will fail the running app attempt and start a new one.
During RM restart, RM should start a new attempt only when previous attempt exits for real -- Key: YARN-1210 URL: https://issues.apache.org/jira/browse/YARN-1210 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Omkar Vinit Joshi Attachments: YARN-1210.1.patch When RM recovers, it can wait for existing AMs to contact RM back and then kill them forcefully before even starting a new AM. Worst case, RM will start a new AppAttempt after waiting for 10 mins ( the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help issues with downstream components like Pig, Hive and Oozie during RM restart. In the mean while, new apps will proceed as usual as existing apps wait for recovery. This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
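[Editor's note] An illustrative pseudo-Java sketch of the recovery decision summarized in the comment above. The enum and method names are simplifications for readability; the real logic lives in the RMAppAttemptImpl transitions and is not reproduced from the attached patch.
{code}
// Illustrative sketch only; names are simplified, not the actual RM classes.
class AttemptRecoverySketch {
  enum AttemptState { LAUNCHED, FAILED, KILLED, FINISHED }

  // On RECOVER: the last running attempt resumes as LAUNCHED and waits for the
  // AM to sync back (or for the liveliness monitor to expire it); earlier
  // attempts go straight to the terminal state that was stored for them.
  AttemptState stateOnRecover(boolean wasLastRunningAttempt, AttemptState storedFinalState) {
    return wasLastRunningAttempt ? AttemptState.LAUNCHED : storedFinalState;
  }

  // On NM re-registration: if the old AM container is reported as finished,
  // fail the recovered attempt immediately so a new attempt can start.
  boolean shouldFailAttemptNow(boolean amContainerReportedFinished) {
    return amContainerReportedFinished;
  }
}
{code}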