[jira] [Commented] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.

2013-12-01 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836116#comment-13836116
 ] 

Omkar Vinit Joshi commented on YARN-1417:
-

Cool... will update the patch with tests.

 RM may issue expired container tokens to AM while issuing new containers.
 -

 Key: YARN-1417
 URL: https://issues.apache.org/jira/browse/YARN-1417
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Blocker
 Attachments: YARN-1417.2.patch


 Today we create a new container token when we create a container in the RM as 
 part of the scheduling cycle. However, that container may get reserved or 
 assigned. If the container gets reserved and stays in the reserved state for 
 longer than the container token expiry interval, the RM will end up issuing a 
 container with an expired token.
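A minimal sketch of the idea (the method and field names here are illustrative assumptions, not the actual patch): generate the container token only when the AM actually pulls the container, so a long-lived reservation cannot outlive the token.
{code}
// Hypothetical sketch: create the token at pull time instead of at
// container-creation time, so reservations cannot outlive the token lifetime.
public Container pullNewlyAllocatedContainer(RMContainer rmContainer) {
  Container container = rmContainer.getContainer();
  // Token is generated here, just before the container is handed to the AM.
  container.setContainerToken(containerTokenSecretManager.createContainerToken(
      container.getId(), container.getNodeId(), getUser(),
      container.getResource()));
  return container;
}
{code}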



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines

2013-11-21 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829400#comment-13829400
 ] 

Omkar Vinit Joshi commented on YARN-1430:
-

I think for now we should add assert statements so that, in a test environment, it 
will always fail, making sure we are not missing any invalid transitions. 
YARN-1416 is one such example.

I agree with [~vinodkv] and [~jlowe]. We should probably be consistent 
everywhere and surface these system-critical errors somewhere without 
actually crashing the daemons.

 InvalidStateTransition exceptions are ignored in state machines
 ---

 Key: YARN-1430
 URL: https://issues.apache.org/jira/browse/YARN-1430
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 All of our state machines ignore InvalidStateTransition exceptions. These 
 exceptions get logged but do not crash the RM / NM. We should definitely crash, 
 as such transitions can move the system into an invalid / unacceptable state.
 * Places where we hide this exception:
 ** JobImpl
 ** TaskAttemptImpl
 ** TaskImpl
 ** NMClientAsyncImpl
 ** ApplicationImpl
 ** ContainerImpl
 ** LocalizedResource
 ** RMAppAttemptImpl
 ** RMAppImpl
 ** RMContainerImpl
 ** RMNodeImpl
 thoughts?



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1436) ZKRMStateStore should have separate configuration for retry period.

2013-11-21 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1436:
---

 Summary: ZKRMStateStore should have separate configuration for 
retry period.
 Key: YARN-1436
 URL: https://issues.apache.org/jira/browse/YARN-1436
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Jian He


Problem: Today we have a single zkSessionTimeout period that is used both for 
the ZooKeeper session timeout and for the ZKRMStateStore retry policy.

Proposed suggestion: Ideally we should have separate configuration knobs for 
these.
The ideal value for zkSessionTimeout should be: number of ZooKeeper instances 
participating in the quorum * per-ZooKeeper session timeout. See
{code}
org.apache.zookeeper.ClientCnxn.ClientCnxn()..
connectTimeout = sessionTimeout / hostProvider.size();
{code}
The retry policy should get its own knob (either a retry time period or a retry 
count); a minimal configuration sketch follows.
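A minimal sketch of what separate knobs could look like (the key names and defaults here are illustrative assumptions, not the final configuration):
{code}
// Hypothetical knob names for illustration; the real keys would live in
// YarnConfiguration.
Configuration conf = new YarnConfiguration();
// Session timeout: roughly quorum size * per-host connect timeout, since
// ClientCnxn divides the session timeout by hostProvider.size().
int zkSessionTimeoutMs = conf.getInt("yarn.resourcemanager.zk-timeout-ms", 10000);
// Retry behaviour gets its own knobs, no longer tied to the session timeout.
int zkRetryIntervalMs = conf.getInt("yarn.resourcemanager.zk-retry-interval-ms", 1000);
int zkNumRetries = conf.getInt("yarn.resourcemanager.zk-num-retries", 500);
{code}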



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1436) ZKRMStateStore should have separate configuration for retry period.

2013-11-21 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1436:


Component/s: resourcemanager

 ZKRMStateStore should have separate configuration for retry period.
 ---

 Key: YARN-1436
 URL: https://issues.apache.org/jira/browse/YARN-1436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.1
Reporter: Omkar Vinit Joshi
Assignee: Jian He

 Problem: Today we have a single zkSessionTimeout period that is used both for 
 the ZooKeeper session timeout and for the ZKRMStateStore retry policy. 
 Proposed suggestion: Ideally we should have separate configuration knobs for 
 these. 
 The ideal value for zkSessionTimeout should be: number of ZooKeeper instances 
 participating in the quorum * per-ZooKeeper session timeout. See
 {code}
 org.apache.zookeeper.ClientCnxn.ClientCnxn()..
 connectTimeout = sessionTimeout / hostProvider.size();
 {code}
 The retry policy should get its own knob (either a retry time period or a 
 retry count).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1436) ZKRMStateStore should have separate configuration for retry period.

2013-11-21 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1436:


Affects Version/s: 2.2.1

 ZKRMStateStore should have separate configuration for retry period.
 ---

 Key: YARN-1436
 URL: https://issues.apache.org/jira/browse/YARN-1436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.1
Reporter: Omkar Vinit Joshi
Assignee: Jian He

 Problem: Today we have a single zkSessionTimeout period that is used both for 
 the ZooKeeper session timeout and for the ZKRMStateStore retry policy. 
 Proposed suggestion: Ideally we should have separate configuration knobs for 
 these. 
 The ideal value for zkSessionTimeout should be: number of ZooKeeper instances 
 participating in the quorum * per-ZooKeeper session timeout. See
 {code}
 org.apache.zookeeper.ClientCnxn.ClientCnxn()..
 connectTimeout = sessionTimeout / hostProvider.size();
 {code}
 The retry policy should get its own knob (either a retry time period or a 
 retry count).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1425) TestRMRestart is failing on trunk

2013-11-20 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828041#comment-13828041
 ] 

Omkar Vinit Joshi commented on YARN-1425:
-

Yes, the YARN tests are passing locally.

 TestRMRestart is failing on trunk
 -

 Key: YARN-1425
 URL: https://issues.apache.org/jira/browse/YARN-1425
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1425.1.patch, error.log


 TestRMRestart is failing on trunk. Fixing it. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl

2013-11-20 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1053:


Attachment: YARN-1053.1.patch

Thanks [~bikassaha]..
Added a null check and also updated the test case, which verifies both the 
diagnostic message and the exitCode.

 Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
 --

 Key: YARN-1053
 URL: https://issues.apache.org/jira/browse/YARN-1053
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.0, 2.2.1
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Blocker
  Labels: newbie
 Fix For: 2.3.0, 2.2.1

 Attachments: YARN-1053.1.patch, YARN-1053.20130809.patch


 If the container launch fails, we send a ContainerExitEvent. This event 
 contains an exitCode and a diagnostic message. Today we ignore the diagnostic 
 message while handling this event inside ContainerImpl. Fixing it, as the 
 message is useful in diagnosing the failure.
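A minimal sketch of what the fix looks like inside a ContainerImpl transition (this assumes ContainerExitEvent exposes getDiagnosticInfo(); field names are illustrative):
{code}
// Sketch only: keep the diagnostic message instead of dropping it.
ContainerExitEvent exitEvent = (ContainerExitEvent) event;
container.exitCode = exitEvent.getExitCode();
if (exitEvent.getDiagnosticInfo() != null) {     // null check per the patch
  container.diagnostics.append(exitEvent.getDiagnosticInfo()).append("\n");
}
{code}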



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable

2013-11-20 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828155#comment-13828155
 ] 

Omkar Vinit Joshi commented on YARN-713:


Also fixed YARN-1417 as a part of this. It is straightforward.

 ResourceManager can exit unexpectedly if DNS is unavailable
 ---

 Key: YARN-713
 URL: https://issues.apache.org/jira/browse/YARN-713
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
Priority: Critical
 Fix For: 2.3.0

 Attachments: YARN-713.09052013.1.patch, YARN-713.09062013.1.patch, 
 YARN-713.1.patch, YARN-713.2.patch, YARN-713.20130910.1.patch, 
 YARN-713.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch


 As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could 
 lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and 
 that ultimately would cause the RM to exit.  The RM should not exit during 
 DNS hiccups.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1416) InvalidStateTransitions getting reported in multiple test cases even though they pass

2013-11-20 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828287#comment-13828287
 ] 

Omkar Vinit Joshi commented on YARN-1416:
-

Thanks [~jianhe]

I have a basic question.. the RM should have crashed, right? We can't just ignore 
such invalid state transitions, should we? I see that someone has modified it to 
log the exception but then ignore it inside RMAppImpl.java.
{code}
  try {
/* keep the master in sync with the state machine */
this.stateMachine.doTransition(event.getType(), event);
  } catch (InvalidStateTransitonException e) {
LOG.error("Can't handle this event at current state", e);
/* TODO fail the application on the failed transition */
  }
{code}
I see that in other places too we are ignoring this after logging it. I am not sure 
this is right, because we may silently move the system into a corrupted state 
without crashing/stopping it. At least we should add assert statements to all the 
state machines to make sure that such transitions don't go unnoticed; a minimal 
sketch follows.
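An illustrative sketch only (not the actual patch): with asserts enabled in test runs (-ea), an invalid transition fails the test instead of being silently logged and ignored.
{code}
try {
  this.stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
  LOG.error("Can't handle this event at current state", e);
  // Fails loudly under test (-ea), stays a log-only error in production.
  assert false : "Invalid transition: " + event.getType() + " at " + getState();
}
{code}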

I applied the patch and tested locally.. one more test needs to be fixed:
{code}
2013-11-20 15:23:52,127 INFO  [AsyncDispatcher event handler] 
attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(645)) - 
appattempt_1384989831257_0042_01 State change from NEW to SUBMITTED
2013-11-20 15:23:52,129 ERROR [AsyncDispatcher event handler] rmapp.RMAppImpl 
(RMAppImpl.java:handle(593)) - Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
APP_ACCEPTED at RUNNING
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:591)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:77)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions$TestApplicationEventDispatcher.handle(TestRMAppTransitions.java:139)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions$TestApplicationEventDispatcher.handle(TestRMAppTransitions.java:125)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:159)
at 
org.apache.hadoop.yarn.event.DrainDispatcher$1.run(DrainDispatcher.java:65)
at java.lang.Thread.run(Thread.java:680)
{code}



 InvalidStateTransitions getting reported in multiple test cases even though 
 they pass
 -

 Key: YARN-1416
 URL: https://issues.apache.org/jira/browse/YARN-1416
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Jian He
 Attachments: YARN-1416.1.patch, YARN-1416.1.patch


 It might be worth checking why they are reporting this.
 Test cases: TestRMAppTransitions, TestRM
 There are a large number of such errors:
 can't handle RMAppEventType.APP_UPDATE_SAVED at RMAppState.FAILED



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines

2013-11-20 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1430:
---

 Summary: InvalidStateTransition exceptions are ignored in state 
machines
 Key: YARN-1430
 URL: https://issues.apache.org/jira/browse/YARN-1430
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi


All of our state machines ignore InvalidStateTransition exceptions. These 
exceptions get logged but do not crash the RM / NM. We should definitely crash, 
as such transitions can move the system into an invalid / unacceptable state.

* Places where we hide this exception:
** JobImpl
** TaskAttemptImpl
** TaskImpl
** NMClientAsyncImpl
** ApplicationImpl
** ContainerImpl
** LocalizedResource
** RMAppAttemptImpl
** RMAppImpl
** RMContainerImpl
** RMNodeImpl

thoughts?



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.

2013-11-20 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1417:


Attachment: YARN-1417.2.patch

 RM may issue expired container tokens to AM while issuing new containers.
 -

 Key: YARN-1417
 URL: https://issues.apache.org/jira/browse/YARN-1417
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1417.2.patch


 Today we create a new container token when we create a container in the RM as 
 part of the scheduling cycle. However, that container may get reserved or 
 assigned. If the container gets reserved and stays in the reserved state for 
 longer than the container token expiry interval, the RM will end up issuing a 
 container with an expired token.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.

2013-11-20 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828435#comment-13828435
 ] 

Omkar Vinit Joshi commented on YARN-1417:
-

Updating a basic patch here, without test cases. If this approach looks OK 
then we can easily fix YARN-713.

 RM may issue expired container tokens to AM while issuing new containers.
 -

 Key: YARN-1417
 URL: https://issues.apache.org/jira/browse/YARN-1417
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1417.2.patch


 Today we create a new container token when we create a container in the RM as 
 part of the scheduling cycle. However, that container may get reserved or 
 assigned. If the container gets reserved and stays in the reserved state for 
 longer than the container token expiry interval, the RM will end up issuing a 
 container with an expired token.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1431) TestWebAppProxyServlet is failing on trunk

2013-11-20 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1431:
---

 Summary: TestWebAppProxyServlet is failing on trunk
 Key: YARN-1431
 URL: https://issues.apache.org/jira/browse/YARN-1431
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Omkar Vinit Joshi
Priority: Blocker


Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.609 sec <<< 
FAILURE! - in org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet
testWebAppProxyServerMainMethod(org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet)
  Time elapsed: 5.006 sec  <<< ERROR!
java.lang.Exception: test timed out after 5000 milliseconds
at java.net.Inet4AddressImpl.getHostByAddr(Native Method)
at java.net.InetAddress$1.getHostByAddr(InetAddress.java:881)
at java.net.InetAddress.getHostFromNameService(InetAddress.java:560)
at java.net.InetAddress.getCanonicalHostName(InetAddress.java:531)
at 
org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:227)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:247)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer.doSecureLogin(WebAppProxyServer.java:72)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer.serviceInit(WebAppProxyServer.java:57)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer.startServer(WebAppProxyServer.java:99)
at 
org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet.testWebAppProxyServerMainMethod(TestWebAppProxyServlet.java:187)



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1431) TestWebAppProxyServlet is failing on trunk

2013-11-20 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828444#comment-13828444
 ] 

Omkar Vinit Joshi commented on YARN-1431:
-

Extract from the surefire logs:
{code}
2013-11-20 18:59:47,514 INFO  [Thread-4] mortbay.log (Slf4jLog.java:info(67)) - 
Extract 
jar:file:/Users/ojoshi/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/hadoop-yarn-common-3.0.0-SNAPSHOT.jar!/webapps/proxy
 to 
/var/folders/h8/dlw3rlfx0_b5zjrw7kn1752mgn/T/Jetty_localhost_57922_proxyrtadom/webapp
2013-11-20 18:59:47,678 INFO  [Thread-4] mortbay.log (Slf4jLog.java:info(67)) - 
Started SelectChannelConnector@localhost:57922
Proxy server is started at port 57922   
2013-11-20 18:59:47,797 ERROR [1023736867@qtp-568432173-0] mortbay.log 
(Slf4jLog.java:warn(87)) - /proxy/app
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Error parsing 
application ID: app
at org.apache.hadoop.yarn.util.Apps.throwParseException(Apps.java:69)   
at org.apache.hadoop.yarn.util.Apps.toAppID(Apps.java:54)   
at org.apache.hadoop.yarn.util.Apps.toAppID(Apps.java:49)   
at 
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:252)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) 
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) 
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1220)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) 
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326) 
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)  
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) 
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) 
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2013-11-20 18:59:47,854 INFO  [1023736867@qtp-568432173-0] 
webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(322)) - dr.who is 
accessing unchecked http://localhost:57919/foo/bar/ which is the app master GUI 
of application_00_0 owned by dr.who
2013-11-20 18:59:47,925 WARN  [1023736867@qtp-568432173-0] 
webproxy.WebAppProxyServlet (WebAppProxyServlet.java:doGet(278)) - dr.who 
Attempting to access application_0_ that was not found
{code}

 TestWebAppProxyServlet is failing on trunk
 --

 Key: YARN-1431
 URL: https://issues.apache.org/jira/browse/YARN-1431
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Omkar Vinit Joshi
Priority: Blocker

 Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.609 sec <<< 
 FAILURE! - in org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet
 testWebAppProxyServerMainMethod(org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServlet)
   Time elapsed: 5.006 sec  <<< ERROR!
 java.lang.Exception: test timed out after 5000 milliseconds
   at java.net.Inet4AddressImpl.getHostByAddr(Native Method)
   at java.net.InetAddress$1.getHostByAddr(InetAddress.java:881)
   at java.net.InetAddress.getHostFromNameService(InetAddress.java:560)
   at 

[jira] [Commented] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826974#comment-13826974
 ] 

Omkar Vinit Joshi commented on YARN-744:


Thanks [~bikassaha], addressed your comments. Attaching a new patch.

 Race condition in ApplicationMasterService.allocate .. It might process same 
 allocate request twice resulting in additional containers getting allocated.
 -

 Key: YARN-744
 URL: https://issues.apache.org/jira/browse/YARN-744
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi
Priority: Minor
 Attachments: MAPREDUCE-3899-branch-0.23.patch, 
 YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, 
 YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.2.patch, YARN-744.patch


 Looks like the lock taken in this is broken. It takes a lock on lastResponse 
 object and then puts a new lastResponse object into the map. At this point a 
 new thread entering this function will get a new lastResponse object and will 
 be able to take its lock and enter the critical section. Presumably we want 
 to limit one response per app attempt. So the lock could be taken on the 
 ApplicationAttemptId key of the response map object.
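A minimal sketch of the locking suggested in the description (names such as AllocateResponseLock and doAllocate are assumptions, and the allocate signature is simplified; this is not the committed patch): synchronize on a per-attempt lock object that is never replaced, instead of on the lastResponse instance that gets swapped out of the map.
{code}
private final ConcurrentMap<ApplicationAttemptId, AllocateResponseLock> responseMap =
    new ConcurrentHashMap<ApplicationAttemptId, AllocateResponseLock>();

private static class AllocateResponseLock {
  private AllocateResponse lastResponse;
  synchronized AllocateResponse get() { return lastResponse; }
  synchronized void set(AllocateResponse response) { lastResponse = response; }
}

public AllocateResponse allocate(ApplicationAttemptId attemptId,
    AllocateRequest request) {
  // The lock entry is registered once per attempt and never replaced, so every
  // caller for the same attempt synchronizes on the same object.
  AllocateResponseLock lock = responseMap.get(attemptId);
  synchronized (lock) {
    AllocateResponse lastResponse = lock.get();
    // duplicate-request / responseId checks and scheduling happen here
    AllocateResponse newResponse = doAllocate(attemptId, request, lastResponse);
    lock.set(newResponse);
    return newResponse;
  }
}
{code}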



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-744:
---

Attachment: YARN-744.2.patch

 Race condition in ApplicationMasterService.allocate .. It might process same 
 allocate request twice resulting in additional containers getting allocated.
 -

 Key: YARN-744
 URL: https://issues.apache.org/jira/browse/YARN-744
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi
Priority: Minor
 Attachments: MAPREDUCE-3899-branch-0.23.patch, 
 YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, 
 YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.2.patch, YARN-744.patch


 Looks like the lock taken in this is broken. It takes a lock on lastResponse 
 object and then puts a new lastResponse object into the map. At this point a 
 new thread entering this function will get a new lastResponse object and will 
 be able to take its lock and enter the critical section. Presumably we want 
 to limit one response per app attempt. So the lock could be taken on the 
 ApplicationAttemptId key of the response map object.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1053:


Affects Version/s: 2.2.1
   2.2.0

 Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
 --

 Key: YARN-1053
 URL: https://issues.apache.org/jira/browse/YARN-1053
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.2.0, 2.2.1
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Blocker
  Labels: newbie
 Fix For: 2.3.0, 2.2.1

 Attachments: YARN-1053.20130809.patch


 If the container launch fails, we send a ContainerExitEvent. This event 
 contains an exitCode and a diagnostic message. Today we ignore the diagnostic 
 message while handling this event inside ContainerImpl. Fixing it, as the 
 message is useful in diagnosing the failure.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1425) TestRMRestart is failing on trunk

2013-11-19 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1425:
---

 Summary: TestRMRestart is failing on trunk
 Key: YARN-1425
 URL: https://issues.apache.org/jira/browse/YARN-1425
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi


TestRMRestart is failing on trunk. Fixing it. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1425) TestRMRestart is failing on trunk

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1425:


Attachment: error.log

[issue was 
seen|https://builds.apache.org/job/PreCommit-YARN-Build/2486//testReport/org.apache.hadoop.yarn.server.resourcemanager/TestRMRestart/testRMRestartWaitForPreviousAMToFinish/]

 TestRMRestart is failing on trunk
 -

 Key: YARN-1425
 URL: https://issues.apache.org/jira/browse/YARN-1425
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: error.log


 TestRMRestart is failing on trunk. Fixing it. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827011#comment-13827011
 ] 

Omkar Vinit Joshi commented on YARN-744:


The test failure is not related to this patch. Opened YARN-1425 to track it.

 Race condition in ApplicationMasterService.allocate .. It might process same 
 allocate request twice resulting in additional containers getting allocated.
 -

 Key: YARN-744
 URL: https://issues.apache.org/jira/browse/YARN-744
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi
Priority: Minor
 Attachments: MAPREDUCE-3899-branch-0.23.patch, 
 YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, 
 YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.2.patch, YARN-744.patch


 Looks like the lock taken in this is broken. It takes a lock on lastResponse 
 object and then puts a new lastResponse object into the map. At this point a 
 new thread entering this function will get a new lastResponse object and will 
 be able to take its lock and enter the critical section. Presumably we want 
 to limit one response per app attempt. So the lock could be taken on the 
 ApplicationAttemptId key of the response map object.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1425) TestRMRestart is failing on trunk

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827075#comment-13827075
 ] 

Omkar Vinit Joshi commented on YARN-1425:
-

Just discovered that MockRM.waitForState(appAttempt, RMAppAttemptState) simply 
ignores the passed-in application attempt and always considers the current 
application attempt (*RMAppAttempt attempt = app.getCurrentAppAttempt();*). 
Fixing it.
{code}
  public void waitForState(ApplicationAttemptId attemptId,
      RMAppAttemptState finalState) throws Exception {
    RMApp app = getRMContext().getRMApps().get(attemptId.getApplicationId());
    Assert.assertNotNull("app shouldn't be null", app);
    RMAppAttempt attempt = app.getCurrentAppAttempt();
    int timeoutSecs = 0;
    while (!finalState.equals(attempt.getAppAttemptState()) && timeoutSecs++ < 40) {
      System.out.println("AppAttempt : " + attemptId
          + " State is : " + attempt.getAppAttemptState()
          + " Waiting for state : " + finalState);
      Thread.sleep(1000);
    }
    System.out.println("Attempt State is : " + attempt.getAppAttemptState());
    Assert.assertEquals("Attempt state is not correct (timedout)", finalState,
        attempt.getAppAttemptState());
  }
{code}
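A minimal sketch of the fix being described (assuming RMApp exposes its attempts by id via getAppAttempts()): wait on the attempt that was actually passed in, not whatever attempt is currently active.
{code}
RMApp app = getRMContext().getRMApps().get(attemptId.getApplicationId());
Assert.assertNotNull("app shouldn't be null", app);
// Use the attempt the caller asked about, not the current attempt.
RMAppAttempt attempt = app.getAppAttempts().get(attemptId);
Assert.assertNotNull("attempt shouldn't be null", attempt);
{code}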

 TestRMRestart is failing on trunk
 -

 Key: YARN-1425
 URL: https://issues.apache.org/jira/browse/YARN-1425
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: error.log


 TestRMRestart is failing on trunk. Fixing it. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1425) TestRMRestart is failing on trunk

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1425:


Attachment: YARN-1425.1.patch

 TestRMRestart is failing on trunk
 -

 Key: YARN-1425
 URL: https://issues.apache.org/jira/browse/YARN-1425
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1425.1.patch, error.log


 TestRMRestart is failing on trunk. Fixing it. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking

2013-11-19 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1363:


Attachment: YARN-1363.1.patch

Work-in-progress patch: YARN-1363.1.patch

 Get / Cancel / Renew delegation token api should be non blocking
 

 Key: YARN-1363
 URL: https://issues.apache.org/jira/browse/YARN-1363
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1363.1.patch


 Today GetDelegationToken, CancelDelegationToken and RenewDelegationToken are 
 all blocking APIs.
 * As a part of these calls we try to update the RMStateStore, and that may 
 slow them down.
 * Since we have a limited number of client request handlers, we may fill them 
 up quickly (see the sketch below).
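An illustrative sketch of the non-blocking idea under discussion (storeDelegationToken is a hypothetical stand-in for the real RMStateStore call, not the actual patch): move the store write off the RPC handler thread so client handlers are not tied up on slow store operations.
{code}
private final ExecutorService storeExecutor = Executors.newSingleThreadExecutor();

void storeTokenNonBlocking(final RMDelegationTokenIdentifier ident,
    final long renewDate) {
  storeExecutor.submit(new Runnable() {
    @Override
    public void run() {
      // The RMStateStore write happens here, off the RPC handler thread.
      stateStore.storeDelegationToken(ident, renewDate);  // hypothetical call
    }
  });
}
{code}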



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1421) Node managers will not receive application finish event where containers ran before RM restart

2013-11-18 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1421:
---

 Summary: Node managers will not receive application finish event 
where containers ran before RM restart
 Key: YARN-1421
 URL: https://issues.apache.org/jira/browse/YARN-1421
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Critical


Problem :- Today for every application we track the node managers where 
container ran. So when application finishes it notifies all those node managers 
about application finish event (via node manager heartbeat). However if rm 
restarts then we forget this past information and those node managers will 
never get application finish event and will keep reporting finished 
applications.

Propose Solution :- Instead of remembering the node managers where containers 
ran for this particular application it would be better if we depend on node 
manager heartbeat to take this decision. i.e. when node manager heartbeats 
saying it is running application (app1, app2) then we should those 
application's status in RM's memory {code}rmContext.getRMApps(){code} and if 
either they are not found (very old applications) or they are in their final 
state (FINISHED, KILLED, FAILED) then we should immediately notify the node 
manager about the application finish event.
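A minimal sketch of the proposed heartbeat-time check (runningAppsReportedByNM and appsToCleanup are hypothetical local names; the real code path is the RM's node heartbeat handling):
{code}
List<ApplicationId> appsToCleanup = new ArrayList<ApplicationId>();
for (ApplicationId appId : runningAppsReportedByNM) {
  RMApp rmApp = rmContext.getRMApps().get(appId);
  boolean finishedOrUnknown = rmApp == null
      || rmApp.getState() == RMAppState.FINISHED
      || rmApp.getState() == RMAppState.KILLED
      || rmApp.getState() == RMAppState.FAILED;
  if (finishedOrUnknown) {
    // Tell the NM right away so it stops reporting this application.
    appsToCleanup.add(appId);
  }
}
// appsToCleanup would then be sent back in the heartbeat response.
{code}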



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1421) Node managers will not receive application finish event where containers ran before RM restart

2013-11-18 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1421:


Description: 
Problem :- Today for every application we track the node managers where 
containers ran. So when application finishes it notifies all those node 
managers about application finish event (via node manager heartbeat). However 
if rm restarts then we forget this past information and those node managers 
will never get application finish event and will keep reporting finished 
applications.

Proposed Solution :- Instead of remembering the node managers where containers 
ran for this particular application it would be better if we depend on node 
manager heartbeat to take this decision. i.e. when node manager heartbeats 
saying it is running application (app1, app2) then we should check those 
application's status in RM's memory {code}rmContext.getRMApps(){code} and if 
either they are not found (very old applications) or they are in their final 
state (FINISHED, KILLED, FAILED) then we should immediately notify the node 
manager about the application finish event. By doing this we are reducing the 
state which we need to store at RM after restart.

  was:
Problem :- Today for every application we track the node managers where 
container ran. So when application finishes it notifies all those node managers 
about application finish event (via node manager heartbeat). However if rm 
restarts then we forget this past information and those node managers will 
never get application finish event and will keep reporting finished 
applications.

Propose Solution :- Instead of remembering the node managers where containers 
ran for this particular application it would be better if we depend on node 
manager heartbeat to take this decision. i.e. when node manager heartbeats 
saying it is running application (app1, app2) then we should those 
application's status in RM's memory {code}rmContext.getRMApps(){code} and if 
either they are not found (very old applications) or they are in their final 
state (FINISHED, KILLED, FAILED) then we should immediately notify the node 
manager about the application finish event.


 Node managers will not receive application finish event where containers ran 
 before RM restart
 --

 Key: YARN-1421
 URL: https://issues.apache.org/jira/browse/YARN-1421
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Critical

 Problem :- Today for every application we track the node managers where 
 containers ran. So when application finishes it notifies all those node 
 managers about application finish event (via node manager heartbeat). However 
 if rm restarts then we forget this past information and those node managers 
 will never get application finish event and will keep reporting finished 
 applications.
 Proposed Solution :- Instead of remembering the node managers where 
 containers ran for this particular application it would be better if we 
 depend on node manager heartbeat to take this decision. i.e. when node 
 manager heartbeats saying it is running application (app1, app2) then we 
 should check those application's status in RM's memory 
 {code}rmContext.getRMApps(){code} and if either they are not found (very old 
 applications) or they are in their final state (FINISHED, KILLED, FAILED) 
 then we should immediately notify the node manager about the application 
 finish event. By doing this we are reducing the state which we need to store 
 at RM after restart.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-18 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.8.patch

Thanks [~vinodkv] for pointing it out.. I didn't understand it earlier. Adding a 
synchronized block to the service state change.

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, 
 YARN-674.7.patch, YARN-674.8.patch


 This was caused by YARN-280. A slow or down NameNode will make it look 
 like the RM is unavailable, as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-18 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.7.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
 YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch, 
 YARN-1210.7.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the meantime, new apps will proceed as usual while existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-18 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.9.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, 
 YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch


 This was caused by YARN-280. A slow or down NameNode will make it look 
 like the RM is unavailable, as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1422) RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a container is completing

2013-11-18 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825988#comment-13825988
 ] 

Omkar Vinit Joshi commented on YARN-1422:
-

Yes, this looks to be a problem.
Check this [synchronization locking problem | 
https://issues.apache.org/jira/browse/YARN-897?focusedCommentId=13706284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13706284].
The ordering should always be from the root queue to the leaf queue. I think 
there can be other places too where this ordering is mixed.
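An illustrative sketch of the ordering rule described above (not actual CapacityScheduler code; the method name is hypothetical): always take the parent (root-side) lock before the leaf lock, in every code path, so two threads can never hold the locks in opposite orders.
{code}
void completeContainerWithConsistentOrdering(ParentQueue parent, LeafQueue leaf) {
  synchronized (parent) {      // root-to-leaf: parent first ...
    synchronized (leaf) {      // ... then leaf
      // update leaf-queue accounting, then roll the change up to the parent
    }
  }
}
{code}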

 RM CapacityScheduler can deadlock when getQueueUserAclInfo() is called and a 
 container is completing
 

 Key: YARN-1422
 URL: https://issues.apache.org/jira/browse/YARN-1422
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.2.0
Reporter: Adam Kawa
Priority: Critical

 If getQueueUserAclInfo() on a parent/root queue (e.g. via 
 CapacityScheduler.getQueueUserAclInfo) is called, and a container is 
 completing, then the ResourceManager can deadlock. 
 It is similar to https://issues.apache.org/jira/browse/YARN-325. 
 *More details:*
 * Thread A
 1) In a synchronized block of code (a lockid 
 0xc18d8870=LeafQueue.class), LeafQueue.completedContainer wants to 
 inform the parent queue that a container is being completed and invokes 
 ParentQueue.completedContainer method.
 3) The ParentQueue.completedContainer waits to acquire a lock on itself (a 
 lockid 0xc1846350=ParentQueue.class) to enter a synchronized block of 
 code. It can not acquire this lock, because Thread B already has this lock.
 * Thread B
 0) A moment earlier, CapacityScheduler.getQueueUserAclInfo is called. This 
 method invokes a synchronized method on ParentQueue.class i.e. 
 ParentQueue.getQueueUserAclInfo (a lockid 
 0xc1846350=ParentQueue.class) and acquires the lock that Thread A will 
 be waiting for. 
 2) Unluckily, ParentQueue.getQueueUserAclInfo iterates over children queue 
 acls and it wants to run a synchronized method, LeafQueue.getQueueUserAclInfo, 
 but it does not have a lock on LeafQueue.class (a lockid 
 0xc18d8870=LeafQueue.class). This lock is already held by 
 LeafQueue.completedContainer in Thread A.
 The order that causes the deadlock: B0 -> A1 -> B2 -> A3.
 *Java Stacktrace*
 {code}
 Found one Java-level deadlock:
 =
 1956747953@qtp-109760451-1959:
   waiting to lock monitor 0x434e10c8 (object 0xc1846350, a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
   which is held by IPC Server handler 39 on 8032
 IPC Server handler 39 on 8032:
   waiting to lock monitor 0x422bbc58 (object 0xc18d8870, a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
   which is held by ResourceManager Event Processor
 ResourceManager Event Processor:
   waiting to lock monitor 0x434e10c8 (object 0xc1846350, a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
   which is held by IPC Server handler 39 on 8032
 Java stack information for the threads listed above:
 ===
 1956747953@qtp-109760451-1959:
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getUsedCapacity(ParentQueue.java:276)
   - waiting to lock 0xc1846350 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.init(CapacitySchedulerInfo.java:49)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:203)
   at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
   at 
 org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
   at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
   at 
 org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
   at 
 org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
   at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
   at 
 org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
   at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
   at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:76)
   at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
   at 
 

[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-18 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826055#comment-13826055
 ] 

Omkar Vinit Joshi commented on YARN-674:


[~bikassaha] I completely missed your comment. What you are saying will not 
occur.
{code}
pool.allowCoreThreadTimeOut(true);
{code}
This should time out core threads if there are any lying around.
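An illustrative sketch only, using the 5/50 pool sizes mentioned later in this digest for YARN-674 (not necessarily the committed defaults), to show the core-thread timeout behaviour:
{code}
ThreadPoolExecutor pool = new ThreadPoolExecutor(
    5,                                  // core pool size
    50,                                 // maximum pool size
    3, TimeUnit.SECONDS,                // keep-alive for idle threads
    new LinkedBlockingQueue<Runnable>());
// Idle core threads are reclaimed too, which is why threads "lying around"
// eventually time out.
pool.allowCoreThreadTimeOut(true);
{code}
Note that with an unbounded work queue like this the executor in practice stays at the core size; the point of the sketch is only the core-thread timeout behaviour referenced above.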

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, 
 YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch


 This was caused by YARN-280. A slow or down NameNode will make it look 
 like the RM is unavailable, as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-18 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826066#comment-13826066
 ] 

Omkar Vinit Joshi commented on YARN-674:


I think we should just ignore the findbugs warning.. it is never going to 
occur. Plus, TestRMRestart is passing locally... there must be some race 
condition there that is not related to this patch.

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, 
 YARN-674.7.patch, YARN-674.8.patch, YARN-674.9.patch


 This was caused by YARN-280. A slow or down NameNode will make it look 
 like the RM is unavailable, as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1416) InvalidStateTransitions getting reported in multiple test cases even though they pass

2013-11-15 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1416:
---

 Summary: InvalidStateTransitions getting reported in multiple test 
cases even though they pass
 Key: YARN-1416
 URL: https://issues.apache.org/jira/browse/YARN-1416
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Jian He


It might be worth checking why they are reporting this.
Test cases: TestRMAppTransitions, TestRM
There are a large number of such errors:
can't handle RMAppEventType.APP_UPDATE_SAVED at RMAppState.FAILED




--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.

2013-11-15 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824175#comment-13824175
 ] 

Omkar Vinit Joshi commented on YARN-1417:
-

Fixing this as a part of YARN-713 where I am restructuring the token generation 
logic.

 RM may issue expired container tokens to AM while issuing new containers.
 -

 Key: YARN-1417
 URL: https://issues.apache.org/jira/browse/YARN-1417
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 Today we create a new container token when we create a container in the RM as 
 part of the scheduling cycle. However, that container may get reserved or 
 assigned. If the container gets reserved and stays in the reserved state for 
 longer than the container token expiry interval, the RM will end up issuing a 
 container with an expired token.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1417) RM may issue expired container tokens to AM while issuing new containers.

2013-11-15 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1417:
---

 Summary: RM may issue expired container tokens to AM while issuing 
new containers.
 Key: YARN-1417
 URL: https://issues.apache.org/jira/browse/YARN-1417
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi


Today we create a new container token when we create a container in the RM as 
part of the scheduling cycle. However, that container may get reserved or 
assigned. If the container gets reserved and stays in the reserved state for 
longer than the container token expiry interval, the RM will end up issuing a 
container with an expired token.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable

2013-11-15 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-713:
---

Attachment: YARN-713.2.patch

 ResourceManager can exit unexpectedly if DNS is unavailable
 ---

 Key: YARN-713
 URL: https://issues.apache.org/jira/browse/YARN-713
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
Priority: Critical
 Fix For: 2.3.0

 Attachments: YARN-713.09052013.1.patch, YARN-713.09062013.1.patch, 
 YARN-713.1.patch, YARN-713.2.patch, YARN-713.20130910.1.patch, 
 YARN-713.patch, YARN-713.patch, YARN-713.patch, YARN-713.patch


 As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could 
 lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and 
 that ultimately would cause the RM to exit.  The RM should not exit during 
 DNS hiccups.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-14 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823100#comment-13823100
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

Thanks [~vinodkv]
bq. cleanupContainersOnNMResync: We are no longer making the call to 
getNodeStatusAndUpdateContainersInContext, can you please put a comment as to 
why - I believe this is so that NodeStatusUpdater can eventually take these 
statuses up when it reregisters.
Yes, they are used when the NM re-registers with the RM. Added a comment.

bq. use getContainerState instead of cloneAndGetContainerStatus?
They are different.

bq. Use RegisterNodeManagerRequest.newInstance() in registerWithRM?
bq. Similarly NodeStatus.newInstance, NodeHealthStatus.newInstance?
They were missing; added them and fixed NodeStatusUpdater.

bq. As of now because we kill all containers it's fine, but it's better to 
explicitly check for master-container's state during registration and then only 
send the event.
bq. Also put a comment as to why we are directly faking 
RMAppAttemptContainerFinishedEvent instead of informing RMContainerImpl.
But we don't know about the container today..right?

bq. Instead of sending and ignoring ATTEMPT_FAILED at FAILED state, we can skip 
sending this event by RMAppAttempt if the app was already in a final state?
Ok.. should I also remove the similar transition from FINISHED / KILLED?

Addressed all other comments.

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
 YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the meantime, new apps will proceed as usual while existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-14 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.6.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
 YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch, YARN-1210.6.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the meantime, new apps will proceed as usual while existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822026#comment-13822026
 ] 

Omkar Vinit Joshi commented on YARN-674:


Thanks [~vinodkv] 
bq. RMAppManager.submitApplication: Put a comment where you move apps to finish 
state saying we are doing this before token-renewal so that we don't renew 
tokens for finished apps.
Added a comment.

bq. isServiceStarted needs to be volatile?
No, it is updated only once, just when the service starts.

bq. handleDTRenewerEvent - handleDTRenewerAppSubmitEvent
done..

bq. Add a comment in handleDTRenewerEvent to indicate why DTRenewer is starting 
the app as opposed to RMAppManager.
added one..

bq. Instead of putting renewerCount in the main code path, you can access the 
thread count from ThreadPoolExecutor.getPoolSize() in the tests directly ?
moved this to test code.

bq. DelegationTokenRenewerAppSubmitEvent can be nested class inside 
DelegationTokenRenewer? This is not an event from outside the renewer. 
Similarly DelegationTokenRenewerEventType. Either nest them in, or create a 
separate package.
moved the events and eventType inside DTTokenRenewer.

bq. testInvalidDelegationTokenApplicationSubmit, 
testInvalidDTWithAddApplication: Seem similar but test different things. May be 
rename one or both?
renamed both..

bq. The other point is the default number of threads in the renewer. 5 is too 
small, may be bump it up to existing number of RPC threads - 50 or something in 
that range?
Using a thread pool with core pool size = 5 and max pool size = 50 (configurable).
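For reference, a minimal sketch of what such a pool could look like, using plain java.util.concurrent. The sizes match the values mentioned above; nothing here is the actual DelegationTokenRenewer code, and the queue/rejection choices are just one reasonable option.
{code}
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RenewerPoolSketch {

  // Core threads stay alive; extra threads (up to maxPoolSize) are created only
  // when all current threads are busy, because a SynchronousQueue never queues.
  // If all 50 threads are busy, CallerRunsPolicy runs the task on the caller.
  public static ThreadPoolExecutor createPool(int corePoolSize, int maxPoolSize) {
    return new ThreadPoolExecutor(corePoolSize, maxPoolSize,
        60L, TimeUnit.SECONDS,
        new SynchronousQueue<Runnable>(),
        new ThreadPoolExecutor.CallerRunsPolicy());
  }

  public static void main(String[] args) {
    ThreadPoolExecutor pool = createPool(5, 50);   // defaults discussed above
    pool.execute(new Runnable() {
      public void run() {
        System.out.println("renewing a delegation token...");
      }
    });
    pool.shutdown();
  }
}
{code}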

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-13 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.7.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch, 
 YARN-674.7.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart

2013-11-12 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820378#comment-13820378
 ] 

Omkar Vinit Joshi commented on YARN-1338:
-

Thanks [~jlowe] 
bq. I would rather not tie a checksum to this. Corruption of the file isn't 
related to whether the NM is restarting, and it seems odd to only check for 
corruption on restart rather than every time the resource is requested. IMHO we 
should treat checksums for localized resources as an orthogonal feature request 
to this. (It would also significantly slow down the recovery time if the NM had 
to checksum-compare everything in the distcache on startup.)
Yes, I completely agree; checksums should be an additional feature rather than 
done as a part of this. 

bq. So if we persist the LocalResourceRequest to LocalizedResource map then we 
can tell after a recovery whether we already have the requested resource or not 
when a new request arrives.
Agreed. This way we will have all the information we need to reconstruct the 
cache. 

bq. We have a very rough start on persisting the local cache state, and I plan 
on working on this in earnest in the next few weeks.
good ... 

Any thoughts on how and when we plan to store the container's resource requests 
and the newly downloaded resources in the persistent store?
* For the resource itself it is quite clear: when the download finishes and the 
resource is marked LOCALIZED we should save the info (the way RM restart does 
today for RMAppImpl: NEW ... NEW_SAVING ... SUBMITTED).
* For the container's resource requests it becomes a little trickier. We could 
persist them:
** when we initially get the resource requests for all the required resources 
at container start,
** or when an individual resource request gets satisfied (as it is added to the 
ref list of the LocalizedResource),
** or when all of the container's resources are downloaded / localized.
The 3rd option looks good to me because
* by then we will have information about all the localized resources; if 
downloading failed for any of them we frankly don't care about storing a 
partial success, so we can avoid the write, and
* when the container finishes / fails we can simply remove the entry (a rough 
sketch follows).
Any thoughts on whether we want to hold container start until all the writes to 
the store have been processed, or can we start in parallel? Parallel writes 
don't look good to me: if any write event is still in flight when the NM 
restarts, we won't know about those changes after the restart. On the other 
hand, if we wait for all the writes to go through, we delay container start by 
that much.
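To make the 3rd option concrete, here is a rough sketch of writing the container-to-resource mapping only once every resource is localized and removing it when the container finishes. All names are hypothetical (this is not the actual NM state-store API), and a real implementation would write to something durable rather than an in-memory map.
{code}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical store for the container -> localized-resource mapping.
class ContainerResourceStore {
  private final Map<String, List<String>> store =
      new ConcurrentHashMap<String, List<String>>();

  // Called only after ALL resources for the container have reached LOCALIZED,
  // so a failed or partial download never leaves a half-written record behind.
  void storeLocalizedResources(String containerId, List<String> localPaths) {
    store.put(containerId, localPaths);
  }

  // Called when the container finishes or fails; the record is no longer needed.
  void removeContainer(String containerId) {
    store.remove(containerId);
  }
}
{code}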

 Recover localized resource cache state upon nodemanager restart
 ---

 Key: YARN-1338
 URL: https://issues.apache.org/jira/browse/YARN-1338
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe

 Today when node manager restarts we clean up all the distributed cache files 
 from disk. This is definitely not ideal from 2 aspects.
 * For work preserving restart we definitely want them as running containers 
 are using them
 * For even non work preserving restart this will be useful in the sense that 
 we don't have to download them again if needed by future tasks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable

2013-11-12 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-713:
---

Attachment: YARN-713.1.patch

 ResourceManager can exit unexpectedly if DNS is unavailable
 ---

 Key: YARN-713
 URL: https://issues.apache.org/jira/browse/YARN-713
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
Priority: Critical
 Fix For: 2.3.0

 Attachments: YARN-713.09052013.1.patch, YARN-713.09062013.1.patch, 
 YARN-713.1.patch, YARN-713.20130910.1.patch, YARN-713.patch, YARN-713.patch, 
 YARN-713.patch, YARN-713.patch


 As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could 
 lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and 
 that ultimately would cause the RM to exit.  The RM should not exit during 
 DNS hiccups.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-744) Race condition in ApplicationMasterService.allocate .. It might process same allocate request twice resulting in additional containers getting allocated.

2013-11-12 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-744:
---

Attachment: YARN-744.1.patch

 Race condition in ApplicationMasterService.allocate .. It might process same 
 allocate request twice resulting in additional containers getting allocated.
 -

 Key: YARN-744
 URL: https://issues.apache.org/jira/browse/YARN-744
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi
Priority: Minor
 Attachments: MAPREDUCE-3899-branch-0.23.patch, 
 YARN-744-20130711.1.patch, YARN-744-20130715.1.patch, 
 YARN-744-20130726.1.patch, YARN-744.1.patch, YARN-744.patch


 Looks like the lock taken in this is broken. It takes a lock on lastResponse 
 object and then puts a new lastResponse object into the map. At this point a 
 new thread entering this function will get a new lastResponse object and will 
 be able to take its lock and enter the critical section. Presumably we want 
 to limit one response per app attempt. So the lock could be taken on the 
 ApplicationAttemptId key of the response map object.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-11 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13819358#comment-13819358
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

thanks [~jianhe]

bq. Please revert all files with import-only changes, CotainerLauncher etc.
bq. revert NodeManager, TestNodeManagerResync changes.
done..

bq. rename RegisterNodeManagerRequest.addAllContainerStatuses to 
setContainerStatuses to conform with convention namings.
bq. Bug in RegisterNodeManagerRequestPBImpl.addAllContainerStatuses(), you are 
appending the containers to the existing container list instead of set. also no 
need to call initFinishedContainers() inside. Can you see what 
NodeStatusPBImpl.(get/set)ContainerStatuses() is doing?
I am following the NodeHeartbeatResponse convention.

bq. RMAppAttemptImpl.getRecoveredFinalState not used, removed.
removed..

bq. wrong comment, am2 attempt should be at Launched state. There's one more 
such same wrong comment at line 537
fixed.

bq. waiting for 20 secs is too long for a unit test, can you pick a value as 
small as possible ?
I am making it 10 secs..

bq. no need to create new method waitForAppAttemptToExpire(), we can just use 
MockRM.waitForState()
bq.similarly for newly created methods, waitForRMToProcessAllEvents, 
waitForRMAppAttempts can be achieved by just waiting for the specific app or 
attempt state. That achieves the same result as waiting for all events get 
processed.
fixed..

bq. For the 3rd case, shouldn't we test that as you said in the comment all 
the stored attempts had finished then new attempt should be started immediately
The last part of the test case is actually testing exactly that.

bq. we also need one more test case that, if RM crashes before attempt initial 
state info is saved in RMStateStore. App will be recovered with no attempt 
associated with it. For that we have no chance to replay the AttemptRecovered 
logic to start a new attempt, App itself should be able to start a new attempt.
added one.


 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
 YARN-1210.4.patch, YARN-1210.4.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart

2013-11-11 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1338:


Description: Today when node manager restarts we clean up all the 

 Recover localized resource cache state upon nodemanager restart
 ---

 Key: YARN-1338
 URL: https://issues.apache.org/jira/browse/YARN-1338
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Ravi Prakash

 Today when node manager restarts we clean up all the 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart

2013-11-11 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13819678#comment-13819678
 ] 

Omkar Vinit Joshi commented on YARN-1338:
-

Here are certain things we may want to track as part of this.
* Info from LocalizedResource
** local disk path
** timestamp
** remote URL (do we need to trust that the old and new URLs are identical, 
i.e. not changed?)
** we store the resources inside the distributed cache in a hierarchical 
manner (to avoid the unix per-directory limit), so we may need to recover that 
layout too
** checksum?
* We will also need to track which containers are using each resource. It would 
be better to isolate this from the place where we store the LocalizedResource 
info, so that changes to it are minimal.
** Do we need to store the symlink we are creating?
Is anyone actively working on this?
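As a strawman, the per-resource record could look roughly like this; the class and field names are hypothetical and just mirror the bullet list above.
{code}
// Hypothetical snapshot of one LocalizedResource, as it might be persisted
// across an NM restart. Whether to also record per-container symlinks and a
// checksum is an open question above.
class LocalizedResourceRecord {
  String remoteUrl;      // where the resource was downloaded from
  long timestamp;        // remote modification time used for cache validation
  String localDiskPath;  // hierarchical path under the local distributed cache
  long size;             // bytes on disk, useful for cache-size accounting
}
{code}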

 Recover localized resource cache state upon nodemanager restart
 ---

 Key: YARN-1338
 URL: https://issues.apache.org/jira/browse/YARN-1338
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Ravi Prakash

 Today when node manager restarts we clean up all the distributed cache files 
 from disk. This is definitely not ideal from 2 aspects.
 * For work preserving restart we definitely want them as running containers 
 are using them
 * For even non work preserving restart this will be useful in the sense that 
 we don't have to download them again if needed by future tasks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-11 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.5.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
 YARN-1210.4.patch, YARN-1210.4.patch, YARN-1210.5.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-08 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13817941#comment-13817941
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

Attaching a rebased patch.
I slightly modified the logic of the RM restart app recovery code.
* If the application doesn't have any attempt, it will start a new attempt when 
we do submitApplication as a part of recovery.
* If the application has one or more application attempts, the attempt recovery 
takes place in 2 steps.
** All the application attempts except the last attempt are recovered first.
** When we do submitApplication as a part of application recovery we replay 
the last attempt.
*** If the last attempt has no finalRecoveredState stored, it is treated as one 
for which the AM may or may not have started/finished. So we move this 
application attempt into the LAUNCHED state, add it to the AMLivenessMonitor, 
and move the application to the RUNNING state.
*** If the last attempt was in FAILED/KILLED/FINISHED state, we replay that 
attempt's BaseFinalTransition by recovering the attempt synchronously here.
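A condensed sketch of that last-attempt replay decision, with illustrative enum and method names (not the exact RMAppImpl/RMAppAttemptImpl code):
{code}
// Illustrative only: replay the last recovered attempt based on whether a
// final state was persisted for it before the RM went down.
enum AttemptFinalState { NONE, FINISHED, FAILED, KILLED }

class LastAttemptRecoverySketch {
  void recoverLastAttempt(AttemptFinalState storedFinalState) {
    if (storedFinalState == AttemptFinalState.NONE) {
      // AM may or may not still be running: move the attempt to LAUNCHED,
      // register it with the AMLivenessMonitor and mark the app RUNNING.
      moveAttemptToLaunchedAndMonitor();
    } else {
      // The attempt had already reached a final state: replay its
      // BaseFinalTransition synchronously so the app can start a new attempt.
      replayFinalTransition(storedFinalState);
    }
  }

  void moveAttemptToLaunchedAndMonitor() { /* omitted */ }
  void replayFinalTransition(AttemptFinalState state) { /* omitted */ }
}
{code}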

Adding tests to cover the scenarios below:
* A new application attempt is not started until the previous AM container 
finish event is reported back to the RM as a part of NM registration.
* If the previous AM container finish event is never reported back (i.e. the 
node manager on which the AM container was running also went down), the 
AMLivenessMonitor should time out the previous attempt and start a new attempt.
* If all the stored attempts had finished, a new attempt should be started 
immediately.

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-08 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.4.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
 YARN-1210.4.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-08 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.4.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch, 
 YARN-1210.4.patch, YARN-1210.4.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-05 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814291#comment-13814291
 ] 

Omkar Vinit Joshi commented on YARN-674:


Thanks [~bikassaha] for the review
bq. We were intentionally going through the same submitApplication() method to 
make sure that all the initialization and setup code paths are consistently 
followed in both cases by keeping the code path identical as much as possible. 
The RM would submit a recovered application, in essence proxying a user 
submitting the application. Its a general pattern followed through the recovery 
logic - to be minimally invasive to the mainline code path so that we can avoid 
functional bugs as much as possible. Separating them into 2 methods has 
resulted in code duplication in both methods without any huge benefit that I 
can see. It also leave us susceptible to future code changes made in one code 
path and not the other.
I agree with your suggestion; reverting the changes. Discussed this with 
[~vinodkv] offline.

bq. Why is isSecurityEnabled() being checked at this internal level. The code 
should not even reach this point if security is not enabled. 
you have a point ..fixing it..

bq. Also why is it calling 
rmContext.getDelegationTokenRenewer().addApplication(event) instead of 
DelegationTokenRenewer.this.addApplication(). Same for 
rmContext.getDelegationTokenRenewer().applicationFinished(evt);
Makes sense...fixed it..

bq. Rename DelegationTokenRenewerThread to not have misleading Thread in the 
name ?
fixed.

bq. Can DelegationTokenRenewerAppSubmitEvent event objects have an event type 
different from VERIFY_AND_START_APPLICATION? If not, we dont need this check 
and we can change the constructor of DelegationTokenRenewerAppSubmitEvent to 
not expect an event type argument. It should set the 
VERIFY_AND_START_APPLICATION within the constructor.
fixed.. see the sketch at the end of this comment.

bq. @Private + @VisibleForTesting???
fixed.
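For illustration, a tiny sketch of the pattern being asked for above: the submit-event subclass pins its own event type in the constructor. Names here are simplified stand-ins, not the actual patch code.
{code}
// Illustrative: callers cannot pass an inconsistent type because the subclass
// fixes it in its constructor.
class RenewerEventSketch {
  enum Type { VERIFY_AND_START_APPLICATION, FINISH_APPLICATION }

  static class Event {
    final Type type;
    Event(Type type) { this.type = type; }
  }

  static class AppSubmitEvent extends Event {
    final String applicationId;
    AppSubmitEvent(String applicationId) {
      super(Type.VERIFY_AND_START_APPLICATION);  // always this type, by construction
      this.applicationId = applicationId;
    }
  }
}
{code}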


 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-05 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.5.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-05 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814496#comment-13814496
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

Thanks [~jianhe] for reviewing it.

{code}
Instead of passing running containers as parameter in 
RegisterNodeManagerRequest, is it possible to just call heartBeat immediately 
after registerCall and then unBlockNewContainerRequests ? That way we can take 
advantage of the existing heartbeat logic, cover other things like keep app 
alive for log aggregation after AM container completes.
Or at least we can send the list of ContainerStatus(including diagnostics) 
instead of just container Ids and also the list of keep-alive apps (separate 
jira)?
{code}
It makes sense to replace finishedContainers with containerStatuses.

bq. Unnecessary import changes in DefaultContainerExecutor.java and 
LinuxContainerExecutor, ContainerLaunch, ContainersLauncher
Actually I wanted those earlier because I had created a new ExitCode.java and 
wanted to access it from ResourceTrackerService. Now that we are sending 
container statuses from the node manager itself, that is no longer needed; fixed it.

bq. Finished containers may not necessary be killed. The containers can also 
normal finish and remain in the NM cache before NM resync.
Updated the cleanupContainers logic on the node manager side; now all the 
finished container statuses are preserved as-is (see the sketch at the end of 
this comment).

bq. wrong LOG class name.
:) fixed it..

bq. LogFactory.getLog(RMAppImpl.class);
removed.

bq. Isn't always the case that after this patch only the last attempt can be 
running ? a new attempt will not be launched until the previous attempt reports 
back it really exits. If this is case, it can be a bug.
We may only need to check that if the last attempt is finished or not.
It is actually checking whether any attempt is in a non-running state. Do you 
want me to check only the last attempt (by comparing application attempt ids)?

bq. should we return RUNNING or ACCEPTED for apps that are not in final state ? 
It's ok to return RUNNING in the scope of this patch because anyways we are 
launching a new attempt. Later on in working preserving restart, RM can crash 
before attempt register, attempt can register with RM after RM comes back in 
which case we can then move app from ACCEPTED to RUNNING?
Yes, right now I will keep it as RUNNING only. Today we don't have any 
information about whether the previous application master started and registered 
or not. Once we have that information we can probably do this.
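A rough sketch of the NM-side resync behavior discussed above; the class and field names here are illustrative, not the actual NodeStatusUpdater / ResourceTracker types.
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative resync flow: kill the still-running containers, keep the
// statuses of already-finished containers as they are, and report all of them
// at re-registration so the RM can detect a finished AM container.
class ResyncSketch {
  static class ContainerStatus {
    String containerId;
    boolean finished;
    int exitStatus;
    String diagnostics;
  }

  List<ContainerStatus> buildRegistrationStatuses(List<ContainerStatus> known) {
    List<ContainerStatus> toReport = new ArrayList<ContainerStatus>();
    for (ContainerStatus s : known) {
      if (!s.finished) {
        // a running container is killed as part of resync; record the outcome
        s.finished = true;
        s.exitStatus = -100;                 // illustrative "killed on resync" code
        s.diagnostics = "Killed on NM resync";
      }
      toReport.add(s);                       // finished statuses are kept as-is
    }
    return toReport;
  }
}
{code}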

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-05 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.3.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-05 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13814541#comment-13814541
 ] 

Omkar Vinit Joshi commented on YARN-674:


Thanks [~jianhe], [~bikassaha] .

bq. Saw this is changed back to asynchronous submission on recovery, the 
original intention was to prevent client from seeing the application as a new 
application. If asynchronously, the client can query the application before 
recover event gets processed, meaning before the application is fully recovered 
as some recover logic happens when app is processing the recover 
event(app.FinalTransition).
fixed to make sure that it gets updated synchronously.

bq. The assert doesnt make it to the production jar - so it wont catch anything 
on the cluster. Need to throw an exception here. If we dont want to crash the 
RM here then we can log and error. When the attempt state machine gets the 
event then it will crash on the async dispatcher thread if the event is not 
handled in the current state.
discussed with bikas offline.. this is fine.

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-04 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.2.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-04 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813109#comment-13813109
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

Completely removed the RECOVERED state; the rest of the patch is the same. The 
only major differences are:
* Before launching a new appAttempt the RM checks whether any of the application 
attempts was running before. If so, the RM waits instead of starting a new 
application attempt. If no application attempt is found in a running state (i.e. 
anything other than a final state), it launches a new application attempt.
* When the node manager receives the resync signal it kills all the running 
containers and then reports the killed containers back to the RM during 
registration. On receiving the container information the RM checks whether any 
of the reported containers is an AM container; if so, it sends a container_failed 
event to the related app attempt and eventually starts a new application attempt 
(a minimal sketch of this check follows).
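A minimal sketch of that RM-side check; the names are illustrative, and the real code would go through RMAppAttempt events rather than a direct method call.
{code}
import java.util.Map;
import java.util.Set;

// Illustrative: on NM registration, detect whether any reported finished
// container was an AM container and, if so, fail the corresponding attempt.
class AmContainerCheckSketch {
  // attemptId -> its AM (master) container id
  private final Map<String, String> amContainerByAttempt;

  AmContainerCheckSketch(Map<String, String> amContainerByAttempt) {
    this.amContainerByAttempt = amContainerByAttempt;
  }

  void onNodeRegistration(Set<String> finishedContainerIds) {
    for (Map.Entry<String, String> e : amContainerByAttempt.entrySet()) {
      if (finishedContainerIds.contains(e.getValue())) {
        // the previous AM really exited; the attempt can be failed so that a
        // new attempt is started
        sendAmContainerFinishedEvent(e.getKey(), e.getValue());
      }
    }
  }

  void sendAmContainerFinishedEvent(String attemptId, String containerId) { /* omitted */ }
}
{code}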

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-11-04 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813115#comment-13813115
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

cancelled the patch as it is based on YARN-674

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch, YARN-1210.2.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-04 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813560#comment-13813560
 ] 

Omkar Vinit Joshi commented on YARN-674:


Thanks [~vinodkv] for review...

bq. Does this patch also include YARN-1210? Seems like it, we should separate 
that code.
No. Anything specific? YARN-1210 is more about waiting for the older AM to finish 
before launching a new AM.

bq. Depending on the final patch, I think we should split 
RMAppManager.submitApp into two, one for regular submit and one for submit 
after recovery.
Splitting the method into 2.
* submitApplication - normal application submission
* submitRecoveredApplication - submitting recovered application

bq. RMAppState.java change is unnecessary.
fixed

bq. ForwardingEventHandler is a bottleneck for renewals now - especially during 
submission. We need to have a thread pool.
Created a fixed thread pool service with the thread count controllable via 
configuration (not adding this to yarn-default). Keeping the default thread 
count at 5; fair enough?

bq. Once we do the above, the old concurrency test should be added back.
yeah..added that test back..

bq. We are undoing most of YARN-1107. Good that we laid the groundwork there. 
Let's make sure we remove all the dead code. One comment stands out
Did I miss anything here? I didn't understand. I have not removed the comment, 
as it is still valid. 

bq. The newly added test can have race conditions? We may be lucky in the test, 
but in real life scenario, client has to submit app and poll for app failure 
due to invalid tokens
I think it will not. Yes, after clients submit the application they will have 
to keep polling to learn its status (whether it got accepted or failed due to 
token renewal).

bq. Similarly we should add a test for successful submission after renewal.
sure added one.. checking for RMAppEvent.START


 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-11-04 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.5.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch, YARN-674.5.patch


 This was caused by YARN-280. A slow or a down NameNode for will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-31 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.4.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, 
 YARN-674.4.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-10-31 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13810927#comment-13810927
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

submitting patch on top of YARN-674.

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1359) AMRMToken should not be sent to Container other than AM.

2013-10-30 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809610#comment-13809610
 ] 

Omkar Vinit Joshi commented on YARN-1359:
-

Today the node manager doesn't do this filtering of tokens.

Proposal :-
Let the node manager filter the AMRMToken out of the tokens when launching any 
container other than the AM. That way only the AM container is (truly) allowed 
to talk to the RM over the AMRM protocol.

Enhancements :- today the node manager doesn't know which container is the AM 
container, and there are a lot of problems because of this. So we first need a 
way to inform the node manager that a container is the AM. Since the node 
manager learns everything about a new container from the container token, it 
would be better to add an isAM flag inside the token. Thoughts?
(Note: we are anyway not encouraging users to talk to the RM from multiple 
containers sharing the same AMRMToken.)
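For illustration, a minimal sketch of where such filtering could happen, assuming the proposed isAM flag is available to the NM. This is a sketch of the idea, not the actual patch; only the standard Credentials/Token APIs are used.
{code}
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;
import org.apache.hadoop.yarn.security.AMRMTokenIdentifier;

// Illustrative only: drop the AMRMToken from the credentials handed to a
// container when the (proposed) isAM flag in the container token is false.
public class AmrmTokenFilterSketch {
  public static Credentials filterForContainer(Credentials original, boolean isAMContainer) {
    if (isAMContainer) {
      return original;               // the AM legitimately needs the AMRMToken
    }
    Credentials filtered = new Credentials();
    for (Token<? extends TokenIdentifier> token : original.getAllTokens()) {
      if (!AMRMTokenIdentifier.KIND_NAME.equals(token.getKind())) {
        filtered.addToken(token.getService(), token);
      }
    }
    // secret keys, if any, are intentionally not copied in this sketch
    return filtered;
  }
}
{code}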


 AMRMToken should not be sent to Container other than AM.
 

 Key: YARN-1359
 URL: https://issues.apache.org/jira/browse/YARN-1359
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi





--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1377) Log aggregation via node manager should expose expose a way to cancel aggregation at application or container level

2013-10-30 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1377:
---

 Summary: Log aggregation via node manager should expose expose a 
way to cancel aggregation at application or container level
 Key: YARN-1377
 URL: https://issues.apache.org/jira/browse/YARN-1377
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi


Today when an application finishes it starts aggregating all the logs, but that 
may slow down the whole process significantly.
There can be situations where certain containers overwrote the logs, say in 
multiple GBs. In these scenarios we need a way to cancel log aggregation for 
certain containers, either at the per-application level or at the per-container 
level.
Thoughts?



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1377) Log aggregation via node manager should expose expose a way to cancel aggregation at application or container level

2013-10-30 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1377:


Assignee: Xuan Gong

 Log aggregation via node manager should expose expose a way to cancel 
 aggregation at application or container level
 ---

 Key: YARN-1377
 URL: https://issues.apache.org/jira/browse/YARN-1377
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Xuan Gong

 Today when application finishes it starts aggregating all the logs but that 
 may slow down the whole process significantly...
 there can be situations where certain containers overwrote the logs, say in 
  multiple GBs. In these scenarios we need a way to cancel log aggregation 
 for certain containers. These can be at per application level or at per 
 container level.
 thoughts?



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1377) Log aggregation via node manager should expose expose a way to cancel log aggregation at application or container level

2013-10-30 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1377:


Summary: Log aggregation via node manager should expose expose a way to 
cancel log aggregation at application or container level  (was: Log aggregation 
via node manager should expose expose a way to cancel aggregation at 
application or container level)

 Log aggregation via node manager should expose expose a way to cancel log 
 aggregation at application or container level
 ---

 Key: YARN-1377
 URL: https://issues.apache.org/jira/browse/YARN-1377
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Xuan Gong

 Today when application finishes it starts aggregating all the logs but that 
 may slow down the whole process significantly...
 there can be situations where certain containers overwrote the logs, say in 
  multiple GBs. In these scenarios we need a way to cancel log aggregation 
 for certain containers. These can be at per application level or at per 
 container level.
 thoughts?



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking

2013-10-29 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1363:
---

 Summary: Get / Cancel / Renew delegation token api should be non 
blocking
 Key: YARN-1363
 URL: https://issues.apache.org/jira/browse/YARN-1363
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi


Today GetDelegationToken, CancelDelegationToken and RenewDelegationToken are all 
blocking APIs.
* As a part of these calls we try to update the RMStateStore, and that may be 
slow.
* Since we have a limited number of client request handlers, we may fill them 
up quickly.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1363) Get / Cancel / Renew delegation token api should be non blocking

2013-10-29 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808620#comment-13808620
 ] 

Omkar Vinit Joshi commented on YARN-1363:
-

Proposal :- 
* Retain the current behavior as it is (the synchronous one). 
* Add a configuration knob to enable asynchronous behavior. 
** If enabled, the Get / Renew / Cancel APIs will work as they do today, except 
that the updated DT is not saved to the RMStateStore synchronously.
** The client will have to make an additional call to check the status of the 
operation, passing the DT and the op [ Get / Renew / Cancel ]. 
*** ClientRMService remembers the status of the operation from the time the 
client requested it and the RMStateStore saved it, until
**** the first client request arrives to check its status [ keyed by (token, op) ],
**** or a timer (configurable, maybe 10 min) expires.
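A rough sketch of the status-tracking piece of this proposal. This is entirely hypothetical: no such tracker or check-status API exists in ClientRMService today, and all names below are made up.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker for asynchronous Get / Renew / Cancel delegation-token
// operations: the RM records the outcome once the state-store write completes
// and forgets it after the first status query or after a timeout.
class DtOpStatusTracker {
  enum Op { GET, RENEW, CANCEL }
  enum Status { PENDING, SUCCEEDED, FAILED }

  private static final long EXPIRY_MS = 10 * 60 * 1000L;  // the configurable timer above

  private static class Entry {
    Status status;
    long updatedAt;
  }

  private final Map<String, Entry> entries = new ConcurrentHashMap<String, Entry>();

  private String key(String tokenIdentifier, Op op) {
    return tokenIdentifier + "#" + op;
  }

  void record(String tokenIdentifier, Op op, Status status) {
    Entry e = new Entry();
    e.status = status;
    e.updatedAt = System.currentTimeMillis();
    entries.put(key(tokenIdentifier, op), e);
  }

  // The first query removes the entry; expired entries are treated as unknown.
  Status checkAndForget(String tokenIdentifier, Op op) {
    Entry e = entries.remove(key(tokenIdentifier, op));
    if (e == null || System.currentTimeMillis() - e.updatedAt > EXPIRY_MS) {
      return null;   // unknown: either never recorded or already expired
    }
    return e.status;
  }
}
{code}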

 Get / Cancel / Renew delegation token api should be non blocking
 

 Key: YARN-1363
 URL: https://issues.apache.org/jira/browse/YARN-1363
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 Today GetDelegationToken, CancelDelegationToken and RenewDelegationToken are 
 all blocking APIs.
 * As a part of these calls we try to update RMStateStore and that may slow it 
 down.
 * Now as we have limited number of client request handlers we may fill up 
 client handlers quickly.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-28 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807046#comment-13807046
 ] 

Omkar Vinit Joshi commented on YARN-674:


bq. parseCredentials is not asynchronous, right? Therefore, is it better to 
fail the application submission immediately instead of forcing the client the 
come back to check the status? In fact, in submitApplication, there're already 
several points where the application submission can fail immediately, even 
though the application START is handled asynchronously.
I am not sure about this, but no problem, I will add it.

{code}
Maybe you can do
  if (event instanceof DelegationTokenRenewerAppSubmitEvent) {
...
  }
to avoid the findbug warning?
{code}
The whole point of adding ExceptionType is to avoid this, right [~zjshen]? I am 
wondering why we are not getting a similar casting exception at other places.

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-28 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807503#comment-13807503
 ] 

Omkar Vinit Joshi commented on YARN-674:


Thanks [~zjshen] and [~jianhe] for reviewing..

bq. DelegationTokenRenewer.applicationFinished() is anyways canceling the token 
on a separate thread, do we need to funnel through the dispatcher as well ?
Discussed with [~jianhe] offline. We need this because a user may submit an 
application and then kill it immediately. The earlier submit event may still be 
in flight (in the dispatcher queue), and we could end up processing the 
application-finished event from the kill before it. To avoid this I am 
enqueuing the finish as well.

bq. It would be good to have an end-to-end test that an application submitted 
with an invalid token will be rejected, and verify the yarn client can get the 
Failed application status using MockRM.
Added the test


 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-28 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.3.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (YARN-1359) AMRMToken should not be sent to Container other than AM.

2013-10-28 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-1359:
---

 Summary: AMRMToken should not be sent to Container other than AM.
 Key: YARN-1359
 URL: https://issues.apache.org/jira/browse/YARN-1359
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805697#comment-13805697
 ] 

Omkar Vinit Joshi commented on YARN-674:


Thanks [~zjshen] for reviewing my patch
bq. I think the exception needs to be thrown, which is missing in your patch. 
The exception will notify the client that the app submission failed; otherwise 
the client will think the submission succeeded.
Yes, I have removed the error purposefully. Here is the thinking:
* Once the client submits the application, it should check the application 
status and will learn about a failed submission from there (see the sketch 
below):
** either when parsing the credentials fails,
** or when the initial token renewal fails.
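A minimal sketch of that client-side check, assuming the standard YarnClient 
API (error handling trimmed):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// After submission returns, poll the application report; a failed token
// renewal surfaces as a FAILED application with diagnostics instead of a
// synchronous error from the submit call itself.
public class CheckSubmissionOutcome {
  static void check(ApplicationId appId) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      ApplicationReport report = client.getApplicationReport(appId);
      if (report.getYarnApplicationState() == YarnApplicationState.FAILED) {
        System.err.println("Submission failed: " + report.getDiagnostics());
      }
    } finally {
      client.stop();
    }
  }
}
{code}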

bq. Since DelegationTokenRenewer#addApplication becomes asynchronous, what will 
be the impact if the application is already accepted and starts its life cycle 
while DelegationTokenRenewer is slow to handle the 
DelegationTokenRenewerAppSubmitEvent? Will the application fail somewhere else 
because a fresh token is unavailable?
The logic here is modified a bit: the app is submitted to the scheduler only 
after token renewal succeeds, not before. That is already the case today; the 
only problem is that we currently hold the client request while doing it. With 
this change that step becomes asynchronous.

bq. I noticed testConncurrentAddApplication has been removed. Does the change 
affect the current app submission?
No. There is no longer a problem with concurrent app submission since we funnel 
everything through the event handler anyway. The test is no longer required, so 
I removed it completely.

* Fixing findbugs warnings.
* Fixing the failed test case.




 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.2.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1350) Should not add Lost Node by NodeManager reboot

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805825#comment-13805825
 ] 

Omkar Vinit Joshi commented on YARN-1350:
-

[~sinchii] I have a basic question: why is your nodeId changing every time? 
Have you configured your NodeManager with an ephemeral port (0)? What is 
NM_ADDRESS set to? The RM will consider it the same node only when the newly 
restarted NodeManager reports with the same node id, i.e. host-name:port.

 Should not add Lost Node by NodeManager reboot
 --

 Key: YARN-1350
 URL: https://issues.apache.org/jira/browse/YARN-1350
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
 Attachments: NodeState.txt


 In the current trunk, when a NodeManager reboots, the node information from 
 before the reboot is treated as LOST.
 This happens because only the inactive node information is checked at the time 
 of the reboot.
 As a result, a Lost Node is reported even though the NodeManager is running on 
 every node.
 We should change this so that a NodeManager reboot does not register a Lost 
 Node.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi reassigned YARN-1252:
---

Assignee: Omkar Vinit Joshi

 Secure RM fails to start up in secure HA setup with Renewal request for 
 unknown token exception
 ---

 Key: YARN-1252
 URL: https://issues.apache.org/jira/browse/YARN-1252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.1-beta
Reporter: Arpit Gupta
Assignee: Omkar Vinit Joshi

 {code}
 2013-09-26 08:15:20,507 INFO  ipc.Server (Server.java:run(861)) - IPC Server 
 Responder: starting
 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation 
 (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException 
 as:rm/host@realm (auth:KERBEROS) 
 cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal 
 request for unknown token
 at 
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453)
 at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
 at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805827#comment-13805827
 ] 

Omkar Vinit Joshi commented on YARN-1252:
-

taking it over..


 Secure RM fails to start up in secure HA setup with Renewal request for 
 unknown token exception
 ---

 Key: YARN-1252
 URL: https://issues.apache.org/jira/browse/YARN-1252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.1-beta
Reporter: Arpit Gupta

 {code}
 2013-09-26 08:15:20,507 INFO  ipc.Server (Server.java:run(861)) - IPC Server 
 Responder: starting
 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation 
 (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException 
 as:rm/host@realm (auth:KERBEROS) 
 cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal 
 request for unknown token
 at 
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453)
 at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
 at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805828#comment-13805828
 ] 

Omkar Vinit Joshi commented on YARN-1252:
-

YARN-674 should solve this problem. Since token renewal is now asynchronous, if 
the token is unknown or the external system (the token-renewing system) is 
down, the application for which the token was submitted will be marked as 
failed without crashing the RM.

 Secure RM fails to start up in secure HA setup with Renewal request for 
 unknown token exception
 ---

 Key: YARN-1252
 URL: https://issues.apache.org/jira/browse/YARN-1252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.1-beta
Reporter: Arpit Gupta
Assignee: Omkar Vinit Joshi

 {code}
 2013-09-26 08:15:20,507 INFO  ipc.Server (Server.java:run(861)) - IPC Server 
 Responder: starting
 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation 
 (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException 
 as:rm/host@realm (auth:KERBEROS) 
 cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal 
 request for unknown token
 at 
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453)
 at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
 at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1252) Secure RM fails to start up in secure HA setup with Renewal request for unknown token exception

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805830#comment-13805830
 ] 

Omkar Vinit Joshi commented on YARN-1252:
-

[~vinodkv] [~jianhe] if you agree then we can close this.

 Secure RM fails to start up in secure HA setup with Renewal request for 
 unknown token exception
 ---

 Key: YARN-1252
 URL: https://issues.apache.org/jira/browse/YARN-1252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.1-beta
Reporter: Arpit Gupta
Assignee: Omkar Vinit Joshi

 {code}
 2013-09-26 08:15:20,507 INFO  ipc.Server (Server.java:run(861)) - IPC Server 
 Responder: starting
 2013-09-26 08:15:20,521 ERROR security.UserGroupInformation 
 (UserGroupInformation.java:doAs(1486)) - PriviledgedActionException 
 as:rm/host@realm (auth:KERBEROS) 
 cause:org.apache.hadoop.security.token.SecretManager$InvalidToken: Renewal 
 request for unknown token
 at 
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:388)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5934)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:453)
 at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:851)
 at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59650)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1350) Should not add Lost Node by NodeManager reboot

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805867#comment-13805867
 ] 

Omkar Vinit Joshi commented on YARN-1350:
-

That is mainly for a single-node cluster, to avoid port clashes. For a real 
cluster you should define a fixed port there (see the sketch below). If you 
agree I will close this as invalid.
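For reference, a small sketch of pinning the NodeManager address so the NodeId 
stays stable across restarts (host and port below are made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// With yarn.nodemanager.address pinned to a fixed port, a restarted NM
// re-registers under the same NodeId (host:port) instead of showing up as LOST.
public class PinNodeManagerAddress {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.NM_ADDRESS, "nm-host.example.com:45454"); // hypothetical
    System.out.println(conf.get(YarnConfiguration.NM_ADDRESS));
  }
}
{code}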

 Should not add Lost Node by NodeManager reboot
 --

 Key: YARN-1350
 URL: https://issues.apache.org/jira/browse/YARN-1350
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
 Attachments: NodeState.txt


 In the current trunk, when a NodeManager reboots, the node information from 
 before the reboot is treated as LOST.
 This happens because only the inactive node information is checked at the time 
 of the reboot.
 As a result, a Lost Node is reported even though the NodeManager is running on 
 every node.
 We should change this so that a NodeManager reboot does not register a Lost 
 Node.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (YARN-1350) Should not add Lost Node by NodeManager reboot

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi resolved YARN-1350.
-

Resolution: Invalid
  Assignee: Omkar Vinit Joshi

 Should not add Lost Node by NodeManager reboot
 --

 Key: YARN-1350
 URL: https://issues.apache.org/jira/browse/YARN-1350
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
Assignee: Omkar Vinit Joshi
 Attachments: NodeState.txt


 In the current trunk, when a NodeManager reboots, the node information from 
 before the reboot is treated as LOST.
 This happens because only the inactive node information is checked at the time 
 of the reboot.
 As a result, a Lost Node is reported even though the NodeManager is running on 
 every node.
 We should change this so that a NodeManager reboot does not register a Lost 
 Node.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805885#comment-13805885
 ] 

Omkar Vinit Joshi commented on YARN-674:


* The recent test failure doesn't seem to be related to this code; the test 
passes locally. Should I open a separate ticket for it?
* I am not sure how to fix that findbugs warning. Should I add it to 
exclude-findbug.xml as well? I tried the code below, and even Eclipse doesn't 
complain:
{code}
@Override
@SuppressWarnings(unchecked)
public void handle(DelegationTokenRenewerEvent event) {
  if (event.getType().equals(
  DelegationTokenRenewerEventType.VERIFY_AND_START_APPLICATION)) {
DelegationTokenRenewerAppSubmitEvent appSubmitEvt =
(DelegationTokenRenewerAppSubmitEvent) event;
handleDTRenewerEvent(appSubmitEvt);
  } else if (event.getType().equals(
  DelegationTokenRenewerEventType.FINISH_APPLICATION)) {
rmContext.getDelegationTokenRenewer().applicationFinished(event);
  }

}
{code}

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch, YARN-674.2.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1350) Should not add Lost Node by NodeManager reboot

2013-10-25 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13805908#comment-13805908
 ] 

Omkar Vinit Joshi commented on YARN-1350:
-

You should check out MiniYarnCluster.

 Should not add Lost Node by NodeManager reboot
 --

 Key: YARN-1350
 URL: https://issues.apache.org/jira/browse/YARN-1350
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
Assignee: Omkar Vinit Joshi
 Attachments: NodeState.txt


 In the current trunk, when a NodeManager reboots, the node information from 
 before the reboot is treated as LOST.
 This happens because only the inactive node information is checked at the time 
 of the reboot.
 As a result, a Lost Node is reported even though the NodeManager is running on 
 every node.
 We should change this so that a NodeManager reboot does not register a Lost 
 Node.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-24 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-674:
---

Attachment: YARN-674.1.patch

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-24 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13804834#comment-13804834
 ] 

Omkar Vinit Joshi commented on YARN-674:


Updating the patch; modifying the event flow:
* In the unsecured case nothing changes.
* In the secured case:
** For recovery:
*** ApplicationEvents are enqueued while we are recovering; when the service 
starts they get processed.
** For normal app submission (see the sketch after this list):
*** The event is enqueued into the separate DelegationTokenRenewer dispatcher 
queue and the client's request returns immediately.
*** If the token renewal succeeds, the renewer sends the START/RECOVER event; 
otherwise it fails the app.

* Testing:
** Updated the unit test to cover the new behavior.
** Manually tested on a secured cluster:
*** tested app submission with the default HDFS token, and it works;
*** tested the same while restarting the RM in between.
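For context, a rough, self-contained sketch of the submission part of that 
flow (hypothetical names, not the actual patch):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// submitApplication() only queues the renewal work and returns, so a slow or
// down NameNode no longer ties up the client-facing RPC handler. The app is
// forwarded to the scheduler only after a successful renewal; otherwise it is
// marked failed.
public class AsyncRenewalSketch {
  interface TokenRenewer { void renew(String appId) throws Exception; }
  interface AppCallback { void startApp(String appId); void failApp(String appId, String reason); }

  private final ExecutorService renewerQueue = Executors.newSingleThreadExecutor();
  private final TokenRenewer renewer;
  private final AppCallback callback;

  AsyncRenewalSketch(TokenRenewer renewer, AppCallback callback) {
    this.renewer = renewer;
    this.callback = callback;
  }

  // Returns immediately; the outcome is visible through the application status.
  void submitApplication(final String appId) {
    renewerQueue.execute(new Runnable() {
      public void run() {
        try {
          renewer.renew(appId);          // may block on a slow NameNode
          callback.startApp(appId);      // equivalent of the START/RECOVER event
        } catch (Exception e) {
          callback.failApp(appId, e.getMessage());
        }
      }
    });
  }
}
{code}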

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-674.1.patch


 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable

2013-10-23 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi reassigned YARN-674:
--

Assignee: Omkar Vinit Joshi  (was: Vinod Kumar Vavilapalli)

 Slow or failing DelegationToken renewals on submission itself make RM 
 unavailable
 -

 Key: YARN-674
 URL: https://issues.apache.org/jira/browse/YARN-674
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi

 This was caused by YARN-280. A slow or a down NameNode will make it look 
 like RM is unavailable as it may run out of RPC handlers due to blocked 
 client submissions.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1303) Allow multiple commands separating with ; in distributed-shell

2013-10-22 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13802146#comment-13802146
 ] 

Omkar Vinit Joshi commented on YARN-1303:
-

{code}
+          "For multiple shell scripts, combine them " +
+          "into one shell script");
{code}
Please remove this; I think it is intuitive. Similarly for the exception 
message:

{code}
+    if (shellCommand.contains(";") || shellCommand.contains("|")) {
+      throw new IllegalArgumentException(
+          "DistributedShell does not support multiple commands " +
+          "or command pipeline. Please create a shell script for " +
+          "them and use --shell_script option");
+    }
+    if (shellCommand.contains(">")) {
+      throw new IllegalArgumentException(
+          "Please create a shell script for redirected output " +
+          "and use --shell_script option");
+    }
+
{code}
This will not work on Windows. [link|
http://superuser.com/questions/62850/execute-multiple-commands-with-1-line-in-windows-commandline]
Please think about it and make the check not OS specific. By the way, do we 
really need to parse the command and tell the user that they are using multiple 
commands instead of the single allowed one? Now that we are putting this in the 
help message, I think these validation checks will just complicate things. 
Thoughts?


 Allow multiple commands separating with ; in distributed-shell
 

 Key: YARN-1303
 URL: https://issues.apache.org/jira/browse/YARN-1303
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: applications/distributed-shell
Reporter: Tassapol Athiapinya
Assignee: Xuan Gong
 Fix For: 2.2.1

 Attachments: YARN-1303.1.patch, YARN-1303.2.patch, YARN-1303.3.patch, 
 YARN-1303.3.patch, YARN-1303.4.patch, YARN-1303.4.patch, YARN-1303.5.patch, 
 YARN-1303.6.patch


 In a shell, we can do ls; ls to run 2 commands at once. 
 In distributed shell this does not work. We should improve it to allow this. 
 There are practical use cases that I know of for running multiple commands 
 or for setting environment variables before a command.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1314) Cannot pass more than 1 argument to shell command

2013-10-22 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13802223#comment-13802223
 ] 

Omkar Vinit Joshi commented on YARN-1314:
-

Thanks [~xgong], the latest patch looks good to me. Can you please comment on 
how you tested this manually? 

 Cannot pass more than 1 argument to shell command
 -

 Key: YARN-1314
 URL: https://issues.apache.org/jira/browse/YARN-1314
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications/distributed-shell
Reporter: Tassapol Athiapinya
Assignee: Xuan Gong
 Fix For: 2.2.1

 Attachments: YARN-1314.1.patch, YARN-1314.1.patch, YARN-1314.2.patch


 Distributed shell cannot accept more than 1 parameter in the arguments part.
 All of these commands are treated as 1 parameter:
 /usr/bin/yarn  org.apache.hadoop.yarn.applications.distributedshell.Client 
 -jar distributed shell jar -shell_command echo -shell_args 'My   name
 is  Teddy'
 /usr/bin/yarn  org.apache.hadoop.yarn.applications.distributedshell.Client 
 -jar distributed shell jar -shell_command echo -shell_args ''My   name'
 'is  Teddy''
 /usr/bin/yarn  org.apache.hadoop.yarn.applications.distributedshell.Client 
 -jar distributed shell jar -shell_command echo -shell_args 'My   name' 
'is  Teddy'



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1321) NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM to work correctly

2013-10-21 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13801095#comment-13801095
 ] 

Omkar Vinit Joshi commented on YARN-1321:
-

bq. Change containsNMToken to containsToken and removeNMToken to removeToken 
for consistency with getToken and setToken? Also, should setToken not be 
putToken?
Can you please make it consistent across all APIs as xxxNMToken()?

Can you please add a test case for the multi-AM use case? A rough sketch of 
that use case is below.
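For the record, roughly what the multi-AM usage could look like once the cache 
is per-instance (method names follow the patch under discussion and may still 
change):
{code}
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

// Each AM running in the JVM wires its own NMTokenCache into both its
// AMRMClient and NMClient, so NMTokens from different AMs no longer clash.
public class PerAmTokenCacheSketch {
  public static void main(String[] args) {
    NMTokenCache tokenCacheForAm1 = new NMTokenCache();

    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.setNMTokenCache(tokenCacheForAm1);

    NMClient nmClient = NMClient.createNMClient();
    nmClient.setNMTokenCache(tokenCacheForAm1);
    // a second AM in the same JVM would create and wire its own NMTokenCache
  }
}
{code}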

 NMTokenCache is a a singleton, prevents multiple AMs running in a single JVM 
 to work correctly
 --

 Key: YARN-1321
 URL: https://issues.apache.org/jira/browse/YARN-1321
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.2.0
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Priority: Blocker
 Fix For: 2.2.1

 Attachments: YARN-1321.patch, YARN-1321.patch, YARN-1321.patch


 NMTokenCache is a singleton. Because of this, when running multiple AMs in a 
 single JVM, NMTokens for the same node from different AMs step on each other, 
 and starting containers fails due to mismatched tokens.
 The error observed in the client side is something like:
 {code}
 ERROR org.apache.hadoop.security.UserGroupInformation: 
 PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) 
 cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request 
 to start container. 
 NMToken for application attempt : appattempt_1382038445650_0002_01 was 
 used for starting container with container token issued for application 
 attempt : appattempt_1382038445650_0001_01
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-895) RM crashes if it restarts while NameNode is in safe mode

2013-10-21 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-895:
---

Summary: RM crashes if it restarts while NameNode is in safe mode  (was: If 
NameNode is in safemode when RM restarts, RM should wait instead of crashing.)

 RM crashes if it restarts while NameNode is in safe mode
 

 Key: YARN-895
 URL: https://issues.apache.org/jira/browse/YARN-895
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-895.1.patch, YARN-895.patch






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-895) RM crashes if it restarts while NameNode is in safe mode

2013-10-21 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-895:
---

Description: (was: Today if RM restarts while name node is in safe mode 
then RM crashes. During safe mode modifications are not allowed )

 RM crashes if it restarts while NameNode is in safe mode
 

 Key: YARN-895
 URL: https://issues.apache.org/jira/browse/YARN-895
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-895.1.patch, YARN-895.patch






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-895) RM crashes if it restarts while NameNode is in safe mode

2013-10-21 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-895:
---

Description: Today, if the RM restarts while the NameNode is in safe mode, the 
RM crashes, since modifications are not allowed during safe mode.

 RM crashes if it restarts while NameNode is in safe mode
 

 Key: YARN-895
 URL: https://issues.apache.org/jira/browse/YARN-895
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-895.1.patch, YARN-895.patch


 Today, if the RM restarts while the NameNode is in safe mode, the RM crashes, 
 since modifications are not allowed during safe mode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-1121) RMStateStore should flush all pending store events before closing

2013-10-21 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi reassigned YARN-1121:
---

Assignee: Omkar Vinit Joshi

 RMStateStore should flush all pending store events before closing
 -

 Key: YARN-1121
 URL: https://issues.apache.org/jira/browse/YARN-1121
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi
 Fix For: 2.2.1


 on serviceStop it should wait for all internal pending events to drain before 
 stopping.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing

2013-10-21 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13801409#comment-13801409
 ] 

Omkar Vinit Joshi commented on YARN-1121:
-

[~bikassaha] we have a single dispatcher queue, so should we ignore other 
events when the RM is going down and selectively process only the RM state 
store write events (see the sketch below)? In any case we will have only a 
very short window before we actually get kill -9.
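A tiny illustration of the drain-before-stop idea (hypothetical, not the real 
AsyncDispatcher/RMStateStore code):
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// On stop, refuse new events but keep processing whatever store writes are
// already queued, so no accepted state update is silently dropped.
public class DrainingStoreSketch {
  private final BlockingQueue<Runnable> pendingWrites = new LinkedBlockingQueue<Runnable>();
  private volatile boolean stopped = false;

  public void enqueue(Runnable storeWrite) {
    if (!stopped) {
      pendingWrites.add(storeWrite);
    }
  }

  public void stop() {
    stopped = true;                       // no new events accepted
    Runnable write;
    while ((write = pendingWrites.poll()) != null) {
      write.run();                        // flush already-queued writes
    }
  }
}
{code}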

 RMStateStore should flush all pending store events before closing
 -

 Key: YARN-1121
 URL: https://issues.apache.org/jira/browse/YARN-1121
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
Assignee: Omkar Vinit Joshi
 Fix For: 2.2.1


 on serviceStop it should wait for all internal pending events to drain before 
 stopping.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1053) Diagnostic message from ContainerExitEvent is ignored in ContainerImpl

2013-10-18 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1053:


Priority: Blocker  (was: Major)

 Diagnostic message from ContainerExitEvent is ignored in ContainerImpl
 --

 Key: YARN-1053
 URL: https://issues.apache.org/jira/browse/YARN-1053
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi
Priority: Blocker
  Labels: newbie
 Fix For: 2.3.0, 2.2.1

 Attachments: YARN-1053.20130809.patch


 If the container launch fails then we send ContainerExitEvent. This event 
 contains exitCode and diagnostic message. Today we are ignoring diagnostic 
 message while handling this event inside ContainerImpl. Fixing it as it is 
 useful in diagnosing the failure.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery

2013-10-18 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1185:


Attachment: YARN-1185.3.patch

 FileSystemRMStateStore can leave partial files that prevent subsequent 
 recovery
 ---

 Key: YARN-1185
 URL: https://issues.apache.org/jira/browse/YARN-1185
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1185.1.patch, YARN-1185.2.patch, YARN-1185.3.patch


 FileSystemRMStateStore writes directly to the destination file when storing 
 state. However if the RM were to crash in the middle of the write, the 
 recovery method could encounter a partially-written file and either outright 
 crash during recovery or silently load incomplete state.
 To avoid this, the data should be written to a temporary file and renamed to 
 the destination file afterwards.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1321) NMTokenCache should not be a singleton

2013-10-18 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799736#comment-13799736
 ] 

Omkar Vinit Joshi commented on YARN-1321:
-

Why are you in fact running multiple AMs inside the same JVM? As per YARN we 
never have multiple AMs per JVM/process, so this is definitely not a blocker. 
Please explain the use case for running multiple AMs inside the same process. 
If you really want to run it that way, why not just update NMTokenCache but 
keep the single-AM case as the default? Even then I don't see why you are 
doing this.

 NMTokenCache should not be a singleton
 --

 Key: YARN-1321
 URL: https://issues.apache.org/jira/browse/YARN-1321
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.2.0
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Priority: Blocker
 Fix For: 2.2.1


 NMTokenCache is a singleton. Because of this, when running multiple AMs in a 
 single JVM, NMTokens for the same node from different AMs step on each other, 
 and starting containers fails due to mismatched tokens.
 The error observed in the client side is something like:
 {code}
 ERROR org.apache.hadoop.security.UserGroupInformation: 
 PriviledgedActionException as:llama (auth:PROXY) via llama (auth:SIMPLE) 
 cause:org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request 
 to start container. 
 NMToken for application attempt : appattempt_1382038445650_0002_01 was 
 used for starting container with container token issued for application 
 attempt : appattempt_1382038445650_0001_01
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-10-17 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798485#comment-13798485
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

taking it over.

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He

 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-10-17 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi reassigned YARN-1210:
---

Assignee: Omkar Vinit Joshi  (was: Jian He)

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi

 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-10-17 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-1210:


Attachment: YARN-1210.1.patch

 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

2013-10-17 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798620#comment-13798620
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-

Summarizing the current patch:
* After RMAppAttempts are recovered, all of the attempts are moved into the 
LAUNCHED state. After YARN-891 we will know the final state of the earlier 
finished application attempts, so based on that we can decide where the current 
app attempt should transition to. On the RECOVER event:
** it will move to the LAUNCHED state if it was the last running app attempt;
** otherwise it will move to FAILED / KILLED / another terminal application 
attempt state.
* When the NM RESYNCs, containers will be killed and the NM will re-register 
with the RM, reporting the already running containers. On the RM side, if any 
of those containers turns out to be an earlier AM container, we fail that app 
attempt and immediately start a new one. If we never receive the AM's finished 
containerId during a future NM registration, the AMLivelinessMonitor will 
eventually expire, fail the running app attempt, and start a new one (see the 
sketch below).
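Roughly, the RECOVER-time decision described above (hypothetical types; the 
real logic lives in the RMAppAttempt state machine):
{code}
// Earlier attempts with a recorded final state (available after YARN-891)
// recover straight into that terminal state; only the last running attempt
// resumes in LAUNCHED and waits for the NM/AM to report back.
public class AttemptRecoverySketch {
  enum AttemptState { LAUNCHED, FINISHED, FAILED, KILLED }

  static AttemptState onRecover(AttemptState recordedFinalState, boolean isLastAttempt) {
    if (recordedFinalState != null) {
      return recordedFinalState;          // FINISHED / FAILED / KILLED
    }
    return isLastAttempt ? AttemptState.LAUNCHED : AttemptState.FAILED;
  }
}
{code}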


 During RM restart, RM should start a new attempt only when previous attempt 
 exits for real
 --

 Key: YARN-1210
 URL: https://issues.apache.org/jira/browse/YARN-1210
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Omkar Vinit Joshi
 Attachments: YARN-1210.1.patch


 When RM recovers, it can wait for existing AMs to contact RM back and then 
 kill them forcefully before even starting a new AM. Worst case, RM will start 
 a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
 we'll minimize multiple AMs racing with each other. This can help issues with 
 downstream components like Pig, Hive and Oozie during RM restart.
 In the mean while, new apps will proceed as usual as existing apps wait for 
 recovery.
 This can continue to be useful after work-preserving restart, so that AMs 
 which can properly sync back up with RM can continue to run and those that 
 don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

