[jira] [Updated] (YARN-4109) Exception on RM scheduler page loading with labels

2015-09-04 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4109:
---
Description: 
Configure node labels and load the scheduler page.
On each reload of the page, the exception below is thrown in the logs:


{code}
2015-09-03 11:27:08,544 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:139)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:663)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:615)
    at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1211)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.hadoop.yarn.webapp.WebAppException: Error rendering block: nestLevel=10 expected 5
{code}

[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730566#comment-14730566
 ] 

Hudson commented on YARN-4103:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #345 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/345/])
YARN-4103. RM WebServices missing scheme for appattempts logLinks. Contributed 
by Jonathan Eagles. (vvasudev: rev 40d222e862063dc6c474cc6e8de0dce6c4395012)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java


> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.7.2
>
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730669#comment-14730669
 ] 

Hudson commented on YARN-4103:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #351 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/351/])
YARN-4103. RM WebServices missing scheme for appattempts logLinks. Contributed 
by Jonathan Eagles. (vvasudev: rev 40d222e862063dc6c474cc6e8de0dce6c4395012)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java
* hadoop-yarn-project/CHANGES.txt


> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.7.2
>
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730551#comment-14730551
 ] 

Hudson commented on YARN-4103:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8402 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8402/])
YARN-4103. RM WebServices missing scheme for appattempts logLinks. Contributed 
by Jonathan Eagles. (vvasudev: rev 40d222e862063dc6c474cc6e8de0dce6c4395012)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java


> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.7.2
>
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730626#comment-14730626
 ] 

Hudson commented on YARN-4103:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1082 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1082/])
YARN-4103. RM WebServices missing scheme for appattempts logLinks. Contributed 
by Jonathan Eagles. (vvasudev: rev 40d222e862063dc6c474cc6e8de0dce6c4395012)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java


> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.7.2
>
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730812#comment-14730812
 ] 

Hudson commented on YARN-4103:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2272 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2272/])
YARN-4103. RM WebServices missing scheme for appattempts logLinks. Contributed 
by Jonathan Eagles. (vvasudev: rev 40d222e862063dc6c474cc6e8de0dce6c4395012)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java


> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.7.2
>
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730820#comment-14730820
 ] 

Hudson commented on YARN-4103:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #334 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/334/])
YARN-4103. RM WebServices missing scheme for appattempts logLinks. Contributed 
by Jonathan Eagles. (vvasudev: rev 40d222e862063dc6c474cc6e8de0dce6c4395012)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java
* hadoop-yarn-project/CHANGES.txt


> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.7.2
>
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS

2015-09-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730887#comment-14730887
 ] 

Jason Lowe commented on YARN-3942:
--

bq. Since we know the file size to be read, could we return a message saying 
something like "scanning file size FOO. Expect BAR latency"?

I'm not a UI expert, but given the timeline store is just a REST backend to the 
real UI, this seems tricky to do in practice.  The UI javascript is doing a 
bunch of separate GETs to the various REST endpoints and expecting the results, 
but we'd have to return something else that says "I'm not done yet" and expect 
the UI to do something sane with that.  If we do this over the normal endpoints 
it will break the timelineserver API for existing clients.  Granted, we're 
already sorta breaking it by not supporting some cross-app queries that were 
supported in the past.

> Timeline store to read events from HDFS
> ---
>
> Key: YARN-3942
> URL: https://issues.apache.org/jira/browse/YARN-3942
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-3942.001.patch
>
>
> This adds a new timeline store plugin that is intended as a stop-gap measure 
> to mitigate some of the issues we've seen with ATS v1 while waiting for ATS 
> v2.  The intent of this plugin is to provide a workable solution for running 
> the Tez UI against the timeline server on large-scale clusters running many 
> thousands of jobs per day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730906#comment-14730906
 ] 

Jason Lowe commented on YARN-4105:
--

Test failures are unrelated.  Committing this.

> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: <...>, qconsumed: <..., vCores:361>, consumed: <...>, 
> limit: <..., vCores:755>. The headroom calculation returns 88064 where there is only 1536 
> left in the queue because DRC effectively compares by vcores. It then caused a 
> deadlock because the RM container allocator thought there is still space for a 
> mapper and won't preempt a reducer in a full queue to schedule a mapper. 
> Propose a fix with componentwiseMin. 
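
For illustration of the difference (with made-up numbers, not the production values elided above), a small sketch contrasting Resources.min under the dominant resource calculator with Resources.componentwiseMin:

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomSketch {
  public static void main(String[] args) {
    ResourceCalculator drc = new DominantResourceCalculator();
    Resource cluster = Resource.newInstance(1024 * 1024, 1024);
    // Made-up numbers: lots of memory but few vcores left in the queue,
    // little memory but many vcores left under the user limit.
    Resource queueAvailable = Resource.newInstance(88064, 4);
    Resource userLimitAvailable = Resource.newInstance(1536, 400);

    // min() picks ONE of the two resources by comparing dominant shares,
    // so the non-dominant dimension can be wildly overstated.
    Resource byDominantShare =
        Resources.min(drc, cluster, queueAvailable, userLimitAvailable);

    // componentwiseMin() takes the minimum of each dimension independently,
    // which is what a headroom value should report.
    Resource perComponent =
        Resources.componentwiseMin(queueAvailable, userLimitAvailable);

    System.out.println("min by dominant share: " + byDominantShare);
    System.out.println("componentwise min:     " + perComponent);
  }
}
{code}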



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4103) RM WebServices missing scheme for appattempts logLinks

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730759#comment-14730759
 ] 

Hudson commented on YARN-4103:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2294 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2294/])
YARN-4103. RM WebServices missing scheme for appattempts logLinks. Contributed 
by Jonathan Eagles. (vvasudev: rev 40d222e862063dc6c474cc6e8de0dce6c4395012)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java


> RM WebServices missing scheme for appattempts logLinks
> --
>
> Key: YARN-4103
> URL: https://issues.apache.org/jira/browse/YARN-4103
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 2.7.2
>
> Attachments: YARN-4103.1.patch, YARN-4103.2.patch, YARN-4103.3.patch
>
>
> all App Attempt Info logLinks begin with "//" instead of "http://" or 
> "https://"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue

2015-09-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730856#comment-14730856
 ] 

Jason Lowe commented on YARN-4059:
--

I think at a high level that can work, since we can use the reservation to 
track the time per request.  However there are some details with the way 
reservations currently work that will cause problems.  Here's an extreme 
example:

Cluster is big and almost completely empty.  Rack R is completely full but all 
the other racks are completely empty.  Lots of apps are trying to be scheduled 
on the cluster.  App A is at the front of the scheduling queue, wants nothing 
but a lot of containers on rack R, and has a very big user limit.  When a node 
shows up that isn't in rack R, we'll place a reservation on it which only asks 
for a small fraction of the overall node's capability.  However since a node 
can only contain one reservation at a time, nothing else can be scheduled on 
that node even though it's got plenty of space.  If the app has enough user 
limit to put a reservation on each node not in rack R then we've locked out the 
whole cluster for the node-local-wait duration of app A.  Even if we don't lock 
out the whole cluster, app A is essentially locking out an entire node for each 
reservation it is making until it finds locality or the locality wait period 
ends.  That's going to slow down scheduling in general.

The problem is that reservation assumes the node is full, hence there would 
only ever be one reservation per node.  So we would either need to support 
handling multiple reservations on a node or modify the algorithm to use a 
combination of containers and reservations.  We could use reservations when the 
node is not big enough to allocate the container we want to place, but we would 
use a container allocation to "reserve" space on a node if the node actually 
has space.  We would _not_ give the container to the app until the 
node-local-wait expired, and we would kill the container and re-alloc on a node 
with locality if it arrives within the wait period.  That would allow other 
apps to schedule on the node if we have placed all the "reserved while waiting 
for locality" containers and the node still has space or other things.

I think we also need to refine the algorithm a bit so it will move 
reservations/containers as locality improves.  For example app needs host A 
which is totally full but the rest of the nodes on that rack are totally empty. 
 It initially reserves on an off-rack node since that's the first that 
heartbeated.  Again, peephole scheduling isn't helping here.  It would be 
unfortunate to have the app wait around for a node-local allocation only to 
give up and use an off-rack allocation because that's where it happened to 
initially reserve. If we initially reserve off-rack but then later find a 
rack-local placement then we should migrate the reservation to improve the 
fallback allocation if we never get node-local.

> Preemption should delay assignments back to the preempted queue
> ---
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster to heartbeat 
> in, multiplied by the number of containers that could run on a single 
> node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.
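
A back-of-the-envelope sketch of that penalty-time estimate; the parameter names and example values below are illustrative, not taken from any patch:

{code}
public final class PreemptionPenaltySketch {

  private PreemptionPenaltySketch() {
  }

  // Worst case: one container assigned per node heartbeat, so refilling the
  // freed space takes roughly maxContainersPerNode rounds of heartbeats.
  static long penaltyMs(long preemptionMonitorIntervalMs,
      long nodeHeartbeatIntervalMs, int maxContainersPerNode) {
    long refillMs = nodeHeartbeatIntervalMs * maxContainersPerNode;
    return Math.max(preemptionMonitorIntervalMs, refillMs);
  }

  public static void main(String[] args) {
    // e.g. 15s monitor cycle, 1s heartbeats, ~100 containers per node
    System.out.println(penaltyMs(15000L, 1000L, 100) + " ms"); // ~100 seconds
  }
}
{code}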



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730966#comment-14730966
 ] 

Hudson commented on YARN-4105:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8403 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8403/])
YARN-4105. Capacity Scheduler headroom for DRF is wrong. Contributed by Chang 
Li (jlowe: rev 6eaca2e3634a88dc55689e8960352d6248c424d9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java


> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.2
>
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: <...>, qconsumed: <..., vCores:361>, consumed: <...>, 
> limit: <..., vCores:755>. The headroom calculation returns 88064 where there is only 1536 
> left in the queue because DRC effectively compares by vcores. It then caused a 
> deadlock because the RM container allocator thought there is still space for a 
> mapper and won't preempt a reducer in a full queue to schedule a mapper. 
> Propose a fix with componentwiseMin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-04 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731056#comment-14731056
 ] 

Junping Du commented on YARN-3901:
--

Thanks Vrushali for updating the patch. This is a great (and huge) piece of work, 
which also means it may need more rounds of review and could receive more 
criticism than the normal case. Thanks again for being patient.
After going through the code, I have a few comments (omitting some ideas that 
duplicate Joep's or Li's comments):

In HBaseTimelineWriterImpl.java,
For isApplicationFinished() and getApplicationFinishedTime(), the event list in 
TimelineEntity is a SortedSet, so can we use last() to retrieve the last event 
instead of iterating over every element? There shouldn't be any other events after 
FINISHED_EVENT_TYPE, should there? In addition, because these methods are general 
enough, we can consider moving them to the TimelineUtils class so they can be 
reused by other classes. Also, some indentation issues in this class need to be 
fixed.
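
A minimal sketch of the last()-based lookup being suggested; the Event stand-in, getId(), and the assumed sort order are illustrations, not the real timeline service API:

{code}
import java.util.SortedSet;

final class FinishedEventSketch {

  // Stand-in for the timeline event type; only the id is needed here.
  interface Event {
    String getId();
  }

  private FinishedEventSketch() {
  }

  static boolean isApplicationFinished(SortedSet<? extends Event> events,
      String finishedEventType) {
    if (events == null || events.isEmpty()) {
      return false;
    }
    // Events are assumed sorted with the newest last, so last() is enough;
    // no need to iterate over every element.
    return finishedEventType.equals(events.last().getId());
  }
}
{code}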

In ColumnHelper.java,
{code}
+  for (Attribute attribute : attributes) {
+    if (attribute != null) {
+      p.setAttribute(attribute.getName(), attribute.getValue());
+    }
+  }
{code}
Do we expect a null element to be added to attributes? If not, we should fail with an 
NPE or another exception instead of ignoring it silently.
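
A hedged sketch of failing fast instead of skipping a null attribute; the Attribute interface below is a stand-in with the same getName()/getValue() shape as in the snippet:

{code}
import java.util.Objects;

import org.apache.hadoop.hbase.client.Put;

final class AttributeSetterSketch {

  // Stand-in for the attribute type used in the snippet above.
  interface Attribute {
    String getName();

    byte[] getValue();
  }

  private AttributeSetterSketch() {
  }

  static void setAttributes(Put p, Attribute... attributes) {
    for (Attribute attribute : attributes) {
      // Fail loudly on a null element rather than silently dropping it.
      Objects.requireNonNull(attribute, "null attribute passed in");
      p.setAttribute(attribute.getName(), attribute.getValue());
    }
  }
}
{code}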

In ColumnPrefix.java, there is an indentation issue in the Javadoc.

In TimelineWriterUtils.java,
I think getIncomingAttributes() tries to clone an array of attributes while 
appending an extra attribute from AggregationOperations. Maybe we should have a 
javadoc to describe it. The three if/else cases seem unnecessary and can be 
combined.

I didn't go into the coprocessor classes very deeply, but I agree with Joep's 
comments above that they need more Javadoc to explain what the outstanding 
methods are doing.
In FlowRunCoprocessor.java, getTagFromAttribute() looks like we are using an 
exception to distinguish the normal case when matching a string against enum 
elements. Can we improve it by using EnumUtils?
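
A hedged sketch of the EnumUtils idea (Apache Commons Lang 3); TagType here is a made-up enum, not the real coprocessor type:

{code}
import org.apache.commons.lang3.EnumUtils;

final class TagLookupSketch {

  // Made-up example enum standing in for the real aggregation tag types.
  enum TagType { MIN_START_TIME, MAX_END_TIME, RUNNING_APPS }

  private TagLookupSketch() {
  }

  static TagType fromAttributeName(String name) {
    // Returns null on no match instead of throwing IllegalArgumentException,
    // so the non-matching case is not handled via exceptions.
    return EnumUtils.getEnum(TagType.class, name);
  }
}
{code}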

In AggregationCompactionDimension.java,
I think the only usage here is to provide a method getAttribute() which returns 
an attribute object mixed with the app_id (as a byte array). If so, why make this 
an enum class instead of a regular class, given that APPLICATION_ID is the only 
element? Maybe a more straightforward way is to have a utility class that provides 
getAttribute() directly.

In AggregationOperations.java,
Indentation issues.

I haven't fully gone through the code around the flow activity table; more comments 
should come in my second round of review.

Some quick checks on the test code, for TestHBaseTimelineWriterImplFlowRun.java,
{code}
+  Result r1 = table1.get(g);
+  if (r1 != null && !r1.isEmpty()) {
+    Map<byte[], byte[]> values = r1.getFamilyMap(FlowRunColumnFamily.INFO
+        .getBytes());
+    assertEquals(2, r1.size());
...
{code}
Do we accept r1 being a null or empty result? I don't think so, so maybe we should 
check the size of r1 earlier so that we do not ignore real failure cases.
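
A hedged sketch of the stricter check (JUnit 4 style); the table1/g names follow the quoted snippet, the rest is illustrative:

{code}
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertNotNull;

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

final class FlowRunAssertSketch {

  private FlowRunAssertSketch() {
  }

  // Assert up front that the row exists and is non-empty, so a missing row
  // fails the test instead of silently skipping the assertions.
  static void assertFlowRunRow(Table table1, Get g, int expectedCells)
      throws IOException {
    Result r1 = table1.get(g);
    assertNotNull("no row returned for the flow run", r1);
    assertFalse("row returned for the flow run is empty", r1.isEmpty());
    assertEquals(expectedCells, r1.size());
  }
}
{code}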


> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set 

[jira] [Commented] (YARN-2884) Proxying all AM-RM communications

2015-09-04 Thread Kishore Chaliparambil (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730991#comment-14730991
 ] 

Kishore Chaliparambil commented on YARN-2884:
-

I investigated the Findbugs and test failures. 
The test failures seem to be transient and do not happen on local builds. 
Also, the Findbugs report is empty and has no information.


> Proxying all AM-RM communications
> -
>
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Carlo Curino
>Assignee: Kishore Chaliparambil
> Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, 
> YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V2.patch, 
> YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch, 
> YARN-2884-V6.patch, YARN-2884-V7.patch, YARN-2884-V8.patch, YARN-2884-V9.patch
>
>
> We introduce the notion of an RMProxy, running on each node (or once per 
> rack). Upon start the AM is forced (via tokens and configuration) to direct 
> all its requests to a new service running on the NM that provides a proxy to 
> the central RM. 
> This gives us a place to:
> 1) perform distributed scheduling decisions
> 2) throttle misbehaving AMs
> 3) mask the access to a federation of RMs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3223) Resource update during NM graceful decommission

2015-09-04 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou updated YARN-3223:
-
Attachment: (was: YARN-3223-v0.patch)

> Resource update during NM graceful decommission
> ---
>
> Key: YARN-3223
> URL: https://issues.apache.org/jira/browse/YARN-3223
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.1
>Reporter: Junping Du
>Assignee: Brook Zhou
> Attachments: YARN-3223-v0.1.patch
>
>
> During NM graceful decommission, we should handle resource update properly, 
> include: make RMNode keep track of old resource for possible rollback, keep 
> available resource to 0 and used resource get updated when
> container finished.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730978#comment-14730978
 ] 

Hudson commented on YARN-4105:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #352 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/352/])
YARN-4105. Capacity Scheduler headroom for DRF is wrong. Contributed by Chang 
Li (jlowe: rev 6eaca2e3634a88dc55689e8960352d6248c424d9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java


> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.2
>
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: <...>, qconsumed: <..., vCores:361>, consumed: <...>, 
> limit: <..., vCores:755>. The headroom calculation returns 88064 where there is only 1536 
> left in the queue because DRC effectively compares by vcores. It then caused a 
> deadlock because the RM container allocator thought there is still space for a 
> mapper and won't preempt a reducer in a full queue to schedule a mapper. 
> Propose a fix with componentwiseMin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-09-04 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731048#comment-14731048
 ] 

MENG DING commented on YARN-4108:
-

Hi, [~leftnoteasy]

I also feel that the logic you proposed is a good starting point overall. I just 
want to confirm that I understand it correctly. For pending asks with hard 
locality requirements, I think this logic works best. However, for other 
pending asks, are we able to achieve optimal preemption (i.e. sufficiently 
preemptable resources with the lowest cost of preemption, as per [~jlowe])? For 
example, just because {{node.available + preemptable > 
application.next_request}} doesn't necessarily mean that the preemption cost is 
the lowest on this node. Maybe we need a combination of the reservation 
continuous-looking and delayed scheduling mechanisms to ensure that we have 
calculated the preemption cost on enough hosts for the pending ask. But then I 
feel this approach might be too expensive ...
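
To make that check concrete, a hedged sketch using the Resources utility class; the parameter names are illustrative, and this only tests feasibility, not preemption cost:

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

final class PreemptionFeasibilitySketch {

  private PreemptionFeasibilitySketch() {
  }

  // True if the pending ask could fit on this node once the identified
  // preemptable containers were released; says nothing about whether this
  // node is the cheapest place to preempt.
  static boolean couldSatisfyOnNode(Resource nodeAvailable,
      Resource preemptableOnNode, Resource nextRequest) {
    Resource potentiallyFree = Resources.add(nodeAvailable, preemptableOnNode);
    return Resources.fitsIn(nextRequest, potentiallyFree);
  }
}
{code}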

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*:
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-261) Ability to kill AM attempts

2015-09-04 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731080#comment-14731080
 ] 

Andrey Klochkov commented on YARN-261:
--

[~rohithsharma], please feel free to reassign to yourself. I tried to rebase 
but the patch is old and rebasing is not straightforward.

> Ability to kill AM attempts
> ---
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api
>Affects Versions: 2.0.3-alpha
>Reporter: Jason Lowe
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
> YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
> YARN-261--n7.patch, YARN-261.patch
>
>
> It would be nice if clients could ask for an AM attempt to be killed.  This 
> is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the 
> AM supports recovery, and a particular AM attempt is stuck.  Currently if 
> this occurs the user's only recourse is to kill the entire application, 
> requiring them to resubmit a new application and potentially breaking 
> downstream dependent jobs if it's part of a bigger workflow.  Killing the 
> attempt would allow a new attempt to be started by the RM without killing the 
> entire application, and if the AM supports recovery it could potentially save 
> a lot of work.  It could also be useful in workflow scenarios where the 
> failure of the entire application kills the workflow, but the ability to kill 
> an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-04 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731173#comment-14731173
 ] 

Vrushali C commented on YARN-3901:
--

Thanks [~djp] , appreciate the feedback! I too am figuring out how to make this 
more readable and sincerely appreciate everyone's time and efforts in reviewing 
it. 

Some responses:
bq. TestHBaseTimelineWriterImplFlowRun.java: Do we accept r1 being a null or 
empty result? I don't think so, so maybe we should check the size of r1 earlier 
so that we do not ignore real failure cases.

Right, I will add an assert not null here.

bq. In TimelineWriterUtils.java, I think getIncomingAttributes() tries to clone 
an array of attributes while appending an extra attribute from 
AggregationOperations. Maybe we should have a javadoc to describe it. The three 
if/else cases seem unnecessary and can be combined.
I think I will update this method a bit more and add more comments so that it 
explains well what the code is doing.

bq. In ColumnHelper.java, do we expect a null element to be added to attributes? 
If not, we should fail with an NPE or another exception instead of ignoring it 
silently.
Hmm. This method in ColumnHelper is called from several places and is nested 
between many calls from the HBase writer down to here. Since the list of attributes 
is a variable-length list of parameters, it could be null or turn out to be empty 
if some function in between decides to remove an attribute, so this was more of 
a safety check. The list of Attributes can be modified at several places in the 
call stack, so it is not actually an error if it reaches this point as an 
empty list. But I will think over this a bit more. 

bq. For isApplicationFinished() and getApplicationFinishedTime(), the event 
list in TimelineEntity is a SortedSet, so can we use last() to retrieve the 
last event instead of iterating over every element? There shouldn't be any other 
events after FINISHED_EVENT_TYPE, should there?
Ah, I did not know that it was a sorted set, will update the code accordingly.

bq. In addition, because these methods are general enough, we can consider 
moving them to the TimelineUtils class so they can be reused by other classes.
Sounds good, will refactor it.

Looks like the indentation is a bit off in some places, I will update the 
formatting as recommended by [~gtCarrera9], [~jrottinghuis] and [~djp].


> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 
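
As a rough illustration of the row key layout described above (not the actual schema code), a sketch that joins the components with the "!" separator:

{code}
import java.nio.charset.StandardCharsets;

final class FlowRunRowKeySketch {

  private static final String SEPARATOR = "!";

  private FlowRunRowKeySketch() {
  }

  // cluster ! user ! flow ! flow run id, encoded as UTF-8 bytes. The real
  // implementation also needs to escape separators and encode the run id so
  // that rows sort as intended; this sketch skips that.
  static byte[] flowRunRowKey(String cluster, String user, String flow,
      long flowRunId) {
    String key = cluster + SEPARATOR + user + SEPARATOR + flow + SEPARATOR
        + flowRunId;
    return key.getBytes(StandardCharsets.UTF_8);
  }
}
{code}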



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4116) refactor ColumnHelper read* methods

2015-09-04 Thread Sangjin Lee (JIRA)
Sangjin Lee created YARN-4116:
-

 Summary: refactor ColumnHelper read* methods
 Key: YARN-4116
 URL: https://issues.apache.org/jira/browse/YARN-4116
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Sangjin Lee


Currently we have several ColumnHelper.read* methods that are slightly 
different in terms of their initial conditions and behave differently accordingly. 
We may want to refactor them so that code reuse is strong and the API 
stays reasonable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-261) Ability to kill AM attempts

2015-09-04 Thread Andrey Klochkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Klochkov reassigned YARN-261:


Assignee: (was: Andrey Klochkov)

> Ability to kill AM attempts
> ---
>
> Key: YARN-261
> URL: https://issues.apache.org/jira/browse/YARN-261
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api
>Affects Versions: 2.0.3-alpha
>Reporter: Jason Lowe
> Attachments: YARN-261--n2.patch, YARN-261--n3.patch, 
> YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, 
> YARN-261--n7.patch, YARN-261.patch
>
>
> It would be nice if clients could ask for an AM attempt to be killed.  This 
> is analogous to the task attempt kill support provided by MapReduce.
> This feature would be useful in a scenario where AM retries are enabled, the 
> AM supports recovery, and a particular AM attempt is stuck.  Currently if 
> this occurs the user's only recourse is to kill the entire application, 
> requiring them to resubmit a new application and potentially breaking 
> downstream dependent jobs if it's part of a bigger workflow.  Killing the 
> attempt would allow a new attempt to be started by the RM without killing the 
> entire application, and if the AM supports recovery it could potentially save 
> a lot of work.  It could also be useful in workflow scenarios where the 
> failure of the entire application kills the workflow, but the ability to kill 
> an attempt can keep the workflow going if the subsequent attempt succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731192#comment-14731192
 ] 

Wangda Tan commented on YARN-4108:
--

[~mding],

Agree, so far I haven't considered how to lower the preemption cost. IMHO, 
it's best to find a solution that minimizes preemption and makes sure all preempted 
containers can be used. But we can get things correct first before optimizing it 
:).

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*:
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731234#comment-14731234
 ] 

Hudson commented on YARN-4105:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #335 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/335/])
YARN-4105. Capacity Scheduler headroom for DRF is wrong. Contributed by Chang 
Li (jlowe: rev 6eaca2e3634a88dc55689e8960352d6248c424d9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java


> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.2
>
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: <...>, qconsumed: <..., vCores:361>, consumed: <...>, 
> limit: <..., vCores:755>. The headroom calculation returns 88064 where there is only 1536 
> left in the queue because DRC effectively compares by vcores. It then caused a 
> deadlock because the RM container allocator thought there is still space for a 
> mapper and won't preempt a reducer in a full queue to schedule a mapper. 
> Propose a fix with componentwiseMin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731103#comment-14731103
 ] 

Hudson commented on YARN-4105:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1083 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1083/])
YARN-4105. Capacity Scheduler headroom for DRF is wrong. Contributed by Chang 
Li (jlowe: rev 6eaca2e3634a88dc55689e8960352d6248c424d9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt


> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.2
>
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: <...>, qconsumed: <..., vCores:361>, consumed: <...>, 
> limit: <..., vCores:755>. The headroom calculation returns 88064 where there is only 1536 
> left in the queue because DRC effectively compares by vcores. It then caused a 
> deadlock because the RM container allocator thought there is still space for a 
> mapper and won't preempt a reducer in a full queue to schedule a mapper. 
> Propose a fix with componentwiseMin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-04 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-4115:

Attachment: YARN-4115.001.patch

Change the default log level to Debug
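
A hedged sketch of that kind of change (the surrounding ContainerManagementProtocolProxy code is not reproduced here):

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

final class ProxyLoggingSketch {

  private static final Log LOG = LogFactory.getLog(ProxyLoggingSketch.class);

  private ProxyLoggingSketch() {
  }

  // Demote the per-container "Opening proxy" message from INFO to DEBUG so it
  // no longer spams client logs on every container launch.
  static void logOpeningProxy(String containerManagerBindAddr) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Opening proxy : " + containerManagerBindAddr);
    }
  }
}
{code}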

> Reduce loglevel of ContainerManagementProtocolProxy to Debug
> 
>
> Key: YARN-4115
> URL: https://issues.apache.org/jira/browse/YARN-4115
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4115.001.patch
>
>
> We see log spam like: Aug 28, 1:57:52.441 PM INFO 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy 
> Opening proxy : :8041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4081) Add support for multiple resource types in the Resource class

2015-09-04 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4081:

Attachment: YARN-4081-YARN-3926.005.patch

[~leftnoteasy] pointed out that the current patch allows NONE and UNBOUNDED to 
be modified. Uploaded a new patch to fix that.
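
Not the actual YARN-4081 patch, but a hedged sketch of one common way to keep constants like NONE and UNBOUNDED from being modified: back them with a read-only Resource whose setters throw.

{code}
import org.apache.hadoop.yarn.api.records.Resource;

// Illustrative only; the real Resource class has more methods to cover.
final class ImmutableResourceSketch extends Resource {

  private final int memory;
  private final int vcores;

  ImmutableResourceSketch(int memory, int vcores) {
    this.memory = memory;
    this.vcores = vcores;
  }

  @Override
  public int getMemory() {
    return memory;
  }

  @Override
  public void setMemory(int memory) {
    throw new UnsupportedOperationException("immutable Resource");
  }

  @Override
  public int getVirtualCores() {
    return vcores;
  }

  @Override
  public void setVirtualCores(int vcores) {
    throw new UnsupportedOperationException("immutable Resource");
  }

  @Override
  public int compareTo(Resource other) {
    int diff = this.memory - other.getMemory();
    return diff == 0 ? this.vcores - other.getVirtualCores() : diff;
  }
}
{code}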

> Add support for multiple resource types in the Resource class
> -
>
> Key: YARN-4081
> URL: https://issues.apache.org/jira/browse/YARN-4081
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4081-YARN-3926.001.patch, 
> YARN-4081-YARN-3926.002.patch, YARN-4081-YARN-3926.003.patch, 
> YARN-4081-YARN-3926.004.patch, YARN-4081-YARN-3926.005.patch
>
>
> For adding support for multiple resource types, we need to add support for 
> this in the Resource class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731157#comment-14731157
 ] 

Hudson commented on YARN-4105:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2295 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2295/])
YARN-4105. Capacity Scheduler headroom for DRF is wrong. Contributed by Chang 
Li (jlowe: rev 6eaca2e3634a88dc55689e8960352d6248c424d9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java


> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.2
>
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: <...>, qconsumed: <..., vCores:361>, consumed: <...>, 
> limit: <..., vCores:755>. The headroom calculation returns 88064 where there is only 1536 
> left in the queue because DRC effectively compares by vcores. It then caused a 
> deadlock because the RM container allocator thought there is still space for a 
> mapper and won't preempt a reducer in a full queue to schedule a mapper. 
> Propose a fix with componentwiseMin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731178#comment-14731178
 ] 

Jian He commented on YARN-4087:
---

Thanks, Junping! I ran the timeout test locally and it passes fine.

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in a 
> production environment.
> 2. Fix the state-store to also notify the app/attempt if a state-store error is 
> ignored, so that the app/attempt is not stuck in a *_SAVING state.
> 3. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running; YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-04 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-4115:
---

 Summary: Reduce loglevel of ContainerManagementProtocolProxy to 
Debug
 Key: YARN-4115
 URL: https://issues.apache.org/jira/browse/YARN-4115
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Priority: Minor


We see log spam like: Aug 28, 1:57:52.441 PM  INFO
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy 
Opening proxy : :8041
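
A rough sketch of the proposed change, with illustrative class and field names 
rather than the actual ContainerManagementProtocolProxy code:
{code}
// Sketch only: move the per-proxy message behind a debug guard so it no
// longer floods the logs at INFO. Names here are illustrative.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ProxyLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ProxyLoggingSketch.class);

  void openProxy(String nodeAddress) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Opening proxy : " + nodeAddress);
    }
    // ... actual proxy creation would happen here ...
  }
}
{code}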



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731203#comment-14731203
 ] 

Hudson commented on YARN-4105:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #346 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/346/])
YARN-4105. Capacity Scheduler headroom for DRF is wrong. Contributed by Chang 
Li (jlowe: rev 6eaca2e3634a88dc55689e8960352d6248c424d9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java


> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.2
>
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: , qconsumed:  vCores:361>, consumed:  limit:  vCores:755>. The headroom calculation returns 88064 when there is only 1536 
> left in the queue, because DRC effectively compares by vCores. It then caused a 
> deadlock because the RM container allocator thought there was still space for a 
> mapper and would not preempt a reducer in a full queue to schedule a mapper. 
> Propose to fix with componentwiseMin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3676) Disregard 'assignMultiple' directive while scheduling apps with NODE_LOCAL resource requests

2015-09-04 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731244#comment-14731244
 ] 

Anubhav Dhoot commented on YARN-3676:
-

Thanks [~asuresh] for working on this. I see the patch continues assigning on 
the node if you have *any* app with a specific request on that node. But 
the scheduling attempt (via queueMgr.getRootQueue().assignContainer(node)) does 
not restrict which apps will get an allocation on that node. So one could end up 
assigning the next container on the node to an app which may not have a 
specific request for that node. 
I see two choices:
a) Smaller change - allow subsequent assignments only for the node-local-only 
apps (you already have that list in the map). That can end up prioritizing an 
application's node-local request over other applications.
b) Bigger change - once we have picked the app based on priority, allow it 
to assign multiple containers if there are multiple node-local requests for 
that node.
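
A minimal sketch of option (a), using hypothetical names rather than actual 
FairScheduler members:
{code}
// Sketch of option (a): when assignMultiple is false and one container has
// already been assigned on this heartbeat, only apps with a pending NODE_LOCAL
// request for this node remain candidates. All names are illustrative.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NodeLocalAssignSketch {
  // node -> apps that still have a pending NODE_LOCAL request on that node
  private Map<String, Set<String>> nodeLocalAppsByNode;

  List<String> candidateApps(String node, List<String> runnableApps,
      boolean firstAssignmentDone, boolean assignMultiple) {
    if (assignMultiple || !firstAssignmentDone) {
      return runnableApps;                    // normal behavior unchanged
    }
    Set<String> nodeLocal = nodeLocalAppsByNode.get(node);
    if (nodeLocal == null || nodeLocal.isEmpty()) {
      return Collections.emptyList();         // stop assigning on this node
    }
    List<String> filtered = new ArrayList<String>();
    for (String app : runnableApps) {
      if (nodeLocal.contains(app)) {
        filtered.add(app);                    // only node-local demand continues
      }
    }
    return filtered;
  }
}
{code}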
 

> Disregard 'assignMultiple' directive while scheduling apps with NODE_LOCAL 
> resource requests
> 
>
> Key: YARN-3676
> URL: https://issues.apache.org/jira/browse/YARN-3676
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: YARN-3676.1.patch, YARN-3676.2.patch, YARN-3676.3.patch, 
> YARN-3676.4.patch, YARN-3676.5.patch
>
>
> AssignMultiple is generally set to false to prevent overloading a Node (for 
> eg, new NMs that have just joined)
> A possible scheduling optimization would be to disregard this directive for 
> apps whose allowed locality is NODE_LOCAL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4105) Capacity Scheduler headroom for DRF is wrong

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731273#comment-14731273
 ] 

Hudson commented on YARN-4105:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2273 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2273/])
YARN-4105. Capacity Scheduler headroom for DRF is wrong. Contributed by Chang 
Li (jlowe: rev 6eaca2e3634a88dc55689e8960352d6248c424d9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt


> Capacity Scheduler headroom for DRF is wrong
> 
>
> Key: YARN-4105
> URL: https://issues.apache.org/jira/browse/YARN-4105
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Chang Li
>Assignee: Chang Li
> Fix For: 2.7.2
>
> Attachments: YARN-4105.2.patch, YARN-4105.3.patch, YARN-4105.4.patch, 
> YARN-4105.patch
>
>
> Related to the problem discussed in YARN-1857, but the min method is flawed 
> when we are using DRC. We have run into a real scenario in production where 
> queueCapacity: , qconsumed:  vCores:361>, consumed:  limit:  vCores:755>. The headroom calculation returns 88064 when there is only 1536 
> left in the queue, because DRC effectively compares by vCores. It then caused a 
> deadlock because the RM container allocator thought there was still space for a 
> mapper and would not preempt a reducer in a full queue to schedule a mapper. 
> Propose to fix with componentwiseMin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4113) RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731454#comment-14731454
 ] 

Wangda Tan commented on YARN-4113:
--

[~sunilg], thanks, please go ahead!

> RM should respect retry-interval when uses RetryPolicies.RETRY_FOREVER
> --
>
> Key: YARN-4113
> URL: https://issues.apache.org/jira/browse/YARN-4113
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Sunil G
>Priority: Critical
>
> Found one issue in how RMProxy initializes the RetryPolicy, in 
> RMProxy#createRetryPolicy: when rmConnectWaitMS is set to -1 (wait forever), 
> it uses RetryPolicies.RETRY_FOREVER, which doesn't respect the 
> {{yarn.resourcemanager.connect.retry-interval.ms}} setting.
> RetryPolicies.RETRY_FOREVER uses 0 as the interval. When I ran the test 
> {{TestYarnClient#testShouldNotRetryForeverForNonNetworkExceptions}} without a 
> properly set up localhost name, it wrote 14G of DEBUG exception messages to the 
> system before it died. This would be very bad if we did the same thing in a 
> production cluster.
> We should fix two places:
> - Make RETRY_FOREVER able to take the retry-interval as a constructor parameter.
> - Respect the retry-interval when we use the RETRY_FOREVER policy.
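
A standalone sketch of the intended behavior (retry forever, but sleep the 
configured interval between attempts); this is not Hadoop's RetryPolicies 
implementation, just an illustration:
{code}
// Sketch: retry forever, but sleep retryIntervalMs between attempts instead of
// spinning with a 0 ms interval and flooding the logs.
import java.util.concurrent.Callable;

public class RetryForeverWithIntervalSketch {
  static <T> T retryForever(Callable<T> action, long retryIntervalMs)
      throws InterruptedException {
    while (true) {
      try {
        return action.call();
      } catch (Exception e) {
        // Log once per attempt (at DEBUG in real code), then back off.
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}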



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4115) Reduce loglevel of ContainerManagementProtocolProxy to Debug

2015-09-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731405#comment-14731405
 ] 

Hadoop QA commented on YARN-4115:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 57s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   9m 14s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 25s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 21s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 31s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   0m 53s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   7m  0s | Tests failed in 
hadoop-yarn-client. |
| | |  49m 32s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.client.api.impl.TestNMClient |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754242/YARN-4115.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 30db1ad |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9010/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9010/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9010/console |


This message was automatically generated.

> Reduce loglevel of ContainerManagementProtocolProxy to Debug
> 
>
> Key: YARN-4115
> URL: https://issues.apache.org/jira/browse/YARN-4115
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
>Priority: Minor
> Attachments: YARN-4115.001.patch
>
>
> We see log spam like: Aug 28, 1:57:52.441 PM INFO 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy 
> Opening proxy : :8041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731759#comment-14731759
 ] 

Hudson commented on YARN-4024:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2276 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2276/])
YARN-4024. YARN RM should avoid unnecessary resolving IP when NMs doing 
heartbeat. (Hong Zhiguo via wangda) (wangda: rev 
bcc85e3bab78bcacd430eac23141774465b96ef9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestNodesListManager.java


> YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
> --
>
> Key: YARN-4024
> URL: https://issues.apache.org/jira/browse/YARN-4024
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Hong Zhiguo
> Fix For: 2.8.0
>
> Attachments: YARN-4024-draft-v2.patch, YARN-4024-draft-v3.patch, 
> YARN-4024-draft.patch, YARN-4024-v4.patch, YARN-4024-v5.patch, 
> YARN-4024-v6.patch, YARN-4024-v7.patch
>
>
> Currently, the YARN RM NodesListManager will resolve the IP address every time a 
> node sends a heartbeat. When the DNS server becomes slow, NM heartbeats will be 
> blocked and cannot make progress.
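
A hedged sketch of the caching idea (not the actual NodesListManager change): 
resolve a hostname at most once per TTL instead of on every heartbeat.
{code}
// Sketch only: cache hostname -> IP for a configurable TTL so a slow DNS
// server is consulted at most once per interval, not on every NM heartbeat.
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;

public class CachedResolverSketch {
  private static final class Entry {
    final String ip;
    final long resolvedAtMs;
    Entry(String ip, long resolvedAtMs) {
      this.ip = ip;
      this.resolvedAtMs = resolvedAtMs;
    }
  }

  private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
  private final long ttlMs;

  public CachedResolverSketch(long ttlMs) { this.ttlMs = ttlMs; }

  public String resolve(String hostName) throws UnknownHostException {
    long now = System.currentTimeMillis();
    Entry e = cache.get(hostName);
    if (e != null && now - e.resolvedAtMs < ttlMs) {
      return e.ip;                                   // fresh enough, skip DNS
    }
    String ip = InetAddress.getByName(hostName).getHostAddress();
    cache.put(hostName, new Entry(ip, now));
    return ip;
  }
}
{code}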



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731135#comment-14731135
 ] 

Junping Du commented on YARN-4087:
--

Hi [~jianhe], thanks for the patch! Can you confirm the test failure is not 
related to your patch?

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in a 
> production environment.
> 2. Fix the state-store to also notify the app/attempt if a state-store error is 
> ignored, so that the app/attempt is not stuck in a *_SAVING state.
> 3. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running; YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4059) Preemption should delay assignments back to the preempted queue

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731176#comment-14731176
 ] 

Wangda Tan commented on YARN-4059:
--

[~jlowe], Thanks again!

I think they're all very good points!

Maybe it's time to consider how to do global scheduling. It may not be 
completely global scheduling, but we can make some "adjustments" using a global 
scheduling mechanism.

Will think about it and post more comments.

> Preemption should delay assignments back to the preempted queue
> ---
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster to heartbeat 
> in, multiplied by the number of containers that could run on a single node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731183#comment-14731183
 ] 

Wangda Tan commented on YARN-4108:
--

[~sunilg],

bq. Or is there any other advantage of making to-be-preempted containers from 
allocation logic such as user-limit
The biggest benefit is that we will automatically check which containers to 
preempt when we want to allocate some other containers. This is quite important 
to me: with this behavior, all preempted containers are confirmed to be usable 
by less-satisfied applications. If we don't do this in the scheduler allocation 
logic, we may end up copying some of this logic into the preemption policy, OR 
doing something like a "dry run" from outside, which is not as straightforward 
to me as doing it directly in the allocation logic.

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross user preemption (YARN-2113), 
> cross application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731501#comment-14731501
 ] 

Jason Lowe commented on YARN-2410:
--

Thanks for the patch, Kuhu!

Do we really need getShuffle/setShuffle and a new shuffle field?  Much simpler 
to access pipelineFact.SHUFFLE directly or add a getShuffle method to 
HttpPipelineFactory and use that.  We shouldn't be redundantly tracking shuffle 
in the ShuffleHandler class.

Rather than catching and swallowing the exception from metrics.operationComplete 
(which shouldn't happen), can't we just let it propagate up?  I would think 
it would eventually trigger exceptionCaught, which should do the 
right thing like any other exception on the channel.

The ChannelHandlerContext has the corresponding channel, so do we really need 
SendMapOutputParams to store it separately?

SendMapOutputParams should be treated like an immutable object.  There's no 
need to set any of its contents after it is created, so we should just remove 
all the set methods.  Same applies to ReduceContext.  Actually I'm not really 
sure why SendMapOutputParams exists separate from ReduceContext.  There should 
be a one-to-one relationship there.  I think we could just promote all of the 
members of SendMapOutputParams into ReduceContext and cut down on some of the 
excess boilerplate.

ReduceContext has an unused Configuration variable and getConf is never called.

Seems like we could associate the ReduceContext with the ReduceMapFileCount 
listener object directly when we construct it.  That way we don't have to fish 
it out of the channel attachment and the code plays nicer with other things 
that might want to use the channel attachment for something.

It looks like we can now send an INTERNAL_SERVER_ERROR followed by a NOT_FOUND 
error since sendMap will return null when it sends an internal error.  Maybe 
sendMap should take care of sending all the appropriate errors directly.

Why was reduceContext added as a TestShuffleHandler instance variable?  It's 
specific to the new test.

The test would be a bit more readable with some factoring out of some code to 
utility methods, e.g.: createMockChannel method which takes care of mocking up 
all the stuff needed to mock a channel.  Also the overridden ShuffleHandler is 
enough code that it should just be a separate class within the test rather than 
inline in the method.

Nit: The comment for mapreduce.shuffle.max.send.map.count should mention 
simultaneous or concurrent otherwise it implies it will only send that many 
outputs total.  Also it may be more clear if the property were named something 
like mapreduce.shuffle.max.session-open-files or something similar, although 
I'm not super excited about that name either.

Nit: some (all?) of the returns in ReduceMapFileCount.operationComplete would 
be easier to follow using an {{else}} clause, e.g.:
{code}
  if (waitCount == 0) {
metrics.operationComplete(future);
future.getChannel().close();
  } else {
shuffle.sendMap(rc.getSendMapOutputParams().getCtx(),
rc.getSendMapOutputParams().getInfoMap());
  }
{code}

Nit: whitespace between field definitions and internal class definitions and 
also between method definitions would help readability.

Nit: Per the coding conventions class and instance variables should appear 
before constructors which in turn appear before methods

Nit: variables should be declared when they are initialized if possible, e.g.: 
nextId and mapId in sendMap.

Nit: Please organize the imports in the test, there's a mix of static and 
non-static imports.
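
For readers following the review, a rough sketch of the limiting idea behind 
the mapreduce.shuffle.max.send.map.count discussion above; names and structure 
are illustrative, not the actual ShuffleHandler patch:
{code}
// Illustrative sketch of bounding concurrently open map outputs per reducer
// request: send up to 'max' outputs, then let each completion kick off the
// next one. Types here are hypothetical stand-ins for the real patch.
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedSendSketch {
  interface OutputSender {
    // sends one map output asynchronously and invokes onDone when finished
    void send(String mapId, Runnable onDone);
  }

  static void sendAll(List<String> mapIds, OutputSender sender, int max) {
    AtomicInteger nextIndex = new AtomicInteger(0);
    int initial = Math.min(max, mapIds.size());
    for (int i = 0; i < initial; i++) {
      sendNext(mapIds, sender, nextIndex);        // prime up to 'max' sends
    }
  }

  private static void sendNext(List<String> mapIds, OutputSender sender,
      AtomicInteger nextIndex) {
    int idx = nextIndex.getAndIncrement();
    if (idx >= mapIds.size()) {
      return;                                     // nothing left to send
    }
    // each completion triggers the next map output, keeping at most 'max' open
    sender.send(mapIds.get(idx), () -> sendNext(mapIds, sender, nextIndex));
  }
}
{code}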


> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors.

[jira] [Commented] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731505#comment-14731505
 ] 

Wangda Tan commented on YARN-1651:
--

Hi Meng,
Thanks for comments:

bq. We probably need to address this properly in the JIRA that tracks container 
resource increase roll back. (I think Container resource increase expiration 
should be tracked as a Scheduler Event, e.g., 
SchedulerEventType.CONTAINER_INCREASE_EXPIRE)
I think we can do that either by adding CONTAINER_INCREASE_EXPIRE or by directly 
calling decrease on the scheduler from RMContainer. I'm not sure which one is 
better; let's figure it out when doing it.

bq. It seems that this function throws exception whenever there is a duplicated 
id. Shall we handle the case where if there are both increase and decrease 
requests for the same id, we can ignore the increase but keep the decrease 
request?
I have thought about this before. I think it's hard to decide, when two requests 
with the same containerId but different target resources exist, which one should 
be chosen. And it's not an expected allocate request either, so I prefer to 
reject both.
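
A minimal sketch of that sanity check, with hypothetical types (plain 
container-id strings) rather than the actual scheduler request classes:
{code}
// Sketch: reject the whole allocate call if the same container id appears in
// more than one increase/decrease request. Types and names are illustrative.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateContainerIdCheckSketch {
  static void checkNoDuplicateContainerIds(List<String> increaseIds,
      List<String> decreaseIds) {
    Set<String> seen = new HashSet<String>();
    for (List<String> ids : Arrays.asList(increaseIds, decreaseIds)) {
      for (String id : ids) {
        if (!seen.add(id)) {
          throw new IllegalArgumentException(
              "Duplicated container id in increase/decrease requests: " + id);
        }
      }
    }
  }
}
{code}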

bq. Will it be better to combine all sanity checks into one function
Done

bq. For validateIncreaseDecreaseRequest, we don't check minimum allocation now, 
is it intended?
Yes, it's intended: because we will normalize it later, there is no need to 
throw an exception.

bq. This function is used by both pullNewlyIncreasedContainers(), and 
pullNewlyDecreasedContainers(). Why do we need to call 
updateContainerAndNMToken for decreased containers? It also unnecessarily send 
a ACQUIRE_UPDATED_CONTAINER event for every decreased container?
This mainly keeps the logic correct and consistent. We don't use the container 
token for now, but I think we should make sure it's updated before returning to 
the app. Unless we have a performance issue doing this, I prefer to keep the 
existing behavior.

bq. We should probably check null before adding updatedContainer?
There seems to be no need for a null check here; when would it become null? I 
prefer to keep it as-is, and it will throw an NPE if any fatal issue happens.

bq. RMNodeImpl.pullNewlyIncreasedContainers()
Implemented.

bq. AppSchedulingInfo#notifyContainerStopped not being used.
Removed, we handled this in LeafQueue#completedContainer.

bq. I think the following is a typo, should be if (cannotAllocateAnything), 
right?
Correct, fixed.

bq. Not sure if I understand the logic. Why only break when 
node.getReservedContainer() == null? Shouldn't we break out of the loop here no 
matter what?
Nice catch! I fixed this; we should break when we have allocated or reserved 
anything.

bq. I think earlier in the allocateIncreaseRequest() function, if a new 
increase is successfully allocated, 
application.increaseContainer(increaseRequest) will have removed the increase 
request already?
Another nice catch! Yes, we should have already handled it in 
application.increaseContainer.

bq. RMContainerImpl...Shouldn't it be changed to...
Yes, it should do as you said, updated.

bq. Also, is container.containerIncreased really needed?
It's needed because we don't know whether an acquired event is for an increased 
container or a decreased container. I added isIncreaseContainer to the acquire 
event (now RMContainerUpdatesAcquiredEvent) and removed 
RMContainerImpl.containerIncreased.

Typos: fixed.

> CapacityScheduler side changes to support increase/decrease container 
> resource.
> ---
>
> Key: YARN-1651
> URL: https://issues.apache.org/jira/browse/YARN-1651
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-1651-1.YARN-1197.patch, YARN-1651-2.YARN-1197.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4081) Add support for multiple resource types in the Resource class

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731561#comment-14731561
 ] 

Wangda Tan commented on YARN-4081:
--

Thanks for update, [~vvasudev].

A few comments.
Regarding Resources:

1)
For NONE/UNBOUNDED, maybe we don't need to initialize a real map. For 
example, when querying the resource information of NONE, it should return 
ResourceInformation.value=0 for any given resource type. And the returned 
getResources() map doesn't need to be a real map either; it could be a fake map 
that always returns a ResourceInformation for any given resource type.

2)
Is it possible to merge the implementations of NONE/UNBOUNDED?

Also, I found the ResourceRequestInfo changes aren't related to this patch; 
would they be better moved to one of the follow-up patches (such as supporting 
multiple resource types in the REST API)?
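
A rough sketch of points 1) and 2): NONE and UNBOUNDED as read-only views 
sharing one implementation; the ResInfo type below is illustrative, not the 
patch's ResourceInformation.
{code}
// Sketch: NONE/UNBOUNDED as views that answer any resource type with a
// constant value (0 or Long.MAX_VALUE) instead of materializing a real map.
public class ConstantResourceSketch {
  static final class ResInfo {
    final String name;
    final long value;
    ResInfo(String name, long value) { this.name = name; this.value = value; }
  }

  interface ResourceView {
    ResInfo getResourceInformation(String resourceType);
  }

  static ResourceView constantView(final long value) {
    return new ResourceView() {
      @Override public ResInfo getResourceInformation(String resourceType) {
        return new ResInfo(resourceType, value);   // same answer for any type
      }
    };
  }

  // one parameterized implementation covers both constants (point 2)
  static final ResourceView NONE = constantView(0L);
  static final ResourceView UNBOUNDED = constantView(Long.MAX_VALUE);
}
{code}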

> Add support for multiple resource types in the Resource class
> -
>
> Key: YARN-4081
> URL: https://issues.apache.org/jira/browse/YARN-4081
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4081-YARN-3926.001.patch, 
> YARN-4081-YARN-3926.002.patch, YARN-4081-YARN-3926.003.patch, 
> YARN-4081-YARN-3926.004.patch, YARN-4081-YARN-3926.005.patch
>
>
> For adding support for multiple resource types, we need to add support for 
> this in the Resource class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-04 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731650#comment-14731650
 ] 

Sangjin Lee commented on YARN-3901:
---

[~gtCarrera9], I think we're not far off actually. I've been testing using the 
v.3 patch, and with a few more changes that Vrushali will be doing, it should 
be pretty close to a reasonably complete state. At this point, it would be more 
efficient to bring this to completion. My 2 cents.

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731604#comment-14731604
 ] 

Hudson commented on YARN-4024:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #349 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/349/])
YARN-4024. YARN RM should avoid unnecessary resolving IP when NMs doing 
heartbeat. (Hong Zhiguo via wangda) (wangda: rev 
bcc85e3bab78bcacd430eac23141774465b96ef9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestNodesListManager.java


> YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
> --
>
> Key: YARN-4024
> URL: https://issues.apache.org/jira/browse/YARN-4024
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Hong Zhiguo
> Fix For: 2.8.0
>
> Attachments: YARN-4024-draft-v2.patch, YARN-4024-draft-v3.patch, 
> YARN-4024-draft.patch, YARN-4024-v4.patch, YARN-4024-v5.patch, 
> YARN-4024-v6.patch, YARN-4024-v7.patch
>
>
> Currently, the YARN RM NodesListManager will resolve the IP address every time a 
> node sends a heartbeat. When the DNS server becomes slow, NM heartbeats will be 
> blocked and cannot make progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service

2015-09-04 Thread Kishore Chaliparambil (JIRA)
Kishore Chaliparambil created YARN-4117:
---

 Summary: End to end unit test with mini YARN cluster for AMRMProxy 
Service
 Key: YARN-4117
 URL: https://issues.apache.org/jira/browse/YARN-4117
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Kishore Chaliparambil
Assignee: Subru Krishnan


Today many apps like Distributed Shell, REEF, etc rely on the fact that the 
HADOOP_CONF_DIR of the NM is on the classpath to discover the scheduler 
address. This JIRA proposes the addition of an explicit discovery mechanism for 
the scheduler address



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service

2015-09-04 Thread Kishore Chaliparambil (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishore Chaliparambil updated YARN-4117:

Description: (was: Today many apps like Distributed Shell, REEF, etc 
rely on the fact that the HADOOP_CONF_DIR of the NM is on the classpath to 
discover the scheduler address. This JIRA proposes the addition of an explicit 
discovery mechanism for the scheduler address)

> End to end unit test with mini YARN cluster for AMRMProxy Service
> -
>
> Key: YARN-4117
> URL: https://issues.apache.org/jira/browse/YARN-4117
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: Kishore Chaliparambil
>Assignee: Kishore Chaliparambil
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731620#comment-14731620
 ] 

Wangda Tan commented on YARN-4106:
--

[~bibinchundatt],
Thanks for the update.
Some comments: 
- Not sure if I completely understand the logic. {{updateNodeLabelsFromConfig}} 
of {{ConfigurationNodeLabelsProvider}} is invoked only when the TimerTask is 
scheduled, and the TimerTask of AbstractNodeLabelsProvider starts only after a 
delay of {{intervalTime}}. Does this mean we cannot get labels when initializing 
ConfigurationNodeLabelsProvider? If so, I think we need to call 
updateNodeLabelsFromConfig in serviceInit of ConfigurationNodeLabelsProvider, as 
in the sketch after this list. (Another choice is to change the delay from 
intervalTime to 0, but it runs in another thread, so the behavior is 
nondeterministic. To keep it simple, I suggest adding it in serviceInit.)
- Add a test for the new behavior of this JIRA?
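
The sketch referenced above, with illustrative signatures rather than the 
actual provider code (the real serviceInit takes a Configuration):
{code}
// Sketch: fetch labels once during serviceInit so the provider already has
// labels before the first timer tick; the periodic refresh stays unchanged.
// Method names mirror the discussion but the shapes are illustrative.
public abstract class NodeLabelsProviderInitSketch {
  private long intervalMs;

  protected void serviceInit() throws Exception {
    updateNodeLabelsFromConfig();          // initial fetch, no intervalMs delay
    scheduleTimer(intervalMs, intervalMs); // periodic refresh as before
  }

  protected abstract void updateNodeLabelsFromConfig() throws Exception;

  protected abstract void scheduleTimer(long delayMs, long periodMs);
}
{code}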

> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM 
> 
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 
> 0003-YARN-4106.patch, 0004-YARN-4106.patch
>
>
> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> Steps to reproduce
> ===
> # Configure nodelabel in distributed mode
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM and the NM
> # Once NM registration is done, add node labels in the RM
> Node labels are not getting updated on the RM side. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service

2015-09-04 Thread Kishore Chaliparambil (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishore Chaliparambil updated YARN-4117:

Description: YARN-2884 introduces a proxy between the AM and the RM. This JIRA 
proposes an end-to-end unit test for the AMRMProxy service using a mini YARN 
cluster. The test will validate register, allocate, and finish-application 
calls, as well as token renewal.
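
A skeletal sketch of what such a test might look like, assuming MiniYARNCluster's 
(name, numNodeManagers, numLocalDirs, numLogDirs) constructor; the 
AMRMProxy-specific configuration key and the assertions are left as placeholders.
{code}
// Skeleton only: starts and stops a mini YARN cluster; the AMRMProxy-specific
// setup and verification steps are placeholders for the real test.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;
import org.junit.Test;

public class TestAMRMProxyEndToEndSketch {

  @Test
  public void testRegisterAllocateFinish() throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // conf.setBoolean(<amrmproxy enable key>, true);  // placeholder, key TBD
    MiniYARNCluster cluster = new MiniYARNCluster("amrmproxy-e2e", 1, 1, 1);
    try {
      cluster.init(conf);
      cluster.start();
      // 1. submit an application
      // 2. register the AM through the AMRMProxy endpoint
      // 3. allocate containers and finish the application
      // 4. verify token renewal happened through the proxy
    } finally {
      cluster.stop();
    }
  }
}
{code}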

> End to end unit test with mini YARN cluster for AMRMProxy Service
> -
>
> Key: YARN-4117
> URL: https://issues.apache.org/jira/browse/YARN-4117
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: Kishore Chaliparambil
>Assignee: Kishore Chaliparambil
>
> YARN-2884 introduces a proxy between the AM and the RM. This JIRA proposes an 
> end-to-end unit test for the AMRMProxy service using a mini YARN cluster. The 
> test will validate register, allocate, and finish-application calls, as well 
> as token renewal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service

2015-09-04 Thread Kishore Chaliparambil (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishore Chaliparambil reassigned YARN-4117:
---

Assignee: Kishore Chaliparambil  (was: Subru Krishnan)

> End to end unit test with mini YARN cluster for AMRMProxy Service
> -
>
> Key: YARN-4117
> URL: https://issues.apache.org/jira/browse/YARN-4117
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: Kishore Chaliparambil
>Assignee: Kishore Chaliparambil
>
> Today many apps like Distributed Shell, REEF, etc rely on the fact that the 
> HADOOP_CONF_DIR of the NM is on the classpath to discover the scheduler 
> address. This JIRA proposes the addition of an explicit discovery mechanism 
> for the scheduler address



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4083) Add a discovery mechanism for the scheduler address

2015-09-04 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731631#comment-14731631
 ] 

Allen Wittenauer commented on YARN-4083:


How does this work when the container is actually a Linux container and not a 
fake yarn-level container?

> Add a discovery mechanism for the scheduler address
> ---
>
> Key: YARN-4083
> URL: https://issues.apache.org/jira/browse/YARN-4083
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>
> Today many apps like Distributed Shell, REEF, etc rely on the fact that the 
> HADOOP_CONF_DIR of the NM is on the classpath to discover the scheduler 
> address. This JIRA proposes the addition of an explicit discovery mechanism 
> for the scheduler address



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-04 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731639#comment-14731639
 ] 

Li Lu commented on YARN-3901:
-

Hi [~vrushalic], since part of this JIRA is blocking the web UI work, and we 
may still need some time to reach a stable state here, is it possible to 
separate out the flow activity part? That way we could unblock the flow activity 
table queries, as well as the web services. However, if it's too hard to split 
the JIRA, we can proceed with all tasks here. Thoughts? 

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-04 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731648#comment-14731648
 ] 

Sangjin Lee commented on YARN-3901:
---

(1) FlowScanner cell order issue
When I added the reader code and started testing it against the flow run unit 
tests, I found that reading the END_TIME column on the flow run table didn't 
work. The flow run column read on END_TIME is essentially 
{{Result.getValue()}}. However, HBase was failing to find the END_TIME column 
although it clearly existed in the result. It was basically failing at the 
binary search:

{code}
  public Cell getColumnLatestCell(byte [] family, byte [] qualifier) {
Cell [] kvs = rawCells(); // side effect possibly.
if (kvs == null || kvs.length == 0) {
  return null;
}
int pos = binarySearch(kvs, family, qualifier);
if (pos == -1) {
  return null;
}
if (CellUtil.matchingColumn(kvs[pos], family, qualifier)) {
  return kvs[pos];
}
return null;
  }
{code}

The binary search was failing because the cells in the result were stored in 
the wrong order.

The cells were stored in the wrong order because they were being added out of 
order by our co-processor (in FlowScanner.nextInternal()).

{code}
if (runningSum.size() > 0) {
  for (Map.Entry newCellSum : runningSum.entrySet()) {
    // create a new cell that represents the flow metric
    Cell c = newCell(metricCell.get(newCellSum.getKey()),
        newCellSum.getValue());
    cells.add(c);
  }
}
if (currentMinCell != null) {
  cells.add(currentMinCell);
}
if (currentMaxCell != null) {
  cells.add(currentMaxCell);
}
{code}

And this order is preserved all the way to the reader. The fix is to add the 
cells in the right order via KeyValueComparator. The fix is currently included 
in my patch on YARN-4074, but it will be addressed in this JIRA.
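
A hedged sketch of the ordering fix, with a generic Cell placeholder standing in 
for the HBase types; the real change sorts with KeyValueComparator as noted 
above.
{code}
// Sketch: before handing cells back from the coprocessor, sort them with the
// comparator HBase expects so the reader's binary search can find columns
// such as END_TIME. 'Cell' and 'comparator' stand in for the real HBase types.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortedCellsSketch<Cell> {
  private final Comparator<Cell> comparator;   // e.g. HBase's KeyValueComparator

  public SortedCellsSketch(Comparator<Cell> comparator) {
    this.comparator = comparator;
  }

  List<Cell> emit(List<Cell> summedCells, Cell minCell, Cell maxCell) {
    List<Cell> cells = new ArrayList<Cell>(summedCells);
    if (minCell != null) {
      cells.add(minCell);
    }
    if (maxCell != null) {
      cells.add(maxCell);
    }
    Collections.sort(cells, comparator);       // restore key order before returning
    return cells;
  }
}
{code}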

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-04 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731651#comment-14731651
 ] 

Li Lu commented on YARN-3901:
-

[~sjlee0] sure, then let's keep all the work here! 

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables

2015-09-04 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731654#comment-14731654
 ] 

Sangjin Lee commented on YARN-3901:
---

(2) colliding puts in the co-processor
We found another issue with the write side of things via a unit test. It fails 
only occasionally (and more often in some environments than others). It happens 
when 2 puts on the same column arrive very close together (namely within 1 
millisecond). The code in question is {{FlowRunCoprocessor.prePut()}}:

{code}
  for (Map.Entry<byte[], List<Cell>> entry : put.getFamilyCellMap()
      .entrySet()) {
    List<Cell> newCells = new ArrayList<>(entry.getValue().size());
for (Cell cell : entry.getValue()) {
  // for each cell in the put add the tags
  // Assumption is that all the cells in
  // one put are the same operation
  newCells.add(CellUtil.createCell(CellUtil.cloneRow(cell),
  CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
  cell.getTimestamp(), KeyValue.Type.Put,
  CellUtil.cloneValue(cell), Tag.fromList(tags)));
}
newFamilyMap.put(entry.getKey(), newCells);
  } // for each entry
{code}

If 2 cells for example carry the same timestamp, then the later one ends up 
overwriting the previous one, effectively losing one put. This was triggered by 
one of the tests in {{TestHBaseTimelineWriterImplFlowRun.java}}.

It's an edge case which is rather unlikely to happen normally, but is an issue 
nonetheless. And how to solve this problem is pretty complicated. We'll soon 
post possible approaches for handling this.
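
To make the collision concrete, a tiny standalone illustration (plain Java, not 
HBase) of why two writes sharing the same cell coordinates and timestamp 
collapse into one value:
{code}
// Illustration only: if the cell key is (row, family, qualifier, timestamp),
// two writes arriving within the same millisecond share a key, so the second
// silently replaces the first, which is the "lost put" described above.
import java.util.HashMap;
import java.util.Map;

public class SameTimestampCollisionSketch {
  public static void main(String[] args) {
    Map<String, Long> cells = new HashMap<>();
    long ts = System.currentTimeMillis();
    String key = "row1/metrics/someMetric/" + ts;   // hypothetical column

    cells.put(key, 100L);   // first put
    cells.put(key, 250L);   // second put in the same millisecond overwrites it

    System.out.println(cells.get(key));   // prints 250; the 100 is lost
  }
}
{code}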

But at any rate, I suspect we could isolate this issue into a separate JIRA, 
and tackle it post-UI-POC. I'd appreciate your feedback.

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Vrushali C
>Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values. - 
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731660#comment-14731660
 ] 

Hudson commented on YARN-4024:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #355 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/355/])
YARN-4024. YARN RM should avoid unnecessary resolving IP when NMs doing 
heartbeat. (Hong Zhiguo via wangda) (wangda: rev 
bcc85e3bab78bcacd430eac23141774465b96ef9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestNodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java


> YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
> --
>
> Key: YARN-4024
> URL: https://issues.apache.org/jira/browse/YARN-4024
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Hong Zhiguo
> Fix For: 2.8.0
>
> Attachments: YARN-4024-draft-v2.patch, YARN-4024-draft-v3.patch, 
> YARN-4024-draft.patch, YARN-4024-v4.patch, YARN-4024-v5.patch, 
> YARN-4024-v6.patch, YARN-4024-v7.patch
>
>
> Currently, the YARN RM NodesListManager resolves the IP address every time a 
> node heartbeats. When the DNS server becomes slow, NM heartbeats get blocked 
> and cannot make progress.
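
For context, the general mitigation is to avoid a blocking DNS lookup on every heartbeat, for example by caching resolutions for a short period. The sketch below only illustrates that idea; it is not the actual YARN-4024 patch, and the class name and TTL are assumptions:
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: cache host -> IP for a fixed period so slow DNS does not
// block every NM heartbeat.
public class CachingResolverSketch {
  private static final long TTL_MS = 2 * 60 * 1000;

  private static final class Entry {
    final String ip;
    final long resolvedAt;
    Entry(String ip, long resolvedAt) { this.ip = ip; this.resolvedAt = resolvedAt; }
  }

  private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

  public String resolve(String hostName) throws UnknownHostException {
    long now = System.currentTimeMillis();
    Entry e = cache.get(hostName);
    if (e != null && now - e.resolvedAt < TTL_MS) {
      return e.ip;                      // fresh enough, skip DNS
    }
    String ip = InetAddress.getByName(hostName).getHostAddress();
    cache.put(hostName, new Entry(ip, now));
    return ip;
  }
}
{code}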



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Attachment: YARN-4087.6.patch

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch, YARN-4087.6.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> a production environment.
> 2. Fix the state-store to also notify the app/attempt if a state-store error 
> is ignored, so that the app/attempt is not stuck at a *_SAVING state.
> 3. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.
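
As a side note on point 1, a component would typically read the flag along these lines; this is only a sketch, assuming the yarn.fail-fast property behind YarnConfiguration.YARN_FAIL_FAST (see YarnConfiguration for the authoritative name and default):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: reads the fail-fast flag; the RM-specific override and the real
// default live in YarnConfiguration, not in this fallback value.
public final class FailFastSketch {
  static boolean shouldFailFast(Configuration conf) {
    return conf.getBoolean(YarnConfiguration.YARN_FAIL_FAST, false);
  }
}
{code}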



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Description: 
Several fixes:
1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in a 
production environment.
2. If HA is enabled and there's any state-store error, after the retry 
operation fails, we always transition the RM to standby state. Otherwise, we may 
see two active RMs running. YARN-4107 is one example.

  was:
Several fixes:
1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in a 
production environment.
2. Fix the state-store to also notify the app/attempt if a state-store error is 
ignored, so that the app/attempt is not stuck at a *_SAVING state.
3. If HA is enabled and there's any state-store error, after the retry 
operation fails, we always transition the RM to standby state. Otherwise, we may 
see two active RMs running. YARN-4107 is one example.


> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch, YARN-4087.6.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> a production environment.
> 2. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Attachment: YARN-4087.6.patch

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch, YARN-4087.6.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> a production environment.
> 2. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Attachment: (was: YARN-4087.6.patch)

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch, YARN-4087.6.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> a production environment.
> 2. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4118) Newly submitted app maybe stuck at saving state if store operation failure is ignored in ZKRMStateStore

2015-09-04 Thread Jian He (JIRA)
Jian He created YARN-4118:
-

 Summary: Newly submitted app maybe stuck at saving state if store 
operation failure is ignored in ZKRMStateStore
 Key: YARN-4118
 URL: https://issues.apache.org/jira/browse/YARN-4118
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He


In YARN-2019, we decided to ignore the failure and not fail the RM when 
ZK is unavailable.
However, this leaves a newly submitted app stuck at the saving state.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731681#comment-14731681
 ] 

Xuan Gong commented on YARN-4087:
-

+1 for the latest patch.
[~djp], do you have any other comments?

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch, YARN-4087.6.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> a production environment.
> 2. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2884) Proxying all AM-RM communications

2015-09-04 Thread Kishore Chaliparambil (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731693#comment-14731693
 ] 

Kishore Chaliparambil commented on YARN-2884:
-

The test failure is not related to the patch.

Also, I could not address a couple of issues:
 1) Checkstyle: YarnConfiguration.java - File length exceeds 2000 lines.
 2) Checkstyle: Missing package-info.java file

> Proxying all AM-RM communications
> -
>
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Carlo Curino
>Assignee: Kishore Chaliparambil
> Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, 
> YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V13.patch, 
> YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, 
> YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch, 
> YARN-2884-V8.patch, YARN-2884-V9.patch
>
>
> We introduce the notion of an RMProxy, running on each node (or once per 
> rack). Upon start, the AM is forced (via tokens and configuration) to direct 
> all its requests to a new service running on the NM that provides a proxy to 
> the central RM. 
> This gives us a place to:
> 1) perform distributed scheduling decisions
> 2) throttle misbehaving AMs
> 3) mask the access to a federation of RMs
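
For readers who want a concrete picture of the proxying idea, here is a purely illustrative pass-through sketch (the class name is hypothetical and this is not the AMRMProxy code from the attached patches): an NM-local service implements the same AM-facing protocol and delegates to the real RM, which is the natural place to throttle or route requests.
{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.ApplicationMasterProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterResponse;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical pass-through proxy: the AM talks to this NM-local service,
// which can throttle or route before delegating to the real RM client.
public class AmRmProxySketch implements ApplicationMasterProtocol {
  private final ApplicationMasterProtocol realRm;   // client to the central RM

  public AmRmProxySketch(ApplicationMasterProtocol realRm) {
    this.realRm = realRm;
  }

  @Override
  public RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) throws YarnException, IOException {
    return realRm.registerApplicationMaster(request);
  }

  @Override
  public AllocateResponse allocate(AllocateRequest request)
      throws YarnException, IOException {
    // A real interceptor could throttle here or make local scheduling decisions.
    return realRm.allocate(request);
  }

  @Override
  public FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request) throws YarnException, IOException {
    return realRm.finishApplicationMaster(request);
  }
}
{code}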



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.

2015-09-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731696#comment-14731696
 ] 

Hadoop QA commented on YARN-1651:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  17m 31s | Findbugs (version ) appears to 
be broken on YARN-1197. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 20 new or modified test files. |
| {color:red}-1{color} | javac |   7m 57s | The applied patch generated  1  
additional warning messages. |
| {color:green}+1{color} | javadoc |  10m  5s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 52s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |  30m  2s | The patch has 162  line(s) 
that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   5m 25s | The patch appears to introduce 7 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   9m 16s | Tests passed in 
hadoop-mapreduce-client-app. |
| {color:green}+1{color} | tools/hadoop tests |   0m 51s | Tests passed in 
hadoop-sls. |
| {color:green}+1{color} | yarn tests |   6m 59s | Tests passed in 
hadoop-yarn-client. |
| {color:green}+1{color} | yarn tests |   0m 24s | Tests passed in 
hadoop-yarn-server-common. |
| {color:red}-1{color} | yarn tests |  58m 15s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 151m 13s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-common |
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation |
|   | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754278/YARN-1651-3.YARN-1197.patch
 |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | YARN-1197 / f86eae1 |
| javac | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/diffJavacWarnings.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/whitespace.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-common.html
 |
| hadoop-mapreduce-client-app test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
 |
| hadoop-sls test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/testrun_hadoop-sls.txt
 |
| hadoop-yarn-client test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/testrun_hadoop-yarn-client.txt
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9013/console |


This message was automatically generated.

> CapacityScheduler side changes to support increase/decrease container 
> resource.
> ---
>
> Key: YARN-1651
> URL: https://issues.apache.org/jira/browse/YARN-1651
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-1651-1.YARN-1197.patch, 
> YARN-1651-2.YARN-1197.patch, YARN-1651-3.YARN-1197.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-04 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731697#comment-14731697
 ] 

Naganarasimha G R commented on YARN-4106:
-

Thanks for the comments, [~leftnoteasy].
bq. Does this mean we cannot get labels when initializing 
ConfigurationNodeLabelsProvider?
Actually, {{timerTask.run();}} in {{serviceStart}} is responsible for fetching 
the labels the very first time. The reason for this modification is twofold: 
first, if some class extends this in the future and forgets to set the labels 
during start, the labels would be missed, so it is safer to handle this in the 
AbstractProvider class itself; second, we cannot simply choose an 
{{intervalTime}} of 0, because there is a configuration under which the timer 
task need not run at all, in which case the timer itself is never created. 
Considering all this, he modified it as per the patch.
bq. Add a test for new behavior of this JIRA
For the second issue: I felt the changes were at the design level and the 
scenarios are the same. The only thing we have missed is to check whether the 
timer is triggered after the configured interval and the new labels are set; I 
think we can add one test case for this. Apart from this, were you referring to 
a test case for the first issue as well?
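
To illustrate the point about {{timerTask.run();}} in {{serviceStart}}, a rough sketch of the pattern (class and method names are hypothetical, not the exact patch) could look like the following: the labels are fetched once synchronously at start, and a timer is created only when a positive refresh interval is configured.
{code}
import java.util.Timer;
import java.util.TimerTask;
import org.apache.hadoop.service.AbstractService;

// Hypothetical sketch of the provider pattern discussed above (not the patch itself).
public abstract class LabelsProviderSketch extends AbstractService {
  private final long intervalMs;   // <= 0 means "do not refresh periodically"
  private Timer timer;

  protected LabelsProviderSketch(String name, long intervalMs) {
    super(name);
    this.intervalMs = intervalMs;
  }

  /** Subclasses supply the task that actually fetches and sets the labels. */
  protected abstract TimerTask createUpdateTask();

  @Override
  protected void serviceStart() throws Exception {
    TimerTask task = createUpdateTask();
    task.run();                               // initial fetch, even when no timer is created
    if (intervalMs > 0) {
      timer = new Timer("node-labels-refresh", true);
      timer.scheduleAtFixedRate(task, intervalMs, intervalMs);
    }
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    if (timer != null) {
      timer.cancel();
    }
    super.serviceStop();
  }
}
{code}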


> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM 
> 
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 
> 0003-YARN-4106.patch, 0004-YARN-4106.patch
>
>
> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> Steps to reproduce
> ===
> # Configure node labels in distributed mode:
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM and the NM
> # Once NM registration is done, add node labels in the RM
> Node labels are not getting updated on the RM side 
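
For reference, a programmatic equivalent of the configuration in the steps above might look like the sketch below; the full property name behind "provider = config" and the interval value are assumptions (the interval in the report above appears truncated):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch of the distributed node-label configuration from the repro steps.
// The provider key and the interval value here are illustrative assumptions.
public final class DistributedLabelConfSketch {
  static Configuration build() {
    Configuration conf = new YarnConfiguration();
    conf.set("yarn.node-labels.configuration-type", "distributed");
    conf.set("yarn.nodemanager.node-labels.provider", "config");
    conf.setLong("yarn.nodemanager.node-labels.provider.fetch-interval-ms", 120000L);
    return conf;
  }
}
{code}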



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731725#comment-14731725
 ] 

Hudson commented on YARN-4024:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #337 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/337/])
YARN-4024. YARN RM should avoid unnecessary resolving IP when NMs doing 
heartbeat. (Hong Zhiguo via wangda) (wangda: rev 
bcc85e3bab78bcacd430eac23141774465b96ef9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestNodesListManager.java
* hadoop-yarn-project/CHANGES.txt


> YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
> --
>
> Key: YARN-4024
> URL: https://issues.apache.org/jira/browse/YARN-4024
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Hong Zhiguo
> Fix For: 2.8.0
>
> Attachments: YARN-4024-draft-v2.patch, YARN-4024-draft-v3.patch, 
> YARN-4024-draft.patch, YARN-4024-v4.patch, YARN-4024-v5.patch, 
> YARN-4024-v6.patch, YARN-4024-v7.patch
>
>
> Currently, the YARN RM NodesListManager resolves the IP address every time a 
> node heartbeats. When the DNS server becomes slow, NM heartbeats get blocked 
> and cannot make progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-4118) Newly submitted app maybe stuck at saving state if store operation failure is ignored in ZKRMStateStore

2015-09-04 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-4118:
-

Assignee: Sunil G

> Newly submitted app maybe stuck at saving state if store operation failure is 
> ignored in ZKRMStateStore
> ---
>
> Key: YARN-4118
> URL: https://issues.apache.org/jira/browse/YARN-4118
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Sunil G
>
> In YARN-2019, we decided to ignore the failure and not fail the RM 
> when ZK is unavailable.
> However, this leaves a newly submitted app stuck at the saving state.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731754#comment-14731754
 ] 

Hudson commented on YARN-4024:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2298 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2298/])
YARN-4024. YARN RM should avoid unnecessary resolving IP when NMs doing 
heartbeat. (Hong Zhiguo via wangda) (wangda: rev 
bcc85e3bab78bcacd430eac23141774465b96ef9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestNodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java


> YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
> --
>
> Key: YARN-4024
> URL: https://issues.apache.org/jira/browse/YARN-4024
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Hong Zhiguo
> Fix For: 2.8.0
>
> Attachments: YARN-4024-draft-v2.patch, YARN-4024-draft-v3.patch, 
> YARN-4024-draft.patch, YARN-4024-v4.patch, YARN-4024-v5.patch, 
> YARN-4024-v6.patch, YARN-4024-v7.patch
>
>
> Currently, the YARN RM NodesListManager resolves the IP address every time a 
> node heartbeats. When the DNS server becomes slow, NM heartbeats get blocked 
> and cannot make progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731755#comment-14731755
 ] 

Hadoop QA commented on YARN-4087:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  19m 50s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 49s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  6s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   2m 50s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 39s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   2m  1s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |  54m 28s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | | 104m 41s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754308/YARN-4087.6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / bcc85e3 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9015/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9015/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9015/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9015/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9015/console |


This message was automatically generated.

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch, YARN-4087.6.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> a production environment.
> 2. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1651) CapacityScheduler side changes to support increase/decrease container resource.

2015-09-04 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-1651:
-
Attachment: YARN-1651-3.YARN-1197.patch

Attached ver.3 patch, also synchronized repository to latest trunk.

> CapacityScheduler side changes to support increase/decrease container 
> resource.
> ---
>
> Key: YARN-1651
> URL: https://issues.apache.org/jira/browse/YARN-1651
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager, scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-1651-1.YARN-1197.patch, 
> YARN-1651-2.YARN-1197.patch, YARN-1651-3.YARN-1197.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2884) Proxying all AM-RM communications

2015-09-04 Thread Kishore Chaliparambil (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishore Chaliparambil updated YARN-2884:

Attachment: YARN-2884-V13.patch

Fixed the one FindBugs issue that was found in patch 12.

> Proxying all AM-RM communications
> -
>
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Carlo Curino
>Assignee: Kishore Chaliparambil
> Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, 
> YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V13.patch, 
> YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, 
> YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch, 
> YARN-2884-V8.patch, YARN-2884-V9.patch
>
>
> We introduce the notion of an RMProxy, running on each node (or once per 
> rack). Upon start, the AM is forced (via tokens and configuration) to direct 
> all its requests to a new service running on the NM that provides a proxy to 
> the central RM. 
> This gives us a place to:
> 1) perform distributed scheduling decisions
> 2) throttle misbehaving AMs
> 3) mask the access to a federation of RMs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731579#comment-14731579
 ] 

Hudson commented on YARN-4024:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8407 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8407/])
YARN-4024. YARN RM should avoid unnecessary resolving IP when NMs doing 
heartbeat. (Hong Zhiguo via wangda) (wangda: rev 
bcc85e3bab78bcacd430eac23141774465b96ef9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestNodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


> YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
> --
>
> Key: YARN-4024
> URL: https://issues.apache.org/jira/browse/YARN-4024
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Hong Zhiguo
> Fix For: 2.8.0
>
> Attachments: YARN-4024-draft-v2.patch, YARN-4024-draft-v3.patch, 
> YARN-4024-draft.patch, YARN-4024-v4.patch, YARN-4024-v5.patch, 
> YARN-4024-v6.patch, YARN-4024-v7.patch
>
>
> Currently, the YARN RM NodesListManager resolves the IP address every time a 
> node heartbeats. When the DNS server becomes slow, NM heartbeats get blocked 
> and cannot make progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-04 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated YARN-4074:
--
Attachment: YARN-4074-YARN-2928.POC.003.patch

POC v.3 patch posted.

Key changes include
- switched from Get.setMaxResultSize() to PageFilter (more on that below)
- major refactoring of HBaseTimelineReaderImpl
-- introduced TimelineEntityReader and the hierarchy of classes to isolate 
proper reading per type
- added unit tests to test HBaseTimelineReaderImpl for flow activity and flow 
runs
- fixed an issue with FlowScanner where the cells were returned in the wrong 
order so it was breaking Column.readResult()
- made *RowKey classes real object classes, and added the parseRowKey method 
that returns an instance of the RowKey
- fixed the order of the add and pollLast
- renamed FlowEntity to FlowRunEntity
- added the compareTo() method for FlowActivityEntity
- passed the type into the FlowActivityEntity constructor
- set configs for FlowActivityEntity and FlowRunEntity to null
- improved the way we get string values from info for FlowActivityEntity and 
FlowRunEntity
- added getNumberOfRuns() to FlowActivityEntity

It is actually pretty close to being ready, but since YARN-3901 is still 
outstanding, I'm not making it an official patch yet.

As for the PageFilter issue, I concluded setMaxResultSize() is not the right 
API to use to limit the number of rows. I believe the PageFilter is the right 
thing to use. I also added the counting logic to get the right number of 
records even if the result iterator advances.
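
For context, a minimal sketch of the PageFilter approach described above, assuming the HBase 1.x client API; the table handle and limit are illustrative, and this is not the patch code:
{code}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PageFilter;

// Illustrative only: limit the number of rows read with a PageFilter plus a
// client-side count (PageFilter is enforced per region, so the client must
// still stop itself once it has enough rows).
public final class PageFilterSketch {
  static void readLimitedRows(Table table, int limit) throws IOException {
    Scan scan = new Scan();
    scan.setFilter(new PageFilter(limit));
    int count = 0;
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result result : scanner) {
        if (++count > limit) {
          break;            // enforce the limit across regions
        }
        // process result ...
      }
    }
  }
}
{code}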

As for the FlowScanner issue mentioned above, [~vrushalic] and [~jrottinghuis] 
debugged this to track down a bug in YARN-3901. As such, this change will 
likely be made in the final YARN-3901 patch. I just included it here for 
completeness and to make the unit tests pass.

You should be able to apply the YARN-3901 v.3 patch and then this patch 
cleanly. Let me know if you have any questions.

I'd greatly appreciate review feedback. I understand it's a lot of code...

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3635) Get-queue-mapping should be a common interface of YarnScheduler

2015-09-04 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3635:
-
Attachment: YARN-3635.7.patch

Rebased to latest trunk (ver.7)

> Get-queue-mapping should be a common interface of YarnScheduler
> ---
>
> Key: YARN-3635
> URL: https://issues.apache.org/jira/browse/YARN-3635
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Wangda Tan
>Assignee: Tan, Wangda
> Attachments: YARN-3635.1.patch, YARN-3635.2.patch, YARN-3635.3.patch, 
> YARN-3635.4.patch, YARN-3635.5.patch, YARN-3635.6.patch, YARN-3635.7.patch
>
>
> Currently, both the fair and capacity schedulers support queue mapping, which 
> lets the scheduler change the queue of an application after it is submitted.
> One issue with doing this inside a specific scheduler is: if the queue after 
> mapping has a different maximum_allocation/default-node-label-expression from 
> the original queue, {{validateAndCreateResourceRequest}} in RMAppManager checks 
> the wrong queue.
> I propose to make queue mapping a common interface of the scheduler, and have 
> RMAppManager set the post-mapping queue before doing validations.
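
To make the proposal concrete, a purely illustrative sketch of such a common interface is below (names are hypothetical and this is not the actual YarnScheduler change in the attached patches):
{code}
// Purely illustrative: a common queue-mapping hook the RMAppManager could call
// before validateAndCreateResourceRequest(), so validation runs against the
// queue the application will actually land in after mapping rules apply.
public interface QueuePlacementResolver {
  /**
   * @param requestedQueue the queue named in the submission context
   * @param user           the submitting user
   * @return the queue the application will run in after mapping rules apply
   */
  String resolveQueue(String requestedQueue, String user);
}
{code}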



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731675#comment-14731675
 ] 

Jian He commented on YARN-4087:
---

Had some offline discussion with Vinod and Xuan: simply ignoring failures and 
letting the app continue is also not good, since the app will be lost after restart.
Uploaded a new patch which removes this part of the change. I'll open a new JIRA 
regarding how to handle the inconsistent state if ZK is unavailable. 



> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch, YARN-4087.6.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in 
> a production environment.
> 2. Fix the state-store to also notify the app/attempt if a state-store error 
> is ignored, so that the app/attempt is not stuck at a *_SAVING state.
> 3. If HA is enabled and there's any state-store error, after the retry 
> operation fails, we always transition the RM to standby state. Otherwise, we 
> may see two active RMs running. YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4024) YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat

2015-09-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731680#comment-14731680
 ] 

Hudson commented on YARN-4024:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1087 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1087/])
YARN-4024. YARN RM should avoid unnecessary resolving IP when NMs doing 
heartbeat. (Hong Zhiguo via wangda) (wangda: rev 
bcc85e3bab78bcacd430eac23141774465b96ef9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestNodesListManager.java
* hadoop-yarn-project/CHANGES.txt


> YARN RM should avoid unnecessary resolving IP when NMs doing heartbeat
> --
>
> Key: YARN-4024
> URL: https://issues.apache.org/jira/browse/YARN-4024
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Hong Zhiguo
> Fix For: 2.8.0
>
> Attachments: YARN-4024-draft-v2.patch, YARN-4024-draft-v3.patch, 
> YARN-4024-draft.patch, YARN-4024-v4.patch, YARN-4024-v5.patch, 
> YARN-4024-v6.patch, YARN-4024-v7.patch
>
>
> Currently, the YARN RM NodesListManager resolves the IP address every time a 
> node heartbeats. When the DNS server becomes slow, NM heartbeats get blocked 
> and cannot make progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731679#comment-14731679
 ] 

Wangda Tan commented on YARN-4091:
--

Thanks folks working on this design 
[~sunilg]/[~zhiguohong]/[~rohithsharma]/[~nijel]/[~Naganarasimha]!

Took a look at the design doc; I have also been thinking about this recently:

*Some general issues we need to think about before going too far:*
1) Since we can have thousands of node heartbeats per second, and there can be 
thousands of applications running concurrently in a cluster, we must consider 
the overhead of recording all of this.
2) Do we really need to record this per container?
3) How can YARN show this to the customer (especially the admin)?

*From my experience, the typical resource allocation troubleshooting questions are:*
1) Why do I have available resources on NMs, but my application cannot leverage 
them?
2) Why is the resource allocated to another app (queue/user) instead of mine?

And my typical approach to these issues is:
1) Enable debug logging for the scheduler.
2) Grep for a host_name (which the customer claims has available resources) and 
see what happened within one node heartbeat.

So, for this feature to be useful to me:
1) It should be able to capture the information of one node heartbeat
2) The captured information should have a hierarchy
3) It may look like:
{code}
heartbeat
  goto queue - a
    goto queue - a.a1
      goto app_1
        goto app_1.priority
          goto app_1.priority.resource_request
            check - queue capacity (passed)
            check - user limit (passed)
            check - node locality (failed)
      goto app_1 ..
  goto queue - b
{code}

In other words, it's a human-readable version of the DEBUG log for a single node heartbeat.

And I think admins can benefit from this as well.

Another point: we don't need to do this for every node heartbeat; doing it 
on demand for one single node heartbeat should be enough for most cases. 
The admin should know which node to look at.

*Some rough ideas about what the REST API could look like:*
REST Response:
- "What happened" (such as skip-because-of-locality / 
node-partition-not-matched, etc., AND status such as usedCapacity, etc.) and 
"who" (queue/user/app)
- Parent event (we may need a hierarchy of these events)

REST Request:
- It seems sending a nodeId to look at should be enough for now.

This could be an async API: the client requests the next allocation report for a 
given NodeId, and the scheduler responds with the report when it becomes ready.
The internal API could reference HTrace; I'm not sure if we can directly leverage 
HTrace to do such logging. I like the basic API design of HTrace, but we may not 
need complexity like Sampler/Storage, etc.

Thoughts?

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> --
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Improvement on debugdiagnostic information - YARN.pdf
>
>
> As schedulers gain various new capabilities, more configurations that tune 
> the schedulers start to take actions such as limiting container assignment 
> to an application, or introducing a delay before allocating a container, etc. 
> There is no clear information passed down from the scheduler to the outside 
> world under these various scenarios. This makes debugging much tougher.
> This ticket is an effort to introduce more defined states at the various points 
> in the scheduler where it skips/rejects container assignment, activates an 
> application, etc. Such information will help users know what is happening in 
> the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on it as we discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2884) Proxying all AM-RM communications

2015-09-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731682#comment-14731682
 ] 

Hadoop QA commented on YARN-2884:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  20m 59s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 6 new or modified test files. |
| {color:green}+1{color} | javac |   8m  4s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 15s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 26s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 20s | The applied patch generated  1 
new checkstyle issues (total was 211, now 211). |
| {color:red}-1{color} | checkstyle |   3m  4s | The applied patch generated  1 
new checkstyle issues (total was 0, now 1). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   7m  0s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   2m  2s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in 
hadoop-yarn-server-common. |
| {color:green}+1{color} | yarn tests |   7m 56s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| {color:red}-1{color} | yarn tests |  58m 24s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 121m 35s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754282/YARN-2884-V13.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / bcc85e3 |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/diffcheckstylehadoop-yarn-server-common.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9012/console |


This message was automatically generated.

> Proxying all AM-RM communications
> -
>
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Carlo Curino
>Assignee: Kishore Chaliparambil
> Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, 
> YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V13.patch, 
> YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, 
> YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch, 
> YARN-2884-V8.patch, YARN-2884-V9.patch
>
>
> We introduce the notion of an RMProxy, running on each node (or once per 
> rack). Upon start, the AM is forced (via tokens and configuration) to direct 
> all its requests to a new service running on the NM that provides a proxy to 
> 

[jira] [Commented] (YARN-4091) Improvement: Introduce more debug/diagnostics information to detail out scheduler activity

2015-09-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731689#comment-14731689
 ] 

Wangda Tan commented on YARN-4091:
--

I found my suggestion above is a little similar to YARN-4104; we're thinking of 
very similar things, [~zhiguohong]! :) 
Instead of a dry run, I'd like to get real data on demand. And we need a 
hierarchy for this data as well.

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> --
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Improvement on debugdiagnostic information - YARN.pdf
>
>
> As schedulers gain various new capabilities, more configurations that tune 
> the schedulers start to take actions such as limiting container assignment 
> to an application, or introducing a delay before allocating a container, etc. 
> There is no clear information passed down from the scheduler to the outside 
> world under these various scenarios. This makes debugging much tougher.
> This ticket is an effort to introduce more defined states at the various points 
> in the scheduler where it skips/rejects container assignment, activates an 
> application, etc. Such information will help users know what is happening in 
> the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on it as we discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2884) Proxying all AM-RM communications

2015-09-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730425#comment-14730425
 ] 

Hadoop QA commented on YARN-2884:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  21m 24s | Pre-patch trunk has 7 extant 
Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 6 new or modified test files. |
| {color:green}+1{color} | javac |   8m  4s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 16s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 35s | The applied patch generated  1 
new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace |   0m  2s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   7m  4s | The patch appears to introduce 1 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   2m  0s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |   0m 24s | Tests passed in 
hadoop-yarn-server-common. |
| {color:red}-1{color} | yarn tests |   6m 24s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| {color:red}-1{color} | yarn tests |   0m 19s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  62m 32s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-server-nodemanager |
| Failed unit tests | 
hadoop.yarn.server.nodemanager.containermanager.TestContainerManager |
|   | hadoop.yarn.server.nodemanager.TestNodeManagerReboot |
|   | 
hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor |
|   | hadoop.yarn.server.nodemanager.TestNodeManagerResync |
|   | hadoop.yarn.server.nodemanager.TestNodeStatusUpdater |
|   | hadoop.yarn.server.nodemanager.amrmproxy.TestAMRMProxyService |
|   | 
hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch |
|   | 
hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRecovery |
|   | 
hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
 |
|   | hadoop.yarn.server.nodemanager.TestNodeManagerShutdown |
| Failed build | hadoop-yarn-server-resourcemanager |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754145/YARN-2884-V12.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c83d13c |
| Pre-patch Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9006/console |


This message was automatically generated.

> Proxying all AM-RM communications
> -
>
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
>  Issue 

[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730432#comment-14730432
 ] 

Hadoop QA commented on YARN-4106:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 41s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 59s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  5s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 37s | The applied patch generated  3 
new checkstyle issues (total was 38, now 41). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 14s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   7m  3s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  46m  9s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.nodemanager.nodelabels.TestConfigurationNodeLabelsProvider |
|   | hadoop.yarn.server.nodemanager.TestNodeManagerReboot |
|   | hadoop.yarn.server.nodemanager.TestNodeManagerResync |
|   | hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754152/0003-YARN-4106.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c83d13c |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9007/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9007/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9007/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9007/console |


This message was automatically generated.

> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM 
> 
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 
> 0003-YARN-4106.patch
>
>
> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> Steps to reproduce
> ===
> # Configure node labels in distributed mode:
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM and the NM
> # Once NM registration is done, add node labels in the RM
> Node labels are not getting updated on the RM side 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2884) Proxying all AM-RM communications

2015-09-04 Thread Kishore Chaliparambil (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishore Chaliparambil updated YARN-2884:

Attachment: YARN-2884-V12.patch

This patch fixes the issue with rolling tokens in the proxy service. I tested 
the token-rolling feature by setting the time interval to a smaller value and 
submitting a long-running job as a different user. The jobs finished 
successfully.

The unused methods in AMRMProxyTokenSecretManager have also been removed.

I will create a new JIRA to add test cases that simulate the cluster and the 
proxy.


> Proxying all AM-RM communications
> -
>
> Key: YARN-2884
> URL: https://issues.apache.org/jira/browse/YARN-2884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Carlo Curino
>Assignee: Kishore Chaliparambil
> Attachments: YARN-2884-V1.patch, YARN-2884-V10.patch, 
> YARN-2884-V11.patch, YARN-2884-V12.patch, YARN-2884-V2.patch, 
> YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch, 
> YARN-2884-V6.patch, YARN-2884-V7.patch, YARN-2884-V8.patch, YARN-2884-V9.patch
>
>
> We introduce the notion of an RMProxy, running on each node (or once per 
> rack). Upon start, the AM is forced (via tokens and configuration) to direct 
> all its requests to a new service running on the NM that provides a proxy to 
> the central RM. 
> This gives us a place to (see the sketch after this list):
> 1) perform distributed scheduling decisions
> 2) throttle misbehaving AMs
> 3) mask access to a federation of RMs
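For illustration only, a rough pass-through sketch of that idea (this is not the attached patch; the class name and wiring are hypothetical, only {{ApplicationMasterProtocol}} and its request/response records come from hadoop-yarn-api):

{code}
import java.io.IOException;

import org.apache.hadoop.yarn.api.ApplicationMasterProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterResponse;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

/**
 * Illustrative pass-through proxy: the AM talks to this service on the NM,
 * which forwards each call to the real RM client. The comments mark where
 * throttling, local scheduling or federation routing could be inserted.
 */
public class SimpleAMRMProxy implements ApplicationMasterProtocol {

  private final ApplicationMasterProtocol rmClient; // connection to the central RM

  public SimpleAMRMProxy(ApplicationMasterProtocol rmClient) {
    this.rmClient = rmClient;
  }

  @Override
  public RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) throws YarnException, IOException {
    // e.g. record the AM, rewrite tokens, or pick a target RM in a federation
    return rmClient.registerApplicationMaster(request);
  }

  @Override
  public AllocateResponse allocate(AllocateRequest request)
      throws YarnException, IOException {
    // e.g. throttle a misbehaving AM or satisfy part of the ask locally
    return rmClient.allocate(request);
  }

  @Override
  public FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request) throws YarnException, IOException {
    return rmClient.finishApplicationMaster(request);
  }
}
{code}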



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-09-04 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730374#comment-14730374
 ] 

Varun Vasudev commented on YARN-3591:
-

+1 for the latest patch. I'll commit this tomorrow if no one objects.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Lavkesh Lahngir
>Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch, 
> YARN-3591.6.patch, YARN-3591.7.patch, YARN-3591.8.patch, YARN-3591.9.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad 
> afterwards. The NM keeps the paths of localised resources in memory. At the 
> time of a resource request, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases, when the disk has gone bad, the inodes are still cached and 
> file.exists() returns true, but the file cannot actually be opened for reading.
> Note: file.exists() calls stat64 natively, which returns true because it can 
> still find the inode information from the OS.
> The proposal is to call file.list() on the parent path of the resource, which 
> calls open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.
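A minimal sketch of the proposed check, assuming a hypothetical helper class and method name (the actual patch may differ):

{code}
import java.io.File;

public final class LocalResourceCheck {

  private LocalResourceCheck() {
  }

  /**
   * Instead of trusting file.exists() (stat64 can succeed on cached inodes
   * of a bad disk), list the parent directory, which forces an open() and
   * fails if the disk has gone bad.
   */
  public static boolean isResourceUsable(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return localizedPath.exists();
    }
    String[] children = parent.list();  // triggers open() on the directory
    if (children == null || children.length < 1) {
      return false;                     // unreadable parent: treat resource as missing
    }
    for (String name : children) {
      if (name.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}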



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Attachment: YARN-4087.5.patch

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to false by default, since this makes more sense in a 
> production environment (see the snippet below).
> 2. Fix the state-store to also notify the app/attempt when a state-store 
> error is ignored, so that the app/attempt does not get stuck in a *_SAVING 
> state.
> 3. If HA is enabled and there is any state-store error, always transition the 
> RM to standby after the retry operation has failed. Otherwise we may see two 
> active RMs running; YARN-4107 is one example.
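A small illustrative snippet of how point 1 would be consumed; the property names "yarn.fail-fast" and "yarn.resourcemanager.fail-fast" are assumed from the YARN-2019 work referenced above, and the defaults here only illustrate the described behaviour:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FailFastCheck {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Cluster-wide flag, off by default as proposed in point 1.
    boolean yarnFailFast = conf.getBoolean("yarn.fail-fast", false);
    // RM-specific flag falls back to the cluster-wide value.
    boolean rmFailFast = conf.getBoolean("yarn.resourcemanager.fail-fast",
        yarnFailFast);
    System.out.println("yarn.fail-fast=" + yarnFailFast
        + ", yarn.resourcemanager.fail-fast=" + rmFailFast);
  }
}
{code}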



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-04 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730380#comment-14730380
 ] 

Naganarasimha G R commented on YARN-4106:
-

[~bibinchundatt],
Good catch. This was originally added for testing and was later removed, but it 
got missed here. It is sufficient that {{intervalTime}} itself is used as the 
{{delay}} in {{AbstractNodeLabelsProvider.serviceStart}}, so we can remove 
{{startTime}}. Likewise, in {{ConfigurationNodeLabelsProvider}} we can remove 
{{startTime}} from {{serviceInit}} and, instead of calling 
{{updateNodeLabelsFromConfig}} there, override {{serviceStart()}} and invoke 
{{timertask.run}}. Please correct the test cases accordingly.
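A rough, hypothetical skeleton of the flow being suggested (class, field and method names are illustrative only, not the actual AbstractNodeLabelsProvider code):

{code}
import java.util.Timer;
import java.util.TimerTask;

/**
 * Run the label-fetch task once immediately at start, then reschedule it
 * using the configured interval as both delay and period, so no separate
 * startTime field is needed.
 */
abstract class NodeLabelsProviderSketch {

  private final long intervalMs;  // fetch-interval-ms from configuration
  private Timer timer;
  private TimerTask fetchTask;

  NodeLabelsProviderSketch(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  /** Subclasses fetch labels from their source (config file, script, ...). */
  protected abstract void updateLabels();

  public void serviceStart() {
    fetchTask = new TimerTask() {
      @Override
      public void run() {
        updateLabels();
      }
    };
    // First fetch happens right away via run(); the timer then reuses the
    // interval as the delay for subsequent runs.
    fetchTask.run();
    timer = new Timer("NodeLabelsProviderSketch", true);
    timer.scheduleAtFixedRate(fetchTask, intervalMs, intervalMs);
  }

  public void serviceStop() {
    if (timer != null) {
      timer.cancel();
    }
  }
}
{code}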


> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM 
> 
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch
>
>
> NodeLabels for the NM in distributed mode are not updated even after 
> cluster NodeLabel addition in the RM.
> Steps to reproduce
> ===
> # Configure node labels in distributed mode:
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM, then the NM
> # Once NM registration is done, add node labels in the RM
> Node labels are not getting updated on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-04 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4106:
---
Attachment: 0003-YARN-4106.patch

[~Naganarasimha]

Thanks for the comments and suggestions.
I have updated the patch as per your suggestion.
Please review.

> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM 
> 
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 
> 0003-YARN-4106.patch
>
>
> NodeLabels for the NM in distributed mode are not updated even after 
> cluster NodeLabel addition in the RM.
> Steps to reproduce
> ===
> # Configure node labels in distributed mode:
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM, then the NM
> # Once NM registration is done, add node labels in the RM
> Node labels are not getting updated on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-4087:
--
Attachment: (was: YARN-4087.4.patch)

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to false by default, since this makes more sense in a 
> production environment.
> 2. Fix the state-store to also notify the app/attempt when a state-store 
> error is ignored, so that the app/attempt does not get stuck in a *_SAVING 
> state.
> 3. If HA is enabled and there is any state-store error, always transition the 
> RM to standby after the retry operation has failed. Otherwise we may see two 
> active RMs running; YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4110) RMappImpl and RmAppAttemptImpl should override hashcode() & equals()

2015-09-04 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel updated YARN-4110:

Attachment: YARN-4110_1.patch

Attached the patch.
Please review.

> RMappImpl and RmAppAttemptImpl should override hashcode() & equals()
> 
>
> Key: YARN-4110
> URL: https://issues.apache.org/jira/browse/YARN-4110
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: nijel
> Attachments: YARN-4110_1.patch
>
>
> It is observed that RMAppImpl and RMAppAttemptImpl do not have hashCode() 
> and equals() implementations. These state objects should override them.
> # For RMAppImpl, we can make use of ApplicationId#hashCode and 
> ApplicationId#equals (see the sketch after this list).
> # Similarly, for RMAppAttemptImpl, we can use ApplicationAttemptId#hashCode 
> and ApplicationAttemptId#equals.
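An illustrative sketch of item 1 (the wrapper class here is hypothetical; only {{ApplicationId}} and its {{hashCode}}/{{equals}} come from the YARN API, and RMAppAttemptImpl would do the same with {{ApplicationAttemptId}}):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;

/** Delegates identity to the wrapped ApplicationId, as suggested above. */
public class RMAppIdentitySketch {

  private final ApplicationId applicationId;

  public RMAppIdentitySketch(ApplicationId applicationId) {
    this.applicationId = applicationId;
  }

  @Override
  public int hashCode() {
    return applicationId.hashCode();
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (obj == null || getClass() != obj.getClass()) {
      return false;
    }
    return applicationId.equals(((RMAppIdentitySketch) obj).applicationId);
  }
}
{code}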



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4114) In RM REST API for cluster/scheduler nodelabels in queues not listed properly

2015-09-04 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4114:
--

 Summary: In RM  REST API for cluster/scheduler nodelabels in 
queues not listed properly
 Key: YARN-4114
 URL: https://issues.apache.org/jira/browse/YARN-4114
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


*Request used*

Accept: application/xml
GET http:///ws/v1/cluster/scheduler

Response

{code}

10.0
...
RUNNING
...
false
1
3
0
0
0
..

{code}

Node labels for the queue should instead be listed explicitly, either as 
separate entries (1 and 3) or as a single comma-separated value (1,3).
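For reference, a minimal way to issue the request above from Java (the host is a placeholder and 8088 is the default RM web port; adjust both for the actual cluster):

{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SchedulerXmlFetch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://rm-host:8088/ws/v1/cluster/scheduler");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/xml");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);  // raw scheduler XML, including queue node labels
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}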



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs

2015-09-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14730522#comment-14730522
 ] 

Hadoop QA commented on YARN-4087:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  19m 11s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 43s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 49s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m 46s | The applied patch generated  2 
new checkstyle issues (total was 297, now 297). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 35s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   1m 58s | Tests passed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |  52m  9s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | | 101m  6s | |
\\
\\
|| Reason || Tests ||
| Timed out tests | 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754156/YARN-4087.5.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c83d13c |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9008/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9008/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9008/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9008/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9008/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9008/console |


This message was automatically generated.

> Followup fixes after YARN-2019 regarding RM behavior when state-store error 
> occurs
> --
>
> Key: YARN-4087
> URL: https://issues.apache.org/jira/browse/YARN-4087
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-4087.1.patch, YARN-4087.2.patch, YARN-4087.3.patch, 
> YARN-4087.5.patch
>
>
> Several fixes:
> 1. Set YARN_FAIL_FAST to false by default, since this makes more sense in a 
> production environment.
> 2. Fix the state-store to also notify the app/attempt when a state-store 
> error is ignored, so that the app/attempt does not get stuck in a *_SAVING 
> state.
> 3. If HA is enabled and there is any state-store error, always transition the 
> RM to standby after the retry operation has failed. Otherwise we may see two 
> active RMs running; YARN-4107 is one example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)