[jira] [Updated] (YARN-4156) TestAMRestart#testAMBlacklistPreventsRestartOnSameNode assumes CapacityScheduler

2015-09-19 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-4156:
---
Summary: TestAMRestart#testAMBlacklistPreventsRestartOnSameNode assumes 
CapacityScheduler  (was: testAMBlacklistPreventsRestartOnSameNode assumes 
CapacityScheduler)

> TestAMRestart#testAMBlacklistPreventsRestartOnSameNode assumes 
> CapacityScheduler
> 
>
> Key: YARN-4156
> URL: https://issues.apache.org/jira/browse/YARN-4156
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-4156.001.patch
>
>
> The test assumes the scheduler is CapacityScheduler without configuring it as 
> such. This causes it to fail if the default is something else, such as the 
> FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4191) Expose ApplicationMaster RPC port in ResourceManager REST endpoint

2015-09-19 Thread Richard Lee (JIRA)
Richard Lee created YARN-4191:
-

 Summary: Expose ApplicationMaster RPC port in ResourceManager REST 
endpoint
 Key: YARN-4191
 URL: https://issues.apache.org/jira/browse/YARN-4191
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: Richard Lee
Priority: Minor


Currently, the ResourceManager REST endpoint returns only the trackingUrl for 
the ApplicationMaster.  Some AMs, however, serve their REST endpoints on the RPC 
port, yet the RM does not expose the AM RPC port via its REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3920) FairScheduler: Limit node reservations to large containers

2015-09-19 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3920:
---
Summary: FairScheduler: Limit node reservations to large containers  (was: 
FairScheduler container reservation on a node should be configurable to limit 
it to large containers)

> FairScheduler: Limit node reservations to large containers
> --
>
> Key: YARN-3920
> URL: https://issues.apache.org/jira/browse/YARN-3920
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: Anubhav Dhoot
>Assignee: Anubhav Dhoot
> Attachments: YARN-3920.004.patch, YARN-3920.004.patch, 
> YARN-3920.004.patch, YARN-3920.004.patch, YARN-3920.005.patch, 
> yARN-3920.001.patch, yARN-3920.002.patch, yARN-3920.003.patch
>
>
> Reserving a node for a container was designed to prevent large containers 
> from being starved by small requests that keep landing on a node. Today we 
> allow this even for small container requests. This has a huge impact on 
> scheduling, since we block other scheduling requests on that node until the 
> reservation is fulfilled. We should make this configurable so its impact can 
> be minimized by limiting it to large container requests, as originally intended. 
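A minimal sketch of the configurable-threshold idea described above; the class name, field, and threshold semantics are illustrative assumptions, not the actual FairScheduler patch:
{code}
// Hypothetical policy object: reserve a node only for "large" requests, i.e.
// requests at or above a configurable memory threshold.
public final class ReservationPolicy {
  private final int reservableMemoryMb;   // assumed config value, e.g. 4096

  public ReservationPolicy(int reservableMemoryMb) {
    this.reservableMemoryMb = reservableMemoryMb;
  }

  /** Reserve only for requests at or above the configured threshold. */
  public boolean shouldReserve(int requestedMemoryMb) {
    return requestedMemoryMb >= reservableMemoryMb;
  }
}
{code}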



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2015-09-19 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877294#comment-14877294
 ] 

Varun Saxena commented on YARN-3816:


By the way, would calculating the area under the curve along the time dimension 
be useful by itself?
An average based on this area under the curve seems more useful. 
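For concreteness, a minimal sketch of a time-weighted average (area under the curve divided by the covered time span) over (timestamp, value) samples; the class and method names are made up for illustration:
{code}
import java.util.Map;
import java.util.TreeMap;

public final class TimeWeightedAverage {
  // Treat the metric as a step function: each value holds until the next sample.
  public static double average(TreeMap<Long, Double> samples) {
    if (samples.size() < 2) {
      return samples.isEmpty() ? 0.0 : samples.firstEntry().getValue();
    }
    double area = 0.0;
    long prevTs = samples.firstKey();
    double prevVal = samples.firstEntry().getValue();
    for (Map.Entry<Long, Double> e : samples.tailMap(prevTs, false).entrySet()) {
      area += prevVal * (e.getKey() - prevTs);   // accumulate area under the curve
      prevTs = e.getKey();
      prevVal = e.getValue();
    }
    return area / (samples.lastKey() - samples.firstKey());
  }
}
{code}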

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> 
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877387#comment-14877387
 ] 

Bibin A Chundatt commented on YARN-4155:


[~ste...@apache.org]

Also tried it locally in my setup; it's passing.
{noformat}
---
 T E S T S
---
Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
Tests run: 34, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 112.588 sec - 
in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

Results :

Tests run: 34, Failures: 0, Errors: 0, Skipped: 0

{noformat}

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

2015-09-19 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876928#comment-14876928
 ] 

Sangjin Lee commented on YARN-3816:
---

My apologies for truly belated review comments. I just had time to go over this 
in some depth after working on YARN-4074. I think the latest patch is much more 
aligned with the overall design, and thanks much for working on that patiently 
[~djp].

First off, this overlaps with YARN-4074 and YARN-4075, which are getting wrapped 
up. So it would be good if this goes in after those two JIRAs. Let me know if 
you're OK with that.

Also, I do have some basic questions and issues to discuss, and I'll mention 
them here. But I'm comfortable with having follow-on JIRAs after this one to 
address some of these that turn out to be major changes.

*(aggregating metrics from all types of entities to application)*
It appears that the current code will aggregate metrics from all types of 
entities to the application. This seems problematic to me.

The main goal of this aggregation is to roll up metrics from individual 
*containers* to the application. But just by having the same metric id, any 
entity can have its metric aggregated by this (incorrectly). For example, any 
arbitrary entity can simply declare a metric named "MEMORY". By virtue of that, 
it would get aggregated and added to the application-level value. There can be 
variations of this: for example, the same metrics can be reported by the 
container entity, the app attempt entity, and so on. Then the values may be 
double- or triple-counted.

I think we should ensure strongly that the aggregation happens only along the 
path of YARN container entities to application to prevent these accidental 
cases.

On a semi-related note, what happens if clients send metrics directly at the 
application entity level? We should expect most framework-specific AMs to do 
that. For example, MR AM already has all the job-level counters, and it can 
(and should) report those job-level counters as metrics at the YARN application 
entity. Is that case handled correctly, or will we end up getting incorrect 
values (double counting) in that situation?
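A rough sketch of the kind of guard I have in mind; this is purely illustrative (the entity-type string is the only piece taken from the timeline service model, everything else is made up):
{code}
// Hypothetical guard: only metrics reported on YARN container entities are
// rolled up into the application-level aggregate, so arbitrary entities (or
// app-level metrics posted directly by an AM) that reuse a metric id such as
// "MEMORY" are not accidentally double counted.
public final class AggregationGuard {
  private static final String YARN_CONTAINER = "YARN_CONTAINER";

  public static boolean shouldAggregateToApp(String entityType) {
    return YARN_CONTAINER.equals(entityType);
  }
}
{code}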

On to individual files:

(TimelineMetric.java)
- l.122: Although the method name is {{accumulateTo()}}, most of the variables 
and comments say "aggregate". Can we clean them up to say "accumulate"?

(TimelineMetricCalculator.java)
- we should add the annotations (public? unstable?)
- l.34: if {{n1 == null}}, shouldn't we return {{-n2}}?
- for both {{sub()}} and {{sum()}}, would it be simpler just to handle the 
arithmetic as longs even if they're integers?
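A rough sketch of what I mean by handling the arithmetic as longs; the real {{TimelineMetricCalculator}} signatures may differ, so treat this as illustration only:
{code}
// Illustrative only: null operands are treated as 0, so sub(null, n2) == -n2,
// and both int and long inputs are widened to long before the arithmetic.
public final class MetricArithmetic {
  public static Number sum(Number n1, Number n2) {
    long a = (n1 == null) ? 0L : n1.longValue();
    long b = (n2 == null) ? 0L : n2.longValue();
    return a + b;
  }

  public static Number sub(Number n1, Number n2) {
    long a = (n1 == null) ? 0L : n1.longValue();
    long b = (n2 == null) ? 0L : n2.longValue();
    return a - b;
  }
}
{code}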

(yarn-default.xml)
- The default defined in YarnConfiguration is true, but in yarn-default.xml it 
is false; which is correct? We should reconcile them.

(NMTimelinePublisher.java)
- Shouldn't these metrics set {{toAggregate}} to true (because the default is 
false)? These metrics are *THE* main ones we want to aggregate from containers 
to application, right? For that matter, should the default itself for 
{{toAggregate}} on {{TimelineMetric}} be true? I feel we should aggregate 
unless specified otherwise, not the other way around. Thoughts?

(TimelineCollector.java)
- l.124: nit: you can simply call {{aggregateMetrics()}} instead of 
{{TimelineCollector.aggregateMetrics()}}
- l.130: the same for {{appendAggregatedMetricsToEntities()}}
- l.212: What is the point of nulling out the value for metric id in 
{{perIdAggregatedNum}}? It doesn't seem necessary.

(TimelineReaderWebServices.java)
- I'm not so sure if we need a separate REST end point for "aggregates". If I 
understand correctly, they are all stored in the same application table under 
the same app id. What does it mean to have a separate REST URL for aggregates? 
Can we query for the application and be done?

(HBaseTimelineWriterImpl.java)
- I see that you're appending the {{toAggregate}} flag to the column name. I 
think it is fine for now, but we will need to look at this again, as there are 
other dimensions of metrics that need to be persisted. Some examples include 
single value v. time series, long v. float (possibly), and so on. We will need 
to arrive at a conclusion on how to encode them all cleanly and efficiently. We 
can address this later together with [~varun_saxena] as he's dealing with a 
related JIRA.

(HBaseTimelineReaderImpl.java)
- l.506: nit: it can just be
{code}
boolean toAggregate = toAggregateStr.equals("1");
{code}


> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> 
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: 

[jira] [Updated] (YARN-4167) NPE on RMActiveServices#serviceStop when store is null

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4167:
---
Attachment: 0002-YARN-4167.patch

Renamed the patch and uploaded it again to trigger Jenkins.


> NPE on RMActiveServices#serviceStop when store is null
> --
>
> Key: YARN-4167
> URL: https://issues.apache.org/jira/browse/YARN-4167
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4167.patch, 0001-YARN-4167.patch, 
> 0002-YARN-4167.patch
>
>
> Configure 
> {{yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs}} 
> so that it mismatches {{yarn.nm.liveness-monitor.expiry-interval-ms}}.
> On startup, an NPE is thrown in {{RMActiveServices#serviceStop}}:
> {noformat}
> 2015-09-16 12:23:29,504 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED; cause: 
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
> java.lang.IllegalArgumentException: 
> yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs should 
> be more than 3 X yarn.nm.liveness-monitor.expiry-interval-ms
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.(RMContainerTokenSecretManager.java:82)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.createContainerTokenSecretManager(RMSecretManagerService.java:109)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMSecretManagerService.(RMSecretManagerService.java:57)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createRMSecretManagerService(ResourceManager.java:)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:423)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193)
> 2015-09-16 12:23:29,507 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error closing 
> store.
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:608)
>  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>  at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>  at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:963)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1193
> {noformat}
> *Impact Area*: RM failover with wrong configuration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs

2015-09-19 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876931#comment-14876931
 ] 

Varun Saxena commented on YARN-4074:


LGTM.

> [timeline reader] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4074
> URL: https://issues.apache.org/jira/browse/YARN-4074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Sangjin Lee
> Attachments: YARN-4074-YARN-2928.007.patch, 
> YARN-4074-YARN-2928.008.patch, YARN-4074-YARN-2928.POC.001.patch, 
> YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, 
> YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, 
> YARN-4074-YARN-2928.POC.006.patch
>
>
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as 
> implementation of the API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4075) [reader REST API] implement support for querying for flows and flow runs

2015-09-19 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876934#comment-14876934
 ] 

Varun Saxena commented on YARN-4075:


Yes. Will rebase after 4074 goes in. Have to remove user as optional query 
param too, as discussed on 4074.

> [reader REST API] implement support for querying for flows and flow runs
> 
>
> Key: YARN-4075
> URL: https://issues.apache.org/jira/browse/YARN-4075
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
> Attachments: YARN-4075-YARN-2928.POC.1.patch, 
> YARN-4075-YARN-2928.POC.2.patch
>
>
> We need to be able to query for flows and flow runs via REST.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart

2015-09-19 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876958#comment-14876958
 ] 

Jian He commented on YARN-4000:
---

One more issue: there may be a container leak.
Depending on when the NM re-registers, it is possible that some containers are 
recovered even after the application gets the kill signal, in which case those 
containers are leaked.

One solution I can think of is that, given that 
CapacityScheduler#doneApplicationAttempt and recoverContainersOnNode are 
synchronized, we can check whether the RMAppAttempt is in a 
final (FINISHED/FAILED/KILLED) state inside recoverContainersOnNode and skip 
recovering the container if it is.
It would be great if you could add a test case for this.
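A rough sketch of the suggested check; the enum and helper here are illustrative stand-ins, not the actual RMAppAttemptState or CapacityScheduler code:
{code}
import java.util.EnumSet;

// Hypothetical helper: skip container recovery when the owning attempt has
// already reached a final state, so killed applications do not leak containers.
public final class RecoveryGuard {
  public enum AttemptState { NEW, RUNNING, FINISHED, FAILED, KILLED }

  private static final EnumSet<AttemptState> FINAL_STATES =
      EnumSet.of(AttemptState.FINISHED, AttemptState.FAILED, AttemptState.KILLED);

  /** Returns true if the container reported by the NM should be recovered. */
  public static boolean shouldRecover(AttemptState attemptState) {
    return !FINAL_STATES.contains(attemptState);
  }
}
{code}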

> RM crashes with NPE if leaf queue becomes parent queue during restart
> -
>
> Key: YARN-4000
> URL: https://issues.apache.org/jira/browse/YARN-4000
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-4000.01.patch, YARN-4000.02.patch, 
> YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch
>
>
> This is a similar situation to YARN-2308.  If an application is active in 
> queue A and then the RM restarts with a changed capacity scheduler 
> configuration where queue A becomes a parent queue to other subqueues then 
> the RM will crash with a NullPointerException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4140:
---
Attachment: 0007-YARN-4140.patch

Hi [~leftnoteasy]

Details of the test-case failures:
{noformat}
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
hadoop.yarn.server.resourcemanager.TestClientRMService 
{noformat}
These two are not related to the uploaded patch.
{noformat}
hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
{noformat}
Fixed.

Jenkins was not triggered due to a {{git pull failure}}. Uploading a new patch to 
trigger Jenkins again.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. 
> In total 3 machines; 2 machines were in the same partition and the app was submitted to it.
> After enabling debug I was able to find the below:
> # From the AM the container ask is for OFF_SWITCH
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # So since I had about 500 containers, it took about 6 minutes 
> to allocate the 1st map after AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the next 
> container after AM allocation.
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done on OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> 

[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877032#comment-14877032
 ] 

Bibin A Chundatt commented on YARN-4140:


Failures are not due to the current patch.
{noformat}
hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter 
{noformat}
Reason: class not found; and the {{TestNodeLabelContainerAllocation}} test cases 
fail on trunk even without the patch.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. 
> In total 3 machines; 2 machines were in the same partition and the app was submitted to it.
> After enabling debug I was able to find the below:
> # From the AM the container ask is for OFF_SWITCH
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # So since I had about 500 containers, it took about 6 minutes 
> to allocate the 1st map after AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the next 
> container after AM allocation.
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done on OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, 

[jira] [Commented] (YARN-4178) [storage implementation] app id as string can cause incorrect ordering

2015-09-19 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877046#comment-14877046
 ] 

Varun Saxena commented on YARN-4178:


bq. we certainly have to store the application_ part.
I think we can use the ApplicationId class for it. If the prefix changes, ApplicationId 
will change as well. As I said above, we can use ApplicationId#toString to 
reconvert it. Wouldn't it be fair to assume that any changes in the application id 
format will be reflected in the ApplicationId class?
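A minimal sketch of that round trip, assuming the existing {{ConverterUtils#toApplicationId(String)}} and {{ApplicationId#toString()}} helpers; the key layout itself is only an illustration, not the proposed schema:
{code}
import java.nio.ByteBuffer;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.util.ConverterUtils;

public final class AppIdKey {
  // Parse the "application_<clusterTs>_<seq>" string once and store the numeric
  // parts (8-byte timestamp + 4-byte sequence), so application_1234567890_99
  // sorts before application_1234567890_100.
  public static byte[] toOrderedKey(String appIdStr) {
    ApplicationId appId = ConverterUtils.toApplicationId(appIdStr);
    return ByteBuffer.allocate(12)
        .putLong(appId.getClusterTimestamp())
        .putInt(appId.getId())
        .array();
  }

  // Rebuild the string form on the read path.
  public static String toAppIdString(long clusterTimestamp, int id) {
    return ApplicationId.newInstance(clusterTimestamp, id).toString();
  }
}
{code}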

> [storage implementation] app id as string can cause incorrect ordering
> --
>
> Key: YARN-4178
> URL: https://issues.apache.org/jira/browse/YARN-4178
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
>
> Currently the app id is used in various places as part of row keys and in 
> column names. However, they are treated as strings for the most part. This 
> will cause a problem with ordering when the id portion of the app id rolls 
> over to the next digit.
> For example, "app_1234567890_100" will be considered *earlier* than 
> "app_1234567890_99". We should correct this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4178) [storage implementation] app id as string can cause incorrect ordering

2015-09-19 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877044#comment-14877044
 ] 

Varun Saxena commented on YARN-4178:


As appId is part of the entity table row key, on second thought, containers and 
app attempts shouldn't be an issue.

> [storage implementation] app id as string can cause incorrect ordering
> --
>
> Key: YARN-4178
> URL: https://issues.apache.org/jira/browse/YARN-4178
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Sangjin Lee
>Assignee: Varun Saxena
>
> Currently the app id is used in various places as part of row keys and in 
> column names. However, they are treated as strings for the most part. This 
> will cause a problem with ordering when the id portion of the app id rolls 
> over to the next digit.
> For example, "app_1234567890_100" will be considered *earlier* than 
> "app_1234567890_99". We should correct this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4152:
---
Attachment: 0003-YARN-4152.patch

Hi [~sunilg]

Thanks for the comments. Updated the patch as per the comments.
The operation performed was an app kill, and the container entry is not 
available in the context.
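For reference, a rough sketch of the guard discussed here, written as a fragment that assumes {{LogAggregationService}}'s existing {{context}} and {{LOG}} fields (not the exact patch):
{code}
// Inside stopContainer(): bail out when the container is no longer tracked by
// the NM context (e.g. the app was killed and the entry was already removed),
// instead of dereferencing a null entry.
Container container = context.getContainers().get(containerId);
if (container == null) {
  LOG.warn("Ignoring CONTAINER_FINISHED for absent container " + containerId);
  return;
}
{code}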

> NM crash with NPE when LogAggregationService#stopContainer called for absent 
> container
> --
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, 
> 0003-YARN-4152.patch
>
>
> NM crash during log aggregation.
> Ran a Pi job with 500 containers and killed the application in between.
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_e51_1442063466801_0001_01_99 is : 143
> 2015-09-12 18:44:25,670 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e51_1442063466801_0001_01_000101 from application 
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
> OPERATION=Container Finished - SucceededTARGET=ContainerImpl
> RESULT=SUCCESS  APPID=application_1442063466801_0001
> CONTAINERID=container_e51_1442063466801_0001_01_000100
> {code}
> *Analysis*
> Looks like {{stopContainer}} is called even for an absent container:
> {code}
>   case CONTAINER_FINISHED:
> LogHandlerContainerFinishedEvent containerFinishEvent =
> (LogHandlerContainerFinishedEvent) event;
> stopContainer(containerFinishEvent.getContainerId(),
> containerFinishEvent.getExitCode());
> break;
> {code}
> *Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101*
> Should skip when {{null==context.getContainers().get(containerId)}} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4131) Add API and CLI to kill container on given containerId

2015-09-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877090#comment-14877090
 ] 

Steve Loughran commented on YARN-4131:
--

[~adhoot] too late, I'm afraid. The summary and ongoing work are in YARN-1897; 
the plan is to get Ming Ma's code (used in production) in.

> Add API and CLI to kill container on given containerId
> --
>
> Key: YARN-4131
> URL: https://issues.apache.org/jira/browse/YARN-4131
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, client
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-4131-demo-2.patch, YARN-4131-demo.patch, 
> YARN-4131-v1.1.patch, YARN-4131-v1.2.patch, YARN-4131-v1.patch
>
>
> Per YARN-3337, we need a handy tool to kill containers in some scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-09-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877022#comment-14877022
 ] 

Hadoop QA commented on YARN-4140:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 40s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 47s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 58s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 51s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |  57m 24s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  96m 39s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
 |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
|   | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12761269/0007-YARN-4140.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / e3ace31 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9218/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9218/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9218/console |


This message was automatically generated.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch
>
>
> Trying to run an application on a Nodelabel partition, I found that the 
> application execution time is delayed by 5 – 10 min for 500 containers. 
> In total 3 machines; 2 machines were in the same partition and the app was submitted to it.
> After enabling debug I was able to find the below:
> # From the AM the container ask is for OFF_SWITCH
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # So since I had about 500 containers, it took about 6 minutes 
> to allocate the 1st map after AM allocation.
> # Tested with about 1K maps using a PI job; it took 17 minutes to allocate the next 
> container after AM allocation.
> Once the 500 container allocations on NODE_LOCAL are done, the next container 
> allocation is done on OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> 

[jira] [Commented] (YARN-4152) NM crash with NPE when LogAggregationService#stopContainer called for absent container

2015-09-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877026#comment-14877026
 ] 

Hadoop QA commented on YARN-4152:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 38s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 51s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  5s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 37s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 13s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   7m 55s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  46m 48s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12761276/0003-YARN-4152.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / e3ace31 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9219/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9219/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9219/console |


This message was automatically generated.

> NM crash with NPE when LogAggregationService#stopContainer called for absent 
> container
> --
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch, 
> 0003-YARN-4152.patch
>
>
> NM crash during log aggregation.
> Ran a Pi job with 500 containers and killed the application in between.
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_e51_1442063466801_0001_01_99 is : 143
> 2015-09-12 18:44:25,670 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e51_1442063466801_0001_01_000101 from application 
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf   
> OPERATION=Container Finished - SucceededTARGET=ContainerImpl
> RESULT=SUCCESS  

[jira] [Updated] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4176:
---
Attachment: 0001-YARN-4176.patch

Attaching a patch for review.

[~Naganarasimha]
# Have taken care of the configuration naming.
# Haven't changed the System.currentTimeMillis usage as of now.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues:
> # Distributed nodelabels: after the NM registers with the RM, if the cluster 
> nodelabels are removed and re-added, the NM doesn't resend its labels in the 
> heartbeat again until there is a change in the labels.
> # If NM registration with nodelabels fails, the labels should be resent to the RM.
> The above cases can be handled by resyncing node labels with the RM every x interval:
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> that resends nodelabels to the RM based on this config, regardless of whether 
> registration fails or succeeds.
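A very rough sketch of the interval-based resend idea; the property value is assumed to be read elsewhere, and the executor and callback are illustrative only, not the NodeStatusUpdater code:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class NodeLabelsResyncer {
  // Periodically resend the current node labels to the RM, regardless of
  // whether the previous registration or heartbeat report succeeded.
  public static ScheduledExecutorService start(long resyncIntervalMs,
      Runnable sendLabelsToRM) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(sendLabelsToRM, resyncIntervalMs,
        resyncIntervalMs, TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
{code}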



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877131#comment-14877131
 ] 

Steve Loughran commented on YARN-4155:
--

failed on local test run
{code}
---
 T E S T S
---
Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
Tests run: 34, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 109.991 sec 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
testLocalFileDeletionOnDiskFull(org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService)
  Time elapsed: 0.176 sec  <<< FAILURE!
java.lang.AssertionError: Log file 
[/Users/stevel/Hadoop/commit/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService-remoteLogDir/nobody/logs/application_1234_0001/0.0.0.0_]
 not found
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.verifyLocalFileDeletion(TestLogAggregationService.java:235)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.testLocalFileDeletionOnDiskFull(TestLogAggregationService.java:285)


Results :

Failed tests: 
  
TestLogAggregationService.testLocalFileDeletionOnDiskFull:285->verifyLocalFileDeletion:235
 Log file 
[/Users/stevel/Hadoop/commit/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService-remoteLogDir/nobody/logs/application_1234_0001/0.0.0.0_]
 not found

Tests run: 34, Failures: 1, Errors: 0, Skipped: 0
{code}

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877142#comment-14877142
 ] 

Bibin A Chundatt commented on YARN-4155:


[~ste...@apache.org]
Seems this time 
{{org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.testLocalFileDeletionOnDiskFull}}
 is failing. Checked {{testLogAggregationServiceWithInterval}} in the patch. 
Should I handle {{testLocalFileDeletionOnDiskFull}} in this JIRA?

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4176) Resync NM nodelabels with RM every x interval for distributed nodelabels

2015-09-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877151#comment-14877151
 ] 

Hadoop QA commented on YARN-4176:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 11s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m  8s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 24s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 25s | The applied patch generated  1 
new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 55s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   0m 23s | Tests failed in 
hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests |   7m 53s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  52m  5s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.conf.TestYarnConfigurationFields |
|   | hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12761291/0001-YARN-4176.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c39ddc3 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9220/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9220/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9220/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9220/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9220/console |


This message was automatically generated.

> Resync NM nodelabels with RM every x interval for distributed nodelabels
> 
>
> Key: YARN-4176
> URL: https://issues.apache.org/jira/browse/YARN-4176
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4176.patch
>
>
> This JIRA is for handling the below set of issues:
> # Distributed nodelabels: after the NM registers with the RM, if the cluster 
> nodelabels are removed and re-added, the NM doesn't resend its labels in the 
> heartbeat again until there is a change in the labels.
> # If NM registration with nodelabels fails, the labels should be resent to the RM.
> The above cases can be handled by resyncing node labels with the RM every x interval:
> # Add a property {{yarn.nodemanager.node-labels.provider.resync-interval-ms}} 
> that resends nodelabels to the RM based on this config, regardless of whether 
> registration fails or succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4177) yarn.util.Clock should not be used to time a duration or time interval

2015-09-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877134#comment-14877134
 ] 

Steve Loughran commented on YARN-4177:
--

As of this week, I don't trust monotonic clocks on multi-socket servers: 
http://steveloughran.blogspot.co.uk/2015/09/time-on-multi-core-multi-socket-servers.html



> yarn.util.Clock should not be used to time a duration or time interval
> --
>
> Key: YARN-4177
> URL: https://issues.apache.org/jira/browse/YARN-4177
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
> Attachments: YARN-4177.001.patch, YARN-4177.002.patch
>
>
> There are many places that use Clock to time intervals, which is dangerous, as 
> commented by [~ste...@apache.org] in HADOOP-12409. Instead, we should use 
> hadoop.util.Timer#monotonicNow() to get monotonic time, or we could provide a 
> MonotonicClock in yarn.util for consistency of the code.
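A minimal sketch of timing a duration with a monotonic source; I'm assuming {{org.apache.hadoop.util.Time#monotonicNow()}} from hadoop-common here, and the wrapper class is illustrative only:
{code}
import org.apache.hadoop.util.Time;

public final class DurationTimer {
  // Measures elapsed time without being affected by wall-clock adjustments.
  public static long timeMillis(Runnable work) {
    long start = Time.monotonicNow();
    work.run();
    return Time.monotonicNow() - start;
  }
}
{code}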



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-09-19 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877157#comment-14877157
 ] 

Bibin A Chundatt commented on YARN-4155:


Also tried running it locally through Eclipse; it's passing in my setup.


> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)