[jira] [Updated] (YARN-2886) Estimating waiting time in NM container queues
[ https://issues.apache.org/jira/browse/YARN-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2886: -- Parent Issue: YARN-4742 (was: YARN-2877) > Estimating waiting time in NM container queues > -- > > Key: YARN-2886 > URL: https://issues.apache.org/jira/browse/YARN-2886 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > > This JIRA is about estimating the waiting time of each NM queue. > Having these estimates is crucial for the distributed scheduling of container > requests, as it allows the LocalRM to decide in which NMs to queue the > queueable container requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
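The JIRA above does not specify how the NM queue wait time would be computed; a minimal sketch of one plausible estimator (queue length times a moving average of task duration; all names here are hypothetical, not from any patch):

```java
// Hypothetical sketch: estimate an NM queue's wait time as
// (current queue length) x (exponential moving average of task duration).
class QueueWaitEstimator {
    private double meanTaskMillis;  // EMA of observed container run time
    private final double alpha;     // smoothing factor in (0, 1]

    QueueWaitEstimator(double alpha, double initialMeanMillis) {
        this.alpha = alpha;
        this.meanTaskMillis = initialMeanMillis;
    }

    /** Update the moving average when a queued container finishes. */
    void recordCompletion(long durationMillis) {
        meanTaskMillis = alpha * durationMillis + (1 - alpha) * meanTaskMillis;
    }

    /** Estimated wait for a newly queued container, given the queue length. */
    long estimatedWaitMillis(int queueLength) {
        return Math.round(queueLength * meanTaskMillis);
    }
}
```

A LocalRM could then rank NMs by `estimatedWaitMillis` when choosing where to queue a queueable request; the real design may weigh other signals as well.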
[jira] [Updated] (YARN-4631) Add specialized Token support for DistributedSchedulingProtocol
[ https://issues.apache.org/jira/browse/YARN-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4631: -- Parent Issue: YARN-4742 (was: YARN-2877) > Add specialized Token support for DistributedSchedulingProtocol > --- > > Key: YARN-4631 > URL: https://issues.apache.org/jira/browse/YARN-4631 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Arun Suresh >Assignee: Arun Suresh > > The {{DistributedSchedulingProtocol}} introduced in YARN-2885 extends > the {{ApplicationMasterProtocol}}. This protocol should support its own Token > type, and not just reuse the AMRMToken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4742) [Umbrella] Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170421#comment-15170421 ] Arun Suresh commented on YARN-4742: --- The subtasks specified here can be worked on independently and need not be developed on a feature branch > [Umbrella] Enhancements to Distributed Scheduling > - > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4742) [Umbrella] Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4742: -- Summary: [Umbrella] Enhancements to Distributed Scheduling (was: Enhancements to Distributed Scheduling) > [Umbrella] Enhancements to Distributed Scheduling > - > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 > These can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4742) [Umbrella] Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4742: -- Description: This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 (was: This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 These can) > [Umbrella] Enhancements to Distributed Scheduling > - > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4742) Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4742: -- Description: This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 These can was:This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 > Enhancements to Distributed Scheduling > -- > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 > These can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2995) Enhance UI to show cluster resource utilization of various container types
[ https://issues.apache.org/jira/browse/YARN-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2995: -- Parent Issue: YARN-4742 (was: YARN-2877) > Enhance UI to show cluster resource utilization of various container types > -- > > Key: YARN-2995 > URL: https://issues.apache.org/jira/browse/YARN-2995 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sriram Rao > > This JIRA proposes to extend the Resource manager UI to show how cluster > resources are being used to run *guaranteed start* and *queueable* > containers. For example, a graph that shows over time, the fraction of > running containers that are *guaranteed start* and the fraction of running > containers that are *queueable*. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4503) Allow for a pluggable policy to decide if a ResourceRequest is GUARANTEED or not
[ https://issues.apache.org/jira/browse/YARN-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4503: -- Parent Issue: YARN-4742 (was: YARN-2877) > Allow for a pluggable policy to decide if a ResourceRequest is GUARANTEED or > not > > > Key: YARN-4503 > URL: https://issues.apache.org/jira/browse/YARN-4503 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Arun Suresh >Assignee: Arun Suresh > > As per discussions on the YARN-2882 thread, specifically [this > comment|https://issues.apache.org/jira/browse/YARN-2882?focusedCommentId=15065547=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15065547], > we would require a pluggable policy that can decide if a ResourceRequest is > GUARANTEED or OPPORTUNISTIC -- This message was sent by Atlassian JIRA (v6.3.4#6332)
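The pluggable policy proposed above is left abstract in the JIRA; a minimal sketch of what such a policy interface might look like (names and the priority-threshold heuristic are assumptions for illustration, not the actual patch):

```java
// Hypothetical sketch of a pluggable policy that classifies a resource
// request as GUARANTEED or OPPORTUNISTIC, per the YARN-4503 discussion.
enum RequestType { GUARANTEED, OPPORTUNISTIC }

interface RequestClassificationPolicy {
    RequestType classify(int priority);
}

// One possible implementation: requests at or above a configured priority
// threshold run as GUARANTEED (lower number = higher priority in YARN).
class PriorityThresholdPolicy implements RequestClassificationPolicy {
    private final int threshold;

    PriorityThresholdPolicy(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public RequestType classify(int priority) {
        return priority <= threshold ? RequestType.GUARANTEED
                                     : RequestType.OPPORTUNISTIC;
    }
}
```

Other policies (e.g. per-queue quotas on guaranteed requests) would plug in behind the same interface.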
[jira] [Updated] (YARN-2895) Integrate distributed scheduling with capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2895: -- Parent Issue: YARN-4742 (was: YARN-2877) > Integrate distributed scheduling with capacity scheduler > > > Key: YARN-2895 > URL: https://issues.apache.org/jira/browse/YARN-2895 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > > There are some benefits to integrating the distributed scheduling mechanism (LocalRM) > with the capacity scheduler: > - Resource usage of opportunistic containers can be tracked by the central RM and > capacity could be enforced > - Opportunity to transfer an opportunistic container to a conservative container -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4742) Enhancements to Distributed Scheduling
Arun Suresh created YARN-4742: - Summary: Enhancements to Distributed Scheduling Key: YARN-4742 URL: https://issues.apache.org/jira/browse/YARN-4742 Project: Hadoop YARN Issue Type: Bug Reporter: Arun Suresh Assignee: Arun Suresh This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170406#comment-15170406 ] Silnov commented on YARN-4728: -- Varun Saxena, thanks for your response! I have checked MAPREDUCE-6513. The scenario is similar to what you described. I'll learn from it :) > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration largely remains at the defaults. > I run some jobs on the cluster (especially jobs processing a lot of data) > every day. > Sometimes my job stays at the same progress value for a very very long > time, so I have to kill the job manually and re-submit it to the cluster. This > worked before (the re-submitted job ran to the end), but something went wrong > today. > After I re-submitted the same job 3 times, it deadlocked (the progress > doesn't change for a long time, and each attempt stalls at a different > progress value, e.g. 33.01%, 45.8%, 73.21%). > I checked the web UI for Hadoop and found 98 map tasks suspended while the > running reduce tasks had consumed all the available memory. I stopped YARN, > added the configuration below to yarn-site.xml, and restarted YARN. > yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (intending YARN to preempt the reduce tasks' resources to run the suspended > map tasks) > After restarting YARN, I submitted the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1. 
> but the same result happened again!! (my job remained at the same progress > value for a very very long time) > I checked the web UI for Hadoop again, and found that the suspended map tasks > were re-created with the note: "TaskAttempt killed because it ran on > unusable node node02:21349". > Then I checked the resourcemanager's log and found some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good, > which causes the RM not to receive the NM's heartbeat in time. > But I wonder why the YARN framework can't preempt the running reduce tasks' > resources to run the suspended map tasks? (this leaves the job at the same > progress value for a very very long time :( ) > Anyone able to help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
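The two property names quoted in the message above appear to have lost their XML wrapping during extraction; in the poster's yarn-site.xml they would have looked roughly like the fragment below (note that `yarn.app.mapreduce.am.*` properties are read by the MapReduce AM, so mapred-site.xml is their more usual home):

```xml
<configuration>
  <property>
    <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
    <value>0.1</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
    <value>1.0</value>
  </property>
</configuration>
```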
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170403#comment-15170403 ] Silnov commented on YARN-4728: -- zhihai xu, thanks for your response! I will try to make some changes following your advice! > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration largely remains at the defaults. > I run some jobs on the cluster (especially jobs processing a lot of data) > every day. > Sometimes my job stays at the same progress value for a very very long > time, so I have to kill the job manually and re-submit it to the cluster. This > worked before (the re-submitted job ran to the end), but something went wrong > today. > After I re-submitted the same job 3 times, it deadlocked (the progress > doesn't change for a long time, and each attempt stalls at a different > progress value, e.g. 33.01%, 45.8%, 73.21%). > I checked the web UI for Hadoop and found 98 map tasks suspended while the > running reduce tasks had consumed all the available memory. I stopped YARN, > added the configuration below to yarn-site.xml, and restarted YARN. > yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (intending YARN to preempt the reduce tasks' resources to run the suspended > map tasks) > After restarting YARN, I submitted the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1. 
> but the same result happened again!! (my job remained at the same progress > value for a very very long time) > I checked the web UI for Hadoop again, and found that the suspended map tasks > were re-created with the note: "TaskAttempt killed because it ran on > unusable node node02:21349". > Then I checked the resourcemanager's log and found some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good, > which causes the RM not to receive the NM's heartbeat in time. > But I wonder why the YARN framework can't preempt the running reduce tasks' > resources to run the suspended map tasks? (this leaves the job at the same > progress value for a very very long time :( ) > Anyone able to help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170397#comment-15170397 ] Varun Saxena commented on YARN-4700: This event can be emitted on recovery for RUNNING apps as well. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
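The description above suspects that the default cluster id embeds the RM start timestamp and that the cluster id is part of the row key. A sketch of why that combination yields one extra record per restart (the key layout and names here are illustrative, not the actual ATS v2 schema):

```java
// Hypothetical sketch of the suspected YARN-4700 behavior: if the default
// cluster id embeds the RM start time, the row key for the SAME application
// differs after every RM restart, so replayed events create a new record.
class TimelineRowKeySketch {
    static String clusterId(long rmStartMillis) {
        return "yarn-cluster_" + rmStartMillis;  // restart => new timestamp
    }

    static String appRowKey(String clusterId, String user,
                            String flow, String appId) {
        return String.join("!", clusterId, user, flow, appId);
    }
}
```

A stable default cluster id (one that does not change across restarts) would make the replayed write land on the existing row instead.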
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170387#comment-15170387 ] Varun Saxena commented on YARN-4700: If we do not fix it in the RM, as was the conclusion in those 2 JIRAs, and we use the current timestamp, we will have to peek into the app table to see whether the app already exists, and if it does, not write an entry to the activity table. The reason, as it seems from the discussion on those JIRAs, is that ATS events are asynchronous (because we use a dispatcher in between), so it is better to replay the events. Maybe we can give priority to app start and finish events and make them synchronous for V2, as the app collector will be running within the RM, but this would block app finishing until the ATS event completes. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170376#comment-15170376 ] Varun Saxena commented on YARN-4700: bq. I think we're using the day timestamp for a reason as this table is supposed to be a flow (daily) activity table. And some considerations are given to long running apps that will cross the day boundaries. Let us assume the RM does not restart. In that case, we will get the start event and the finish event only once each, and the event timestamp will be close to the current timestamp. And if those are the only events we get, the issue with long-running apps (extending over more than 2 days) will be there anyway. For instance, if we get the start event on day 1 and the finish event on day 3, and there is no other app for this flow, this will lead to no activity on day 2 even if we use the current timestamp. There was YARN-4069 which was filed for this issue and it is assigned to me. I was thinking of scheduling a global timer in the RM which can emit ATS events for all the running apps at a certain point of time. This should resolve the long-running app issue. It is not marked for the 1st milestone though, so no progress has been made. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
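The comment above turns on the flow activity table being keyed by a "day timestamp": events are bucketed to the top of their day, so a flow with a start event on day 1 and a finish event on day 3 records nothing for day 2. A sketch of that day-rounding (the name `topOfDay` is illustrative, not the actual utility in the codebase):

```java
// Sketch of top-of-the-UTC-day rounding, as used to bucket events into a
// daily flow activity table: any timestamp within a day maps to the same key.
class DayKey {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    static long topOfDay(long tsMillis) {
        return tsMillis - (tsMillis % DAY_MILLIS);  // midnight UTC of that day
    }
}
```

With this bucketing, only days on which some event actually arrives get a row, which is why the comment proposes a periodic timer emitting events for still-running apps.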
[jira] [Commented] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170335#comment-15170335 ] Rohith Sharma K S commented on YARN-4624: - [~brahmareddy] can you check findbugs warnings? > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: SchedulerUIWithOutLabelMapping.png, YARN-2674-002.patch, > YARN-4624-003.patch, YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
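The stack trace above points at `PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage`; an NPE there is consistent with a boxed `Float` field that is null for partitions without a value and throws on auto-unboxing. A null-safe sketch of that shape (the field and default are assumptions for illustration, not the actual patch):

```java
// Hypothetical sketch of the kind of fix the YARN-4624 stack trace suggests:
// a DAO field held as a boxed Float can be null when a partition has no
// capacity value, and auto-unboxing it in the renderer throws the NPE.
class PartitionCapacitiesSketch {
    private Float maxAMLimitPercentage;  // null when no value is configured

    void setMaxAMLimitPercentage(Float v) {
        maxAMLimitPercentage = v;
    }

    // Null-safe accessor: return a default instead of NPE on unboxing.
    float getMaxAMLimitPercentage() {
        return maxAMLimitPercentage == null ? 0f : maxAMLimitPercentage;
    }
}
```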
[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170307#comment-15170307 ] Rohith Sharma K S commented on YARN-4741: - I am not quite sure whether this is the same as YARN-3990. Based on the affected version, I suspect it might be the same issue. On the other hand, looking at the event type, it may also be a new issue. Anyway, [~sjlee0] can you cross-verify that the fix of YARN-3990 is present in your cluster? > RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async > dispatcher event queue > --- > > Key: YARN-4741 > URL: https://issues.apache.org/jira/browse/YARN-4741 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Sangjin Lee >Priority: Critical > > We had a pretty major incident with the RM where it was continually flooded > with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event > queue. > In our setup, we had the RM HA or stateful restart *disabled*, but NM > work-preserving restart *enabled*. Due to other issues, we did a cluster-wide > NM restart. > Some time during the restart (which took multiple hours), we started seeing > the async dispatcher event queue building. Normally it would log 1,000. In > this case, it climbed all the way up to tens of millions of events. 
> When we looked at the RM log, it was full of the following messages: > {noformat} > 2016-02-18 01:47:29,530 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > {noformat} > And that node in question was restarted a few minutes earlier. > When we inspected the RM heap, it was full of > RMNodeFinishedContainersPulledByAMEvents. > Suspecting the NM work-preserving restart, we disabled it and did another > cluster-wide rolling restart. Initially that seemed to have helped reduce the > queue size, but the queue built back up to several millions and continued for > an extended period. We had to restart the RM to resolve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170234#comment-15170234 ] Hadoop QA commented on YARN-2883: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 16m 19s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 45s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 51s {color} | {color:green} yarn-2877 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 9s {color} | {color:green} yarn-2877 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 17s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} yarn-2877 passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 41s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common in yarn-2877 has 3 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 14s {color} | {color:green} yarn-2877 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 5s {color} | {color:green} yarn-2877 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 7s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 7s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 23s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 41s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 26 new + 222 unchanged - 1 fixed = 248 total (was 223) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 55 line(s) that end in whitespace. Use git apply --whitespace=fix. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 12s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 26s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 49s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 14s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 30s {color} | {color:green}
[jira] [Commented] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170178#comment-15170178 ] Hadoop QA commented on YARN-4731: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 11m 2s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 45s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} 
compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 12s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 26s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 40m 12s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12790248/YARN-4731.002.patch | | JIRA Issue | YARN-4731 | | Optional Tests | asflicense compile cc mvnsite javac unit | | uname | Linux 5c9152fd5ef1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d1d4e16 | | Default Java | 1.7.0_95 | | Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 | | JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10655/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10655/console | | Powered by | Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > container-executor should not follow symlinks in recursive_unlink_children > -- > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects
[jira] [Updated] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantinos Karanasos updated YARN-1011: - Attachment: patch-for-yarn-1011.patch Hi guys, as I mentioned to [~kasha] in an offline discussion we had also with [~asuresh], I recently made some very similar changes to the RM and the CapacityScheduler, which could be of use here. We implemented a system, which we call Yaq (it will be presented at EuroSys in a couple of months, so I will soon be able to share the paper as well), that allows queuing of tasks at the NMs (using some bits from YARN-2883). If you consider queuing as a way of over-committing resources, the setting has many commonalities with the current JIRA. The basic idea is that each NM advertises a number of queue slots (i.e., containers that are allowed to be queued at the NM), in order to have some tasks ready to be started immediately once resources become available. This way we can mask the allocation latency when waiting for the RM to assign and send new tasks to the NM. Then the central scheduler (the CapacityScheduler in our case) performs placement in the following way: # When there are still available resources (I hard-coded it to be <95% utilization; Karthik mentioned 80% above), the scheduling is heartbeat-driven, as it is today. # Above 95% utilization, an additional asynchronous thread orders the nodes (in the current implementation, based on the expected wait time of each node) and then starts filling up the queue slots. The reason for having this thread is that (1) we don't want a container to be given a queue slot if there is a run slot available, and (2) we want a global view of all nodes so that we favor nodes whose queues are less loaded. As you can see, the scheduling logic is very similar to what [~kasha] described above, so I think we can reuse those bits. 
What is also similar to the over-commitment described in this JIRA is that we extended the {{SchedulerNode}} so that it can account for two types of resources, namely run slots (which were already there) and queue slots. Although in the current implementation we do not over-commit resources (queued tasks start only after allocated resources become available), this just has to do with how the NM decides to treat the additional tasks it receives (and it can easily be changed to actually over-commit resources). As far as the RM/scheduler is concerned, the logic is the same: you have the allocated resources (run slots), and then you have some additional resources that are advertised by the nodes and can be used either for queuing containers (in Yaq's case) or for starting additional tasks based on the actual resource utilization (in the present JIRA's case). So, for the RM, the logic should be the same. I am attaching an initial patch that contains the classes that are related to the RM. It is against branch-2.7.1. Since I left out many classes that were specific to Yaq, the patch will not compile, but you should be able to see all the changes I made to the {{SchedulerNode}}, the {{CapacityScheduler}}, the {{*Queue}} classes, etc. I would be happy to share more code if needed. > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: patch-for-yarn-1011.patch, yarn-1011-design-v0.pdf, > yarn-1011-design-v1.pdf, yarn-1011-design-v2.pdf > > > Currently the RM allocates containers and assumes the resources allocated are > utilized. > The RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocates more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
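The two-mode placement logic described in the comment above can be sketched roughly as follows. This is an illustrative sketch only: the class and member names ({{Node}}, {{estimatedWaitMs}}, {{pickQueueSlot}}) and the 0.95 threshold constant are assumptions for exposition, not the actual Yaq or CapacityScheduler code.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the two scheduling modes described above.
// All names here are hypothetical; this is not the actual Yaq code.
class YaqPlacementSketch {
    static final double QUEUE_THRESHOLD = 0.95; // hard-coded threshold from the comment

    static class Node {
        final String host;
        final long estimatedWaitMs; // expected wait time for this node's queue
        Node(String host, long estimatedWaitMs) {
            this.host = host;
            this.estimatedWaitMs = estimatedWaitMs;
        }
    }

    /**
     * Returns null below the threshold (run slots are still available, so the
     * normal heartbeat-driven path should place the container); above it,
     * picks the node whose queue has the smallest expected wait time, so that
     * less-loaded queues are favored.
     */
    static Node pickQueueSlot(List<Node> nodes, double clusterUtilization) {
        if (clusterUtilization < QUEUE_THRESHOLD) {
            return null; // heartbeat-driven path: never queue when a run slot may be free
        }
        return nodes.stream()
                .min(Comparator.comparingLong(n -> n.estimatedWaitMs))
                .orElse(null);
    }
}
```

The asynchronous thread matters because it sees all nodes at once; a per-heartbeat decision could hand out a queue slot on a heavily loaded node while a lighter queue exists elsewhere.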
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170152#comment-15170152 ] Jun Gong commented on YARN-4720: Thanks [~mingma] for the review, suggestion and commit! > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Fix For: 2.8.0, 2.9.0 > > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch, YARN-4720.04.patch, YARN-4720.05.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
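The core idea of the fix — create the remote writer only when there is something to aggregate this cycle — can be sketched as below. The names are illustrative, not the actual {{AppLogAggregatorImpl}} API; the real change is in the patches attached above. A counter stands in for the NameNode calls so the effect of the early return is visible.

```java
import java.util.Set;

// Illustrative sketch of skipping NN operations when there is nothing to
// aggregate; not the actual AppLogAggregatorImpl code.
class LogAggregationSketch {
    int nnOps = 0; // simulated create/getfileinfo/rename/delete calls to the NN

    void uploadLogsForCycle(Set<String> pendingContainersInThisCycle) {
        // The fix: bail out before touching the remote FS when no container
        // produced new logs this cycle (long-running service case) or all
        // containers were skipped by the aggregation policy.
        if (pendingContainersInThisCycle.isEmpty()) {
            return;
        }
        nnOps++; // create the tmp log file (the LogWriter in the snippet above)
        // ... append one aggregated entry per pending container ...
        nnOps++; // rename the tmp file into place (or delete it)
    }
}
```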
[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170145#comment-15170145 ] Sangjin Lee commented on YARN-4741: --- I do see the node in question trying to get in sync with the RM on the applications it thinks it still owns. The trigger might be related to that. Still, it's not clear why the queue was still flooded with those events even after the *second* restart that disabled the NM work-preserving restart. > RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async > dispatcher event queue > --- > > Key: YARN-4741 > URL: https://issues.apache.org/jira/browse/YARN-4741 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Sangjin Lee >Priority: Critical > > We had a pretty major incident with the RM where it was continually flooded > with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event > queue. > In our setup, we had the RM HA or stateful restart *disabled*, but NM > work-preserving restart *enabled*. Due to other issues, we did a cluster-wide > NM restart. > Some time during the restart (which took multiple hours), we started seeing > the async dispatcher event queue building. Normally it would log 1,000. In > this case, it climbed all the way up to tens of millions of events. 
> When we looked at the RM log, it was full of the following messages: > {noformat} > 2016-02-18 01:47:29,530 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > {noformat} > And that node in question was restarted a few minutes earlier. > When we inspected the RM heap, it was full of > RMNodeFinishedContainersPulledByAMEvents. > Suspecting the NM work-preserving restart, we disabled it and did another > cluster-wide rolling restart. Initially that seemed to have helped reduce the > queue size, but the queue built back up to several millions and continued for > an extended period. We had to restart the RM to resolve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170144#comment-15170144 ] Hadoop QA commented on YARN-4062: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m 9s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 7s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 42s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 32s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 59s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 8s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 57s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 57s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 16s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 3 new + 210 unchanged - 1 fixed = 213 total (was 211) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 3s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 14s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 11s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | |
[jira] [Commented] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service
[ https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170135#comment-15170135 ] Hadoop QA commented on YARN-4117: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 45s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 54s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-server-tests in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 28s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 28s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 35s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 35s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-tests in the patch failed. {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 12s {color} | {color:red} hadoop-yarn-server-tests in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 12s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 4s {color} | {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 39s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 14s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.7.0_95.
[jira] [Updated] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4741: --- Priority: Critical (was: Major) > RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async > dispatcher event queue > --- > > Key: YARN-4741 > URL: https://issues.apache.org/jira/browse/YARN-4741 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Sangjin Lee >Priority: Critical
[jira] [Created] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
Sangjin Lee created YARN-4741: - Summary: RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue Key: YARN-4741 URL: https://issues.apache.org/jira/browse/YARN-4741 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sangjin Lee We had a pretty major incident with the RM where it was continually flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue. In our setup, we had the RM HA or stateful restart *disabled*, but NM work-preserving restart *enabled*. Due to other issues, we did a cluster-wide NM restart. Some time during the restart (which took multiple hours), we started seeing the async dispatcher event queue building. Normally it would log 1,000. In this case, it climbed all the way up to tens of millions of events. When we looked at the RM log, it was full of the following messages: {noformat} 2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 {noformat} And that node in question was restarted a few minutes earlier. When we inspected the RM heap, it was full of RMNodeFinishedContainersPulledByAMEvents. 
Suspecting the NM work-preserving restart, we disabled it and did another cluster-wide rolling restart. Initially that seemed to have helped reduce the queue size, but the queue built back up to several millions and continued for an extended period. We had to restart the RM to resolve the problem.
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170102#comment-15170102 ] Li Lu commented on YARN-4700: - bq. In the code that writes to the flow activity table, can we check the application status and make a decision not to write them? I'm not very familiar with the replayed event sequence, but will we receive an application finished event for each of the finished applications? If so, it will be very difficult to distinguish the real events from the replayed ones. [~Naganarasimha] what's your experience working with the SMP? Thanks! > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id.
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170103#comment-15170103 ] Arun Suresh commented on YARN-2883: --- Thanks for the patch [~kkaranasos]. A couple of comments from my initial fly-by of the patch. * I don't think we should add the additional *newContainerStatus* method in ContainerStatus, since the standard pattern I see is to use the setter. * In {{Context}}, briefly give a comment about what the Key & Value of the maps returned by *getQueuedContainers* and *getKilledQueuedContainers* are. * It may be better to create another class like {{DistributedSchedulingContext}} that exposes both the above methods; then {{Context}} can just expose {{getDistributedSchedulingContext}}, which can return null if DistSched is not enabled. * See the above comment on changes to NMContext (create a DistSchedContext and add both the new maps in there). * In {{ContainerManagerImpl}}, we should register the {{ContainerExecutionEventType}} with the dispatcher in its constructor, where most of the other dispatchers are registered. And I guess it should be there whether or not AMRMProxy is enabled (if it is not enabled, no events will simply be generated). *ContainersMonitorImpl* * Can we replace Logical with Allocated (I am guessing that is what it means)? * Again, let's move all the state related to DistributedScheduling (logicalGuarContainers, logicalOpportContainers, logicalContainersUtilization, queuedGuarRequests, queuedOpportRequests and opportContainersToKill) to a separate {{DistributedSchedulingState}} object. It will be easier to reason about how the existing code has changed. * Maybe we should move all the right shifting ( >> 20) into a utility function inside ProcessTreeInfo? *General Comments* * A couple of classes have unused imports. * We need some unit tests. Will provide more comments shortly.. 
> Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-yarn-2877.001.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. > Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers.
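The refactoring suggested in the review above — grouping the distributed-scheduling maps behind one context object that the NM {{Context}} exposes, returning null when DistSched is disabled — could take roughly the following shape. All names below are hypothetical sketches of the proposal, not actual YARN interfaces.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shape of the proposed refactoring: one object owning the
// queued/killed-queued maps, so callers test a single accessor for null
// instead of checking everywhere whether distributed scheduling is enabled.
class DistributedSchedulingContextSketch {
    // Keyed by container id; Object stands in for the container records.
    private final Map<String, Object> queuedContainers = new ConcurrentHashMap<>();
    private final Map<String, Object> killedQueuedContainers = new ConcurrentHashMap<>();

    Map<String, Object> getQueuedContainers() { return queuedContainers; }
    Map<String, Object> getKilledQueuedContainers() { return killedQueuedContainers; }
}

// The NM Context would then expose a single accessor for the whole bundle.
class ContextSketch {
    private final DistributedSchedulingContextSketch distSchedContext;

    ContextSketch(boolean distSchedEnabled) {
        this.distSchedContext =
            distSchedEnabled ? new DistributedSchedulingContextSketch() : null;
    }

    /** Null when distributed scheduling is disabled. */
    DistributedSchedulingContextSketch getDistributedSchedulingContext() {
        return distSchedContext;
    }
}
```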
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170065#comment-15170065 ] Sangjin Lee commented on YARN-4700: --- Wait, I think we're using the day timestamp for a reason, as this table is supposed to be a flow (daily) activity table. And some considerations are given to long-running apps that will cross the day boundaries. I'd like us to stick with that unless there is a compelling reason not to. In the code that writes to the flow activity table, can we check the application status and make a decision not to write them? cc [~jrottinghuis] > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone
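The duplicate-record behavior under discussion can be illustrated with a simplified row key. If the default cluster id embeds the RM start timestamp, the key for the same flow differs across restarts, so each replayed finished application gets a fresh row. The key layout below is deliberately simplified for illustration; the real flow-activity row key has more components.

```java
// Simplified flow-activity row key: clusterId!dayTimestamp!user!flowName.
// Illustrative only; the actual flow activity table key layout differs.
class FlowActivityKeySketch {
    static String rowKey(String clusterId, long dayTs, String user, String flow) {
        return clusterId + "!" + dayTs + "!" + user + "!" + flow;
    }
}
```

With a default cluster id like {{yarn-cluster_<rmStartTime>}}, the same application written before and after an RM restart yields two distinct keys, hence one extra record per restart; a stable default cluster id makes the two writes land on the same row.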
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170064#comment-15170064 ] Inigo Goiri commented on YARN-2965: --- YARN-3534 already collects the resource utilization in the NM and YARN-4055 sends them to the RM. I would like to use this JIRA to collect disk and network utilization from HADOOP-12210 and HADOOP-12211, add them to {{ResourceUtilization}} as MB/s, and let YARN-4055 take care of sending all the utilization to the RM. Not sure what is the status of the different efforts towards this direction at this point. [~vinodkv], [~kasha] thoughts? > Enhance Node Managers to monitor and report the resource usage on machines > -- > > Key: YARN-2965 > URL: https://issues.apache.org/jira/browse/YARN-2965 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Robert Grandl >Assignee: Robert Grandl > Attachments: ddoc_RT.docx > > > This JIRA is about augmenting Node Managers to monitor the resource usage on > the machine, aggregates these reports and exposes them to the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170052#comment-15170052 ] Sangjin Lee commented on YARN-4711: --- Thanks! > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch came across few NPEs > due to which NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, it was found that there was a delay in processing of events, since after YARN-3367 all the events were being processed by a single thread inside the timeline client. > Additionally, one scenario was found where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Attachment: YARN-4731.002.patch > container-executor should not follow symlinks in recursive_unlink_children > -- > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch, YARN-4731.002.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml* > drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 jobSubmitDir/ > -rwx-- 1 hdfs hadoop 5348 Feb 25 08:25 launch_container.sh* > drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 tmp/ > {noformat} -- This message was sent by Atlassian JIRA
[jira] [Comment Edited] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe edited comment on YARN-4731 at 2/26/16 10:49 PM: -- Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. The issue with patch v1 is a TOCTOU (time of check / time of use) race condition. Here is one example: 1. {{container-executor}} checks /foo to make sure that it's not a symlink; it isn't 2. An attacker moves /foo out of the way and re-creates /foo as a symlink to /etc 3. {{container-executor}} deletes /foo (which is really actually /etc at this point) The v2 version I posted avoids this race condition by using O_NOFOLLOW to open the files in step 3. Also, one note: we should also be using the {{dirfd}} and {{name}}, not {{fullpath}}. "fullpath" is purely provided for debugging and log messages. The directory could be renamed while we're traversing it; we don't want the removal to fail in this case. was (Author: cmccabe): Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. Please hold on for now... 
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Attachment: (was: YARN-4731.001.patch) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Attachment: YARN-4731.001.patch Here is a patch which changes {{recursive_unlink_children}} to skip removing symlinks. It doesn't open up a TOCTOU security issue, since it opens files with {{O_NOFOLLOW}} after doing the symlink check. I added a unit test case for {{recursive_unlink_children}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Summary: container-executor should not follow symlinks in recursive_unlink_children (was: Linux container executor fails to delete nmlocal folders) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-4062: - Attachment: YARN-4062-YARN-2928.05.patch Uploading YARN-4062-YARN-2928.05.patch after fixing the checkstyle warnings > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.04.patch, > YARN-4062-YARN-2928.05.patch, YARN-4062-YARN-2928.1.patch, > YARN-4062-feature-YARN-2928.01.patch, YARN-4062-feature-YARN-2928.02.patch, > YARN-4062-feature-YARN-2928.03.patch > > > As part of YARN-3901, coprocessor and scanner is being added for storing into > the flow_run table. It also needs a flush & compaction processing in the > coprocessor and perhaps a new scanner to deal with the data during flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169939#comment-15169939 ] Vrushali C commented on YARN-4700: -- This was a good catch, thanks [~gtCarrera], [~varun_saxena] and [~Naganarasimha]! Let me know if I can be of any help. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169930#comment-15169930 ] Varun Saxena commented on YARN-3863: Coming to the review comments, bq. I know this is happening deep inside the method, but it seems like a bit of an anti-pattern that we have to reference whether something is an application v. entity. I think we can simply move all these methods to ApplicationEntityReader as well with only application specific code. bq. This is more of a question. Is a list of multiple equality filters the same as the multi-val equality filter? If not, how are they different? Yes, they are the same. It's just that we will have to peek for the same key again and again while matching. bq. (TimelineMultiValueEqualityFilter.java) The name is a bit confusing (see above) Maybe we can change TimelineEqualityFilter to TimelineKeyValueFilter and this one to TimelineKeyValuesFilter? > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169923#comment-15169923 ] Vrushali C commented on YARN-4700: -- I see. So when events are being replayed, we are making new entries in the flow activity table since we are using the current time. Yes, I think we should use the event timestamp. It should be a simple enough fix: take the event timestamp of the CREATED or FINISHED event and use that instead of null in the HBaseTimelineWriterImpl#storeInFlowActivityTable function. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
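The fix discussed above hinges on deriving the flow activity day bucket from the event timestamp rather than the current wall-clock time. Below is a minimal, hypothetical sketch of that idea (the class and method names are illustrative, not the actual HBaseTimelineWriterImpl code): truncating the event timestamp to the top of its UTC day gives a deterministic bucket, so a CREATED or FINISHED event replayed after an RM restart lands in the same day bucket as the original write.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; names are hypothetical, not the real
// HBaseTimelineWriterImpl API.
public class FlowActivityDaySketch {
    private static final long MILLIS_PER_DAY = TimeUnit.DAYS.toMillis(1);

    // Truncate a millisecond timestamp to the top (start) of its UTC day.
    // Using the event's own timestamp here (instead of
    // System.currentTimeMillis()) makes the result identical on replay.
    static long topOfDay(long eventTsMillis) {
        return eventTsMillis - (eventTsMillis % MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        long eventTs = 1456319010019L;  // event time carried in the replayed event
        System.out.println(topOfDay(eventTs));  // prints 1456272000000
    }
}
```

With System.currentTimeMillis(), a replay on a different day would compute a different bucket and hence a different row key, which is the extra record described in this JIRA.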
[jira] [Comment Edited] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe edited comment on YARN-4731 at 2/26/16 9:40 PM: - Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. Please hold on for now... was (Author: cmccabe): Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25
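The "failed to rmdir job.jar: Not a directory" error above suggests the traversal treated a symlink (job.jar) as a directory. The real container-executor is native C code, but the principle of the fix can be sketched in Java: walk the tree without following symbolic links, so a symlinked entry is unlinked as a link rather than descended into. This is an illustrative sketch under that assumption, not the actual YARN patch.

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

// Illustrative sketch (not the actual container-executor, which is C):
// delete a tree without following symlinks, so an entry like job.jar is
// removed as a link instead of being treated as a directory.
public class SafeDelete {
    static void deleteTree(Path root) throws IOException {
        // walkFileTree does not follow symlinks unless FOLLOW_LINKS is passed,
        // so symlinked entries arrive at visitFile and are unlinked in place.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);   // deletes the link itself, never its target
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) throw exc;
                Files.delete(dir);    // directory is empty by this point
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

In C, the analogous approach is lstat/fstatat with AT_SYMLINK_NOFOLLOW plus unlinkat, which also addresses the path-swap race Colin mentions, since checks and deletes operate on directory file descriptors rather than re-resolved paths.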
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169876#comment-15169876 ] Hadoop QA commented on YARN-4062: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 21s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 2s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 57s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 53s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 8s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 54s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 54s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 22 new + 210 unchanged - 1 fixed = 232 total (was 211) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 14s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 1m 47s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-jdk1.8.0_72 with JDK v1.8.0_72 generated 13 new + 0 unchanged - 0 fixed = 13 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 3s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 13s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} |
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169845#comment-15169845 ] Varun Saxena commented on YARN-3863: I missed *TimelineStorageUtils.java*. The changes here are to apply filters locally for the FS implementation. matchRelationFilters and matchEventFilters will be used by the HBase implementation as well. > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169833#comment-15169833 ] Varun Saxena commented on YARN-3863: As the patch is quite large, to aid review I will jot down what has been done in this JIRA. # The intention is to convert filters, which were represented as maps or sets, to TimelineFilterList, which allows complex filters to be supported. For example, take config filters in the format {{cfg1=value1, cfg2=value2, cfg3=value3, cfg4=value4}}, which means all the key-value pairs must match for an entity. With the work in this JIRA, we can support complex filters such as {{(cfg1=value1 OR cfg2=value2) AND (cfg3 \!=value3 AND cfg4\!=value4)}}. # Similarly, current metric filters just check whether a certain metric exists for an entity and do not compare against metric values, for instance {{metric1 >= 40}}. This will be supported now. Now coming to the code, * *TimelineEntityFilters.java* The filter representation has been changed. All filters will now be represented as a TimelineFilterList to support complex filters with ANDs and ORs. More on what kinds of filters each filter list will hold in the next point. * *TimelinexxxFilter.java* I have added 4 new filter classes here. All of these filter classes can be put inside a TimelineFilterList to construct complex filters using ANDs and ORs, and all of them will then be converted to HBase Filters in the HBase implementation. *TimelineEqualityFilter* is meant to match key-value pairs and will be used to represent config and info filters. The key and value can be matched for either equality or inequality. *TimelineMultiValEqualityFilter* matches a key against a list/set of values; the specified values must be a subset of what the entity contains. This is used to match relations (relatesTo and isRelatedTo).
For instance, if we specify entitytype=id1,id2,id3, then for each entity we will check whether id1, id2 and id3 exist among the relations specified for that entity type. It would not matter if other ids (within the scope of the entity type) are specified as relations for the entity. *TimelineCompareFilter* - As the name suggests, it is used for comparison and represents metric filters. All relational operators, i.e. =, !=, >, >=, < and <=, are supported. *TimelineExistsFilter* - This checks whether a value exists. It is used for event filters, to represent whether an event exists, and is transformed into HBase's column qualifier filter. * *xxxEntityReader.java* These classes read from the different tables in the HBase backend and contain the primary changes for the HBase implementation. I have focused on adding ample comments in the code for this part, but as it is important, I will explain it here as well. Basically, we create an HBase filter list based on the fields, configs and metrics to retrieve (done in YARN-3862) and a filter list based on filters, which is done in this JIRA. *TimelineEntityReader* - In this class we introduce a new abstract method {{constructFilterListBasedOnFilters}}, which is implemented by derived classes to create a filter list based on filters. For a single entity read, a filter list based on filters does not make sense, so the filter list created will only be based on fields. For multiple entity reads, though, we create a new filter list containing filters and fields together. *GenericEntityReader* - The changes here are for the entity table, plus some common code used by ApplicationEntityReader as well. In {{constructFilterListBasedOnFilters}}, HBase filter lists are created for the created-time range and for config, info and metric filters. Relation-based filters and event filters cannot be added directly here because of the way events and relations are stored in the backend HBase storage: we cannot apply a SingleColumnValueFilter to filter out rows.
So for them, we add filters to fetch only the columns we require to match these filters locally. This is done only if these fields are not mentioned in fieldsToRetrieve. For example, if event filters come in as (event1, event2) and the fields to retrieve do not mention EVENTS or ALL, I will read all event columns corresponding to event1 and event2 for the filtered rows. This reduces the amount of data retrieved from the backend, especially for events, since the number of events can be quite large. The code for this is in the method {{constructFilterListBasedOnFields}}. Now coming to the new methods which have been added: {{fetchPartialColsFromInfoFamily}} checks whether we need to get some of the columns for relations and events from the backend; this depends on the condition explained above. {{createFilterListForColsOfInfoFamily}} is called if the above condition is true. Here the idea is to add each column under the INFO column family which is in the fields to retrieve or has to be added because we want to match certain relation
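The filter-list design described in this comment can be sketched as a toy model. The code below is illustrative only; the real Hadoop classes (TimelineFilterList and the TimelinexxxFilter family) have richer types, but the core idea is the same: a filter list combines child filters with a single AND or OR operator, and lists can nest to build expressions like (cfg1=value1 OR cfg2=value2) AND (cfg3!=value3 AND cfg4!=value4).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of the TimelineFilterList idea; not the actual Hadoop API.
public class FilterListSketch {
    enum Op { AND, OR }

    // Key-value equality/inequality filter, analogous to TimelineEqualityFilter.
    static Predicate<Map<String, String>> keyValue(String key, String value, boolean equal) {
        return cfg -> equal == value.equals(cfg.get(key));
    }

    // Filter list: combines child filters with one operator; lists can nest.
    @SafeVarargs
    static Predicate<Map<String, String>> list(Op op, Predicate<Map<String, String>>... filters) {
        return cfg -> {
            for (Predicate<Map<String, String>> f : filters) {
                boolean r = f.test(cfg);
                if (op == Op.AND && !r) return false;  // AND short-circuits on failure
                if (op == Op.OR && r) return true;     // OR short-circuits on success
            }
            return op == Op.AND;  // AND: all passed; OR: none passed
        };
    }

    public static void main(String[] args) {
        // (cfg1=value1 OR cfg2=value2) AND (cfg3!=value3 AND cfg4!=value4)
        Predicate<Map<String, String>> expr = list(Op.AND,
            list(Op.OR,  keyValue("cfg1", "value1", true),  keyValue("cfg2", "value2", true)),
            list(Op.AND, keyValue("cfg3", "value3", false), keyValue("cfg4", "value4", false)));

        Map<String, String> cfg = new HashMap<>();
        cfg.put("cfg1", "value1");  // satisfies the OR branch
        cfg.put("cfg3", "other");   // cfg3 != value3
        cfg.put("cfg4", "other");   // cfg4 != value4
        System.out.println(expr.test(cfg));  // true
    }
}
```

In the HBase implementation these predicates are instead translated into HBase Filter/FilterList objects so most of the matching happens server-side; the local evaluation above corresponds to what the FS implementation (and the relation/event matching) does.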
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169832#comment-15169832 ] Hudson commented on YARN-4723: -- FAILURE: Integrated in Hadoop-trunk-Commit #9376 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9376/]) YARN-4723. NodesListManager$UnknownNodeId ClassCastException. (jlowe: rev 6b0f813e898cbd14b2ae52ecfed6d30bce8cb6b7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java * hadoop-yarn-project/CHANGES.txt > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Fix For: 2.7.3 > > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at 
java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4671) There is no need to acquire CS lock when completing a container
[ https://issues.apache.org/jira/browse/YARN-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169785#comment-15169785 ] Hadoop QA commented on YARN-4671: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | 
{color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 1 new + 104 unchanged - 1 fixed = 105 total (was 105) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 22s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 163m 44s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169764#comment-15169764 ] Naganarasimha G R commented on YARN-4700: - Yes, these were the JIRAs where we discussed it, but the [conclusion|https://issues.apache.org/jira/browse/YARN-4392?focusedCommentId=15033961=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15033961] was that we were OK with republishing the events with the exact data rather than not publishing at all, because it is not guaranteed that ATS events for apps in the state store were successfully published. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169756#comment-15169756 ] Naganarasimha G R commented on YARN-4700: - Thanks [~varun_saxena], I was also thinking along the same lines of using the event's timestamp rather than the current timestamp... > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169753#comment-15169753 ] Naganarasimha G R commented on YARN-4736: - Thanks for the analysis [~sjlee0], bq. Both the thread dump and the HBase exception log are from the client process (NM side), correct? Yes, both are client-side exceptions (i.e. NM). I am not sure issue 2 is related to issue 1, but based on your explanation it seems to be. bq. some time after that, it looks like you issued a signal to stop the client process (NM)? Yes, the error log came (00:39:03) a significant amount of time after the completion of the job (00:02:28), and it was a significant time after that when I tried to stop the NM @ 01:09:19. Even when I try to stop the NM immediately after job completion, I am able to see this issue (the NM not stopping completely). bq. That's why I thought there seems to be a HBase bug that is causing the flush operation to be wedged in this state. At least that explains why you were not able to shut down the collector (and therefore NM). Yes, maybe; to confirm, are the server-side logs required? I am using *HBase - Version 1.0.2*. I can share any other information required too. cc/ [~te...@apache.org]. > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: hbaseException.log, threaddump.log > > > Faced some issues while running ATSv2 in single node Hadoop cluster and in > the same node had launched Hbase with embedded zookeeper. > # Due to some NPE issues i was able to see NM was trying to shutdown, but the > NM daemon process was not completed due to the locks. 
> # Got some exception related to Hbase after application finished execution > successfully. > will attach logs and the trace for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169734#comment-15169734 ] Varun Saxena commented on YARN-4700: When we recover an app, we can also get its start and finish times from the state store; these are then set in RMAppImpl. So when the event is replayed and reported to ATSv2, the app start event would carry the same time as when the app originally started. Currently, for the flow activity table, the inverted top-of-the-day timestamp is generated based on the current system time. I think we can instead use the timestamp coming in the event. That should resolve this issue. cc [~sjlee0], [~vrushalic]. I do not see any issue in using the event's timestamp. Do you see any? > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
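The fix Varun proposes above can be sketched as follows. This is a hypothetical illustration (the class and method names are invented, not ATSv2's actual helpers): if the inverted top-of-the-day timestamp is derived from the event's own timestamp rather than from System.currentTimeMillis(), a replayed app-start event maps to the same flow activity row-key component as the original write, so no duplicate record is created on RM restart.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch (names invented, not ATSv2's actual helpers) of
// deriving the inverted top-of-the-day timestamp from the event's own
// timestamp instead of the current system time.
public class FlowActivityTimestampSketch {

    private static final long MILLIS_PER_DAY = TimeUnit.DAYS.toMillis(1);

    /** Truncate a millisecond timestamp to the start of its (UTC) day. */
    public static long topOfDay(long ts) {
        return ts - (ts % MILLIS_PER_DAY);
    }

    /** Invert so that more recent days sort first in an HBase row-key scan. */
    public static long invertedTopOfDay(long ts) {
        return Long.MAX_VALUE - topOfDay(ts);
    }

    public static void main(String[] args) {
        long appStartTime = 1456444800123L;   // time carried in the app start event
        long replayedTime = appStartTime;     // replay after RM restart keeps it
        // Same row-key component both times, so no extra record is written.
        System.out.println(invertedTopOfDay(appStartTime)
            == invertedTopOfDay(replayedTime));  // prints "true"
    }
}
```

Had the second computation used the system clock at replay time instead, the two values would generally differ once the restart crosses a day boundary, which is exactly the duplicate-record symptom described in the issue.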
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169733#comment-15169733 ] Jason Lowe commented on YARN-4723: -- +1 committing this. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
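For context, the failing cast in the stack trace above can be reproduced in miniature. The sketch below uses hypothetical, heavily simplified stand-ins for the real classes: a PB-backed record blindly downcasts its abstract field to the protobuf implementation, which throws ClassCastException when handed a plain (non-protobuf) subclass such as NodesListManager$UnknownNodeId.

```java
// Hypothetical, simplified stand-ins for the real YARN classes,
// illustrating why the cast in NodeReportPBImpl.mergeLocalToBuilder fails.
abstract class NodeId {
    abstract String getHost();
}

class NodeIdPBImpl extends NodeId {              // protobuf-backed implementation
    private final String host;
    NodeIdPBImpl(String host) { this.host = host; }
    String getHost() { return host; }
}

class UnknownNodeId extends NodeId {             // plain subclass, not protobuf-backed
    String getHost() { return "unknown"; }
}

class NodeReportPBImpl {
    private NodeId nodeId;
    void setNodeId(NodeId id) { this.nodeId = id; }

    String mergeLocalToBuilder() {
        // The unchecked downcast mirrors the line in the stack trace;
        // it throws ClassCastException for any non-PB NodeId subclass.
        return ((NodeIdPBImpl) nodeId).getHost();
    }
}
```

The sketch shows why the fix belongs on the RM side: any NodeId handed to the PB conversion path must either be a PBImpl or be converted to one first.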
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169721#comment-15169721 ] Varun Saxena commented on YARN-4700: Hmm... I think the JIRAs are YARN-3127 and YARN-4392. Let me see how we can solve this problem. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service
[ https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-4117: --- Attachment: YARN-4117.v1.patch > End to end unit test with mini YARN cluster for AMRMProxy Service > - > > Key: YARN-4117 > URL: https://issues.apache.org/jira/browse/YARN-4117 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: Kishore Chaliparambil >Assignee: Giovanni Matteo Fumarola > Attachments: YARN-4117.v0.patch, YARN-4117.v1.patch > > > YARN-2884 introduces a proxy between AM and RM. This JIRA proposes an end to > end unit test using mini YARN cluster to the AMRMProxy service. This test > will validate register, allocate and finish application and token renewal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169686#comment-15169686 ] Eric Badger commented on YARN-4686: --- TestYarnCLI, TestAMRMClient, TestYarnClient, TestNMClient, and TestGetGroups are failing in multiple recent precommit builds [YARN-4117], [YARN-4630], [YARN-4676]. TestMiniYarnClusterNodeUtilization is tracked by [YARN-4566]. TestContainerManagerSecurity is failing on other recent precommit builds [YARN-4117], [YARN-4566]. The only failure related to this patch is TestApplicationClientProtocolOnHA. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169667#comment-15169667 ] Naganarasimha G R commented on YARN-4711: - Hi [~sjlee0], so far I have only experimented with running the sleep job with a varied number of mappers and varied sleep times. I will try to measure the latency in {{TimelineClientImpl.putObjects}} by adding some logging and will share the figures. > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch, I came across a few NPEs > due to which the NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) 
> at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, I found that there was a delay in the processing of events: after > YARN-3367, all the events were getting processed by a single thread inside the > timeline client. > Additionally, I found one scenario where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe edited comment on YARN-4731 at 2/26/16 7:47 PM: - Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. was (Author: cmccabe): -1. This introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> >
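The "failed to rmdir job.jar: Not a directory" error above stems from job.jar being a symlink into the shared filecache: a traversal that follows the link ends up trying to rmdir a path that resolves through a symlink. A hedged Java sketch of the symlink-safe approach (the actual container-executor is C code, and this is an illustration of the principle, not that implementation): delete the link itself and never descend through it, so the filecache target survives.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

// Sketch of deleting a container work directory without following
// symlinks such as job.jar, which points into the shared filecache.
public class SymlinkSafeDelete {

    static void deleteTree(Path root) throws IOException {
        // A symlink (even to a directory) is removed as a link, never
        // descended into; NOFOLLOW_LINKS keeps the directory check honest.
        if (Files.isSymbolicLink(root)
                || !Files.isDirectory(root, LinkOption.NOFOLLOW_LINKS)) {
            Files.deleteIfExists(root);   // removes the link or file itself
            return;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                deleteTree(entry);        // recurse; links are handled above
            }
        }
        Files.delete(root);               // directory is now empty
    }
}
```

The design point is the same one behind Colin's fstat suggestion in the comment above: decide whether an entry is a directory from the entry itself (without following links), not from what its name currently resolves to.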
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169640#comment-15169640 ] Vrushali C commented on YARN-4062: -- Attaching patch YARN-4062-YARN-2928.04.patch rebased to the right head. Thanks Sangjin Lee for the suggestion. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.04.patch, > YARN-4062-YARN-2928.1.patch, YARN-4062-feature-YARN-2928.01.patch, > YARN-4062-feature-YARN-2928.02.patch, YARN-4062-feature-YARN-2928.03.patch > > > As part of YARN-3901, a coprocessor and scanner are being added for storing into > the flow_run table. It also needs flush & compaction processing in the > coprocessor, and perhaps a new scanner to deal with the data during the flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-4062: - Attachment: YARN-4062-YARN-2928.04.patch Attaching patch rebased to the right head. Thanks [~sjlee0] for the suggestion. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.04.patch, > YARN-4062-YARN-2928.1.patch, YARN-4062-feature-YARN-2928.01.patch, > YARN-4062-feature-YARN-2928.02.patch, YARN-4062-feature-YARN-2928.03.patch > > > As part of YARN-3901, a coprocessor and scanner are being added for storing into > the flow_run table. It also needs flush & compaction processing in the > coprocessor, and perhaps a new scanner to deal with the data during the flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169624#comment-15169624 ] Eric Badger commented on YARN-4686: --- I am unable to reproduce any of the JUnit timeout failures (TestYarnCLI, TestAMRMClient, TestYarnClient, TestNMClient) locally via trunk or via trunk with my patch added. TestMiniYarnClusterNodeUtilization fails locally in both trunk and with my patch. TestContainerManagerSecurity passes locally in both trunk and with my patch. TestGetGroups has an Ignore annotation, so we can probably ignore that error. TestApplicationClientProtocolOnHA passes locally on trunk and fails locally with my patch. However, a non-deterministic number of tests fail with the same error in the initialization code. {noformat} java.lang.AssertionError: NMs failed to connect to the RM at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.client.ProtocolHATestBase.verifyConnections(ProtocolHATestBase.java:219) at org.apache.hadoop.yarn.client.ProtocolHATestBase.startHACluster(ProtocolHATestBase.java:284) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.initiate(TestApplicationClientProtocolOnHA.java:54) {noformat} > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! 
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe commented on YARN-4731: -1. This introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml* > drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 jobSubmitDir/ > -rwx-- 1 hdfs hadoop 5348 Feb 25 08:25 launch_container.sh*
[jira] [Commented] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service
[ https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169591#comment-15169591 ] Giovanni Matteo Fumarola commented on YARN-4117: [~jianhe], thanks for reviewing the patch. 1. Sure; 2. Sure, I will check it. 3. Because I can customize CustomContainerManagerImpl and AMRMProxyService. MiniYarnCluster allocates the ports for the RM during the start phase, and there was no way to pass that information to the AMRMProxy. For this reason I save the value of the scheduler address, and in the customization I tell the AMRMProxy which address it will connect to. 4. The purpose of testE2ETokenSwap is to verify that the AMRMProxy swapped the security token and that the AM cannot submit the job directly to the RM. I will update the patch according to the feedback. > End to end unit test with mini YARN cluster for AMRMProxy Service > - > > Key: YARN-4117 > URL: https://issues.apache.org/jira/browse/YARN-4117 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: Kishore Chaliparambil >Assignee: Giovanni Matteo Fumarola > Attachments: YARN-4117.v0.patch > > > YARN-2884 introduces a proxy between AM and RM. This JIRA proposes an end-to-end > unit test using a mini YARN cluster for the AMRMProxy service. This test > will validate register, allocate and finish application and token renewal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4481) negative pending resource of queues lead to applications in accepted status inifnitly
[ https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169550#comment-15169550 ] Eric Payne commented on YARN-4481: -- Not sure if this is related, but we are also seeing similar results in 2.7 for reserved containers: {noformat} "name" : "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=bigmem", ... "ReservedMB" : -6553600, "ReservedVCores" : -8000, "ReservedContainers" : -800, ... {noformat} > negative pending resource of queues lead to applications in accepted status > inifnitly > - > > Key: YARN-4481 > URL: https://issues.apache.org/jira/browse/YARN-4481 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.2 >Reporter: gu-chi >Priority: Critical > Attachments: jmx.txt > > > Met a scenario of negative pending resources with the capacity scheduler; in jmx, > it shows: > {noformat} > "PendingMB" : -4096, > "PendingVCores" : -1, > "PendingContainers" : -1, > {noformat} > full jmx information attached. > This is not just a jmx UI issue; the actual pending resource of the queue is also > negative, as I see the debug log of > bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because > it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY > node-partition= | ParentQueue.java > This leads to the {{NULL_ASSIGNMENT}}. > The background: hundreds of applications were submitted, consuming all cluster > resources, and reservations happened. While running, network faults were injected by > a tool; the injection types were delay, jitter, > repeat, packet loss and disorder. Then most of the submitted applications > were killed. > Is anyone else also seeing negative pending resources, or does anyone have an idea of how this happens? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169531#comment-15169531 ] Li Lu commented on YARN-4700: - I think the redundant events are coming from the work-preserving RM restart, where the RM tries to "replay" application lifecycle events from the state store. I don't remember the JIRA number for fixing this for SMP (but I do remember [~Naganarasimha] was involved in the discussion), but it seems the conclusion was to handle this on the SMP/storage side rather than the RM side. For us, most of the tables are fine, but for the flow activity table we need to distinguish a "real" activity from a replayed one. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169469#comment-15169469 ] Sangjin Lee commented on YARN-4711: --- What I meant is the actual put call in {{TimelineClientImpl}}. > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch came across few NPEs > due to which NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, we found that there was a delay in the processing of events: after > YARN-3367, all the events were getting processed by a single thread inside the > timeline client. > Additionally, we found one scenario where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
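The stack traces above point at dereferences of a timeline context that has already been removed when a late container event arrives. A minimal sketch of the defensive shape such a fix could take — the class and method names below are illustrative, not the actual NMTimelinePublisher code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TimelineEventGuard {
    // Stand-in for the per-application timeline contexts held by the publisher.
    private final Map<String, Object> appContexts = new ConcurrentHashMap<>();

    public void addApp(String appId) { appContexts.put(appId, new Object()); }
    public void removeApp(String appId) { appContexts.remove(appId); }

    // Returns true when the event was published, false when it was dropped
    // because the application's context is already gone (e.g. the event was
    // delayed behind the single-threaded client and the app has finished).
    public boolean handleContainerEvent(String appId) {
        Object ctx = appContexts.get(appId);
        if (ctx == null) {
            return false; // drop and log instead of NPE-ing the dispatcher
        }
        // ... publish the entity using ctx ...
        return true;
    }

    public static void main(String[] args) {
        TimelineEventGuard g = new TimelineEventGuard();
        g.addApp("application_1");
        System.out.println(g.handleContainerEvent("application_1")); // true
        g.removeApp("application_1");
        System.out.println(g.handleContainerEvent("application_1")); // false
    }
}
```

The key point is that dropping the event keeps the shared AsyncDispatcher thread alive; the real fix would also need to decide whether a dropped event should be retried or surfaced as a metric.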
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169462#comment-15169462 ] Sangjin Lee commented on YARN-4711: --- I know it might take some effort to get the data, but do you have some idea what type of latency you're seeing with each put call? > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch came across few NPEs > due to which NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) > at > 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, we found that there was a delay in the processing of events: after > YARN-3367, all the events were getting processed by a single thread inside the > timeline client. > Additionally, we found one scenario where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169452#comment-15169452 ] Sangjin Lee commented on YARN-4712: --- Yes, at least for now we should emit the CPU as a percentage (i.e. multiply by 100). As for the "unavailable", can we not simply skip sending the CPU if the value is equal to unavailable? > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > There are 2 issues with CPU usage collection: > * I was able to observe that many times the CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor does > the calculation, i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered, so proper checks need to be added. > * {{EntityColumnPrefix.METRIC}} always uses a LongConverter, but > ContainerMonitor publishes decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
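The two suggestions above — propagate UNAVAILABLE instead of dividing it, and scale the decimal percentage into a long for the long-only metric converter — can be sketched as a hypothetical helper (not the actual ContainersMonitor/NMTimelinePublisher code):

```java
public class CpuUsageGuard {
    // Mirrors ResourceCalculatorProcessTree.UNAVAILABLE.
    public static final float UNAVAILABLE = -1.0F;

    // Propagate UNAVAILABLE rather than dividing it by the core count,
    // so the downstream UNAVAILABLE check can actually fire and the
    // publisher can skip sending the metric entirely.
    public static float totalCoresPercent(float perCorePercent, int numProcessors) {
        if (perCorePercent == UNAVAILABLE || numProcessors <= 0) {
            return UNAVAILABLE;
        }
        return perCorePercent / numProcessors;
    }

    // Scale the decimal percentage by 100 and round, so the value fits a
    // long-only converter while keeping two decimal digits of precision.
    public static long toLongMetric(float percent) {
        return Math.round(percent * 100.0F);
    }

    public static void main(String[] args) {
        System.out.println(totalCoresPercent(UNAVAILABLE, 8)); // -1.0: caller skips
        System.out.println(totalCoresPercent(400.0F, 8));      // 50.0
        System.out.println(toLongMetric(50.0F));               // 5000
    }
}
```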
[jira] [Updated] (YARN-4671) There is no need to acquire CS lock when completing a container
[ https://issues.apache.org/jira/browse/YARN-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-4671: Attachment: YARN-4671.2.patch Rebased against trunk > There is no need to acquire CS lock when completing a container > --- > > Key: YARN-4671 > URL: https://issues.apache.org/jira/browse/YARN-4671 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-4671.1.patch, YARN-4671.2.patch > > > In YARN-4519, we discovered that there is no need to acquire the CS lock in > CS#completedContainerInternal, because: > * Access to the critical section is already guarded by the queue lock. > * It is not essential to guard {{schedulerHealth}} with the CS lock in > completedContainerInternal. All maps in schedulerHealth are concurrent maps. > Even if schedulerHealth is not consistent at the moment, it will be > eventually consistent. > With this fix, we can truly claim that CS#allocate doesn't require the CS lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169407#comment-15169407 ] Daniel Templeton commented on YARN-4723: The patch seems fine to me. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
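The root cause in the trace above is a hard cast of every NodeId to NodeIdPBImpl. One common defensive pattern — sketched here with plain stand-in classes, not the real YARN records — is to check the runtime type and rebuild the PB-backed record from the interface's accessors when the implementation is foreign, such as the RM-internal UnknownNodeId:

```java
public class NodeIdMergeSketch {
    // Stand-ins for the YARN record interface and its implementations.
    interface NodeId { String getHost(); int getPort(); }

    static final class NodeIdPBImpl implements NodeId {
        private final String host; private final int port;
        NodeIdPBImpl(String host, int port) { this.host = host; this.port = port; }
        public String getHost() { return host; }
        public int getPort() { return port; }
    }

    static final class UnknownNodeId implements NodeId {
        public String getHost() { return "unknown"; }
        public int getPort() { return -1; }
    }

    // Convert any NodeId implementation to the PB form without a raw cast,
    // so merging a report never throws ClassCastException.
    static NodeIdPBImpl toPBImpl(NodeId id) {
        if (id instanceof NodeIdPBImpl) {
            return (NodeIdPBImpl) id;  // fast path: already PB-backed
        }
        return new NodeIdPBImpl(id.getHost(), id.getPort()); // rebuild safely
    }

    public static void main(String[] args) {
        System.out.println(toPBImpl(new UnknownNodeId()).getHost()); // unknown
    }
}
```

The actual patch may instead change how UnknownNodeId is constructed; this only illustrates the cast-free conversion idea.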
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169328#comment-15169328 ] Sangjin Lee commented on YARN-4736: --- Both the thread dump and the HBase exception log are from the client process (NM side), correct? I believe so because both are showing HBase client stack traces. Could you confirm? Since these are both from the client side, I am not sure what was going on on the HBase server side. Also, I'm not sure why HBase started refusing connections as evidenced by your exception log (that's why I assumed that HBase process might have gone away at that point). Here is how I put together the sequence of events: - at some point the HBase process starts refusing connections (hbaseException.log) - the periodic flush gets trapped in this bad state, finally logging the {{RetriesExhaustedException}} after 36 minutes {noformat} 2016-02-26 00:02:28,270 INFO org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: The collector service for application_1456425026132_0001 was removed 2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: Failed to get region location org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions: {noformat} - some time after that, it looks like you issued a signal to stop the client process (NM)? {noformat} 2016-02-26 01:09:19,799 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM {noformat} - but the service stop fails to shut down the periodic flush task thread {noformat} 2016-02-26 01:09:50,035 WARN org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: failed to stop the flusher task in time. will still proceed to close the writer. 
2016-02-26 01:09:50,035 INFO org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl: closing the entity table {noformat} - at this point the NM process is hung, because the flush is stuck while holding the {{BufferedMutatorImpl}} lock, and the closing of the entity table needs to acquire that lock. That's why I thought there seems to be an HBase bug that is causing the flush operation to be wedged in this state. At least that explains why you were not able to shut down the collector (and therefore the NM). > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: hbaseException.log, threaddump.log > > > Faced some issues while running ATSv2 in a single-node Hadoop cluster, on which HBase > with an embedded ZooKeeper was also launched. > # Due to some NPE issues I could see the NM trying to shut down, but the > NM daemon process did not exit due to the locks. > # Got some exceptions related to HBase after the application finished execution > successfully. > Will attach the logs and the trace for the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
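The shutdown hang described above is a bounded-wait problem: the collector tries to stop the flusher task, gives up after a timeout, and then blocks on the writer close. A sketch of a bounded stop using a plain ScheduledExecutorService follows; it is illustrative only (the actual TimelineCollectorManager flusher may differ), and note that shutdownNow() cannot unstick a flush that ignores interrupts while holding the BufferedMutator lock — it only keeps the stop path itself from hanging:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FlusherStopSketch {
    // Stop the periodic flush task with a bounded wait, then force shutdown
    // so a wedged flush cannot block the stop path indefinitely. Returns
    // false when the flusher did not stop in time (caller logs a warning
    // and proceeds to close the writer, as in the log above).
    public static boolean stopFlusher(ScheduledExecutorService flusher, long timeoutSec) {
        flusher.shutdown();                  // stop scheduling new flushes
        try {
            if (!flusher.awaitTermination(timeoutSec, TimeUnit.SECONDS)) {
                flusher.shutdownNow();       // interrupt an in-flight flush
                return false;
            }
        } catch (InterruptedException e) {
            flusher.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ScheduledExecutorService es = Executors.newSingleThreadScheduledExecutor();
        System.out.println(stopFlusher(es, 5)); // idle pool stops cleanly: true
    }
}
```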
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169311#comment-15169311 ] Hudson commented on YARN-4720: -- FAILURE: Integrated in Hadoop-trunk-Commit #9374 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9374/]) YARN-4720. Skip unnecessary NN operations in log aggregation. (Jun Gong (mingma: rev 7f3139e54da2c496327446a5eac43f8421fc8839) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/CHANGES.txt > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch, YARN-4720.04.patch, YARN-4720.05.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. 
> if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
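The shape of the fix discussed above — decide whether the cycle needs a remote writer at all before creating it — can be sketched as a hypothetical helper (names are illustrative, not the actual AppLogAggregatorImpl code):

```java
import java.util.Collections;
import java.util.Set;

public class LogAggregationSkipSketch {
    // Check whether this upload cycle has any containers to aggregate
    // before creating the remote writer, avoiding the unnecessary
    // create/getfileinfo/delete NameNode calls for an empty cycle.
    public static boolean shouldUpload(Set<String> pendingContainersThisCycle) {
        return !pendingContainersThisCycle.isEmpty();
    }

    public static void main(String[] args) {
        // Empty cycle (e.g. a long-running service with no new logs,
        // or a policy that skips every container): no writer, no NN calls.
        System.out.println(shouldUpload(Collections.emptySet()));             // false
        // At least one container produced logs: create the writer and upload.
        System.out.println(shouldUpload(Collections.singleton("container_1"))); // true
    }
}
```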
[jira] [Commented] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled
[ https://issues.apache.org/jira/browse/YARN-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169265#comment-15169265 ] Hadoop QA commented on YARN-4465: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 43s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | 
{color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 0 new + 20 unchanged - 1 fixed = 20 total (was 21) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 53s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 37s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 158m 55s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12790128/0007-YARN-4465.patch
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169233#comment-15169233 ] Kuhu Shukla commented on YARN-4723: --- This is for the 2.7 patch. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169183#comment-15169183 ] Kuhu Shukla commented on YARN-4723: --- Findbugs: {code} Bug type SIC_INNER_SHOULD_BE_STATIC (click for details) In class org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKSyncOperationCallback At ZKRMStateStore.java:[lines 118-127] {code} Checkstyle warnings are unrelated. Same thing with asf license warnings. Test failures are known issues. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4735) Remove stale LogAggregationReport from NM's context
[ https://issues.apache.org/jira/browse/YARN-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169088#comment-15169088 ] Jun Gong commented on YARN-4735: Thanks [~mingma]. I checked the code, and it does not seem to be an issue; I'll verify it again. [~kasha] Maybe the issue you met is not the same? > Remove stale LogAggregationReport from NM's context > --- > > Key: YARN-4735 > URL: https://issues.apache.org/jira/browse/YARN-4735 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jun Gong >Assignee: Jun Gong > > {quote} > All LogAggregationReports (current and previous) are only added to > *context.getLogAggregationStatusForApps* and never removed. > So for a long-running service, the LogAggregationReport list the NM sends to the RM > will grow over time. > {quote} > Per the discussion in YARN-4720, we need to remove stale LogAggregationReports from the > NM's context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
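The cleanup being proposed can be sketched as follows. The names are hypothetical: the real NM context keys reports by ApplicationId, and the removal would be driven by the RM acknowledging receipt of an application's final report:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LogAggregationReportCleanup {
    // Once the RM has received an application's LogAggregationReport, drop
    // it from the NM context map so the per-heartbeat report list stays
    // bounded for long-running services. Returns how many were removed.
    public static int removeReported(Map<String, String> reports,
                                     Iterable<String> ackedApps) {
        int removed = 0;
        for (String appId : ackedApps) {
            if (reports.remove(appId) != null) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, String> reports = new ConcurrentHashMap<>();
        reports.put("app_1", "SUCCEEDED");
        reports.put("app_2", "RUNNING");
        System.out.println(removeReported(reports, java.util.Arrays.asList("app_1"))); // 1
        System.out.println(reports.size()); // 1: only the still-running app remains
    }
}
```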
[jira] [Commented] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169073#comment-15169073 ] Bibin A Chundatt commented on YARN-4731: [~jlowe]/[~vvasudev] Tried the same scenarios on branch 2.7.2; the signal error doesn't exist there. The signal error and the container initialization exception are not causing any task failures. If there is any scope for improvement, we can raise a new JIRA; there is no need to handle it as part of this one. The localization issue is fixed. +1 (non-binding) > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255.
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> >
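The "failed to rmdir job.jar: Not a directory" error above stems from `job.jar` being a symlink to a localized directory: a directory-removal call on the link itself fails. The actual fix is in the C container-executor, but the semantics can be illustrated with a minimal, hypothetical Java sketch (class and method names are illustrative only): deleting a symlink must remove only the link, never recurse into or rmdir its target.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SymlinkDeleteDemo {
    // Delete an entry; if it is a symlink, remove only the link itself.
    static void deleteEntry(Path p) throws IOException {
        if (!Files.isSymbolicLink(p) && Files.isDirectory(p)) {
            // Collect children first, then recurse.
            List<Path> children = new ArrayList<>();
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(p)) {
                for (Path c : ds) {
                    children.add(c);
                }
            }
            for (Path c : children) {
                deleteEntry(c);
            }
        }
        // Files.delete on a symlink removes the link, never its target.
        Files.delete(p);
    }

    public static void main(String[] args) throws IOException {
        Path appDir = Files.createTempDirectory("appcache");
        Path filecache = Files.createDirectory(appDir.resolve("filecache"));
        // job.jar in the container dir is a symlink to a localized directory.
        Path link = Files.createSymbolicLink(appDir.resolve("job.jar"), filecache);
        deleteEntry(link);
        System.out.println("link removed: "
            + Files.notExists(link, LinkOption.NOFOLLOW_LINKS));
        System.out.println("target intact: " + Files.isDirectory(filecache));
        deleteEntry(appDir); // cleanup
    }
}
```

Treating the link as a directory (as the failing `rmdir` path did) hits exactly the errno 20 "Not a directory" class of failure shown in the log.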
[jira] [Commented] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled
[ https://issues.apache.org/jira/browse/YARN-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169036#comment-15169036 ] Bibin A Chundatt commented on YARN-4465: [~leftnoteasy] Thank you for the review comments. {quote} Could you avoid show all cluster labels in exception when label doesn't exist in cluster? {quote} Done. {quote} And could you explain why change of TestKillApplicationWithRMHA is needed? {quote} Sorry, that change was made while debugging the test case; it has been excluded from the latest patch. > SchedulerUtils#validateRequest for Label check should happen only when > nodelabel enabled > > > Key: YARN-4465 > URL: https://issues.apache.org/jira/browse/YARN-4465 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-4465.patch, 0002-YARN-4465.patch, > 0003-YARN-4465.patch, 0004-YARN-4465.patch, 0006-YARN-4465.patch, > 0007-YARN-4465.patch > > > Disable label from rm side yarn.nodelabel.enable=false > Capacity scheduler label configuration for queue is available as below > default label for queue = b1 as 3 and accessible labels as 1,3 > Submit application to queue A . > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): > Invalid resource request, queue=b1 doesn't have permission to access all > labels in resource request. labelExpression of resource request=3. 
Queue > labels=1,3 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:401) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:340) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:602) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:247) > {noformat} > # Ignore default label expression when label is disabled *or* > # NormalizeResourceRequest we can set label expression to > when node label is not enabled *or* > # Improve message -- This message was sent by Atlassian JIRA (v6.3.4#6332)
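The first two options listed in the description amount to the same behavior: when node labels are disabled cluster-wide, normalization should drop the (queue-default) label expression rather than validate it. A minimal, hypothetical sketch of that guard (names are illustrative; the real logic lives in SchedulerUtils#normalizeAndValidateRequest and is driven by YARN configuration):

```java
// Hypothetical sketch of the proposed guard; not the actual SchedulerUtils
// code. When node labels are disabled, the request is treated as unlabeled
// instead of being validated against queue-accessible labels.
public class LabelNormalizationSketch {
    static final String NO_LABEL = "";

    static String normalizeLabelExpression(boolean nodeLabelsEnabled,
                                           String requestedExpression) {
        if (!nodeLabelsEnabled) {
            // Skip the label check entirely: clear the expression, so no
            // InvalidResourceRequestException can be raised for it.
            return NO_LABEL;
        }
        return requestedExpression == null ? NO_LABEL : requestedExpression;
    }

    public static void main(String[] args) {
        // yarn.nodelabel.enable=false with queue default label "3":
        // the expression is simply cleared, no exception.
        System.out.println("'" + normalizeLabelExpression(false, "3") + "'");
        // Labels enabled: the expression passes through to validation.
        System.out.println("'" + normalizeLabelExpression(true, "3") + "'");
    }
}
```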
[jira] [Updated] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled
[ https://issues.apache.org/jira/browse/YARN-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4465: --- Attachment: 0007-YARN-4465.patch > SchedulerUtils#validateRequest for Label check should happen only when > nodelabel enabled > > > Key: YARN-4465 > URL: https://issues.apache.org/jira/browse/YARN-4465 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-4465.patch, 0002-YARN-4465.patch, > 0003-YARN-4465.patch, 0004-YARN-4465.patch, 0006-YARN-4465.patch, > 0007-YARN-4465.patch > > > Disable label from rm side yarn.nodelabel.enable=false > Capacity scheduler label configuration for queue is available as below > default label for queue = b1 as 3 and accessible labels as 1,3 > Submit application to queue A . > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): > Invalid resource request, queue=b1 doesn't have permission to access all > labels in resource request. labelExpression of resource request=3. 
Queue > labels=1,3 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:401) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:340) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:602) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:247) > {noformat} > # Ignore default label expression when label is disabled *or* > # NormalizeResourceRequest we can set label expression to > when node label is not enabled *or* > # Improve message -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168928#comment-15168928 ] Hadoop QA commented on YARN-4624: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc 
{color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 17s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 26s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 35s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 159m 18s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | | Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(ResponseInfo, String) At CapacitySchedulerPage.java:then immediately reboxed in
[jira] [Commented] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168814#comment-15168814 ] Varun Vasudev commented on YARN-4731: - The signal container exception and the container initialization error can be ignored. The signal container exception occurs because we call signal container as part of the container cleanup, and the container initialization error is due to the MR AM killing the last reducer. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml* > drwxr-s--- 2
[jira] [Updated] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-4624: --- Attachment: YARN-4624-003.patch > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: SchedulerUIWithOutLabelMapping.png, YARN-2674-002.patch, > YARN-4624-003.patch, YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
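The NPE at PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage is consistent with auto-unboxing a boxed capacity field that is null when a partition has no capacities configured for the queue (the FindBugs "boxed value is unboxed and then immediately reboxed" warning in the QA report above points at the same code path). A standalone, hypothetical sketch of a null-safe accessor; field and method names mirror the report, but this is not the actual patch:

```java
// Hypothetical null-safe accessor sketch, not the actual
// PartitionQueueCapacitiesInfo code: guard the boxed Float before
// unboxing, so rendering a partition with no configured capacities
// yields 0 instead of a NullPointerException.
public class PartitionCapacitySketch {
    private Float maxAMLimitPercentage; // null when not configured

    public float getMaxAMLimitPercentage() {
        // Auto-unboxing a null Float throws NPE; check first.
        return maxAMLimitPercentage == null ? 0f : maxAMLimitPercentage;
    }

    public void setMaxAMLimitPercentage(Float v) {
        this.maxAMLimitPercentage = v;
    }

    public static void main(String[] args) {
        PartitionCapacitySketch info = new PartitionCapacitySketch();
        System.out.println(info.getMaxAMLimitPercentage()); // safe for null field
        info.setMaxAMLimitPercentage(10.0f);
        System.out.println(info.getMaxAMLimitPercentage());
    }
}
```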
[jira] [Updated] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-4624: --- Attachment: (was: YARN-4624-003.patch) > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: SchedulerUIWithOutLabelMapping.png, YARN-2674-002.patch, > YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168659#comment-15168659 ] Naganarasimha G R commented on YARN-4736: - Hi [~sjlee0], hope the attached files map to the issues as follows: *For Issue 1*: threaddump.log *For Issue 2*: hbaseException.log bq. This could be a bug in HBase. It seems like the HBase cluster was already shut down, but the flush operation took a long time to finally error out (36 minutes): This log is for case 2: actually, neither the cluster nor the NM was shut down in this case; the app completed successfully, but this error appeared after about 40 minutes. > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: hbaseException.log, threaddump.log > > > Faced some issues while running ATSv2 in single node Hadoop cluster and in > the same node had launched Hbase with embedded zookeeper. > # Due to some NPE issues i was able to see NM was trying to shutdown, but the > NM daemon process was not completed due to the locks. > # Got some exception related to Hbase after application finished execution > successfully. > will attach logs and the trace for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168655#comment-15168655 ] Hadoop QA commented on YARN-4740: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 57s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | 
{color:green} mvninstall {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 0 new + 119 unchanged - 2 fixed = 119 total (was 121) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 79m 26s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 14s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 171m 27s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL |