[jira] [Updated] (YARN-2886) Estimating waiting time in NM container queues
[ https://issues.apache.org/jira/browse/YARN-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2886: -- Parent Issue: YARN-4742 (was: YARN-2877) > Estimating waiting time in NM container queues > -- > > Key: YARN-2886 > URL: https://issues.apache.org/jira/browse/YARN-2886 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > > This JIRA is about estimating the waiting time of each NM queue. > Having these estimates is crucial for the distributed scheduling of container > requests, as it allows the LocalRM to decide in which NMs to queue the > queueable container requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
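The JIRA above does not specify how the NM queue wait time would be computed; a minimal sketch of one plausible estimator (queue length times a moving average of task duration; all names here are hypothetical, not from any patch):

```java
// Hypothetical sketch: estimate an NM queue's wait time as
// (current queue length) x (exponential moving average of task duration).
class QueueWaitEstimator {
    private double meanTaskMillis;  // EMA of observed container run time
    private final double alpha;     // smoothing factor in (0, 1]

    QueueWaitEstimator(double alpha, double initialMeanMillis) {
        this.alpha = alpha;
        this.meanTaskMillis = initialMeanMillis;
    }

    /** Update the moving average when a queued container finishes. */
    void recordCompletion(long durationMillis) {
        meanTaskMillis = alpha * durationMillis + (1 - alpha) * meanTaskMillis;
    }

    /** Estimated wait for a newly queued container, given the queue length. */
    long estimatedWaitMillis(int queueLength) {
        return Math.round(queueLength * meanTaskMillis);
    }
}
```

A LocalRM could then rank NMs by `estimatedWaitMillis` when choosing where to queue a queueable request; the real design may weigh other signals as well.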
[jira] [Updated] (YARN-4631) Add specialized Token support for DistributedSchedulingProtocol
[ https://issues.apache.org/jira/browse/YARN-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4631: -- Parent Issue: YARN-4742 (was: YARN-2877) > Add specialized Token support for DistributedSchedulingProtocol > --- > > Key: YARN-4631 > URL: https://issues.apache.org/jira/browse/YARN-4631 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Arun Suresh >Assignee: Arun Suresh > > The {{DistributedSchedulingProtocol}} introduced in YARN-2885 extends > the {{ApplicationMasterProtocol}}. This protocol should support its own Token > type, and not just reuse the AMRMToken. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4742) [Umbrella] Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170421#comment-15170421 ] Arun Suresh commented on YARN-4742: --- The subtasks specified here can be worked on independently and need not be developed on a feature branch > [Umbrella] Enhancements to Distributed Scheduling > - > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4742) [Umbrella] Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4742: -- Summary: [Umbrella] Enhancements to Distributed Scheduling (was: Enhancements to Distributed Scheduling) > [Umbrella] Enhancements to Distributed Scheduling > - > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 > These can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4742) [Umbrella] Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4742: -- Description: This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 (was: This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 These can) > [Umbrella] Enhancements to Distributed Scheduling > - > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4742) Enhancements to Distributed Scheduling
[ https://issues.apache.org/jira/browse/YARN-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4742: -- Description: This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 These can was:This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 > Enhancements to Distributed Scheduling > -- > > Key: YARN-4742 > URL: https://issues.apache.org/jira/browse/YARN-4742 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Arun Suresh > > This is an Umbrella JIRA to track enhancements / improvements that can be > made to the core Distributed Scheduling framework : YARN-2877 > These can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2995) Enhance UI to show cluster resource utilization of various container types
[ https://issues.apache.org/jira/browse/YARN-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2995: -- Parent Issue: YARN-4742 (was: YARN-2877) > Enhance UI to show cluster resource utilization of various container types > -- > > Key: YARN-2995 > URL: https://issues.apache.org/jira/browse/YARN-2995 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sriram Rao > > This JIRA proposes to extend the Resource manager UI to show how cluster > resources are being used to run *guaranteed start* and *queueable* > containers. For example, a graph that shows over time, the fraction of > running containers that are *guaranteed start* and the fraction of running > containers that are *queueable*. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4503) Allow for a pluggable policy to decide if a ResourceRequest is GUARANTEED or not
[ https://issues.apache.org/jira/browse/YARN-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-4503: -- Parent Issue: YARN-4742 (was: YARN-2877) > Allow for a pluggable policy to decide if a ResourceRequest is GUARANTEED or > not > > > Key: YARN-4503 > URL: https://issues.apache.org/jira/browse/YARN-4503 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Arun Suresh >Assignee: Arun Suresh > > As per discussions on the YARN-2882 thread, specifically [this > comment|https://issues.apache.org/jira/browse/YARN-2882?focusedCommentId=15065547=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15065547], > we would require a pluggable policy that can decide if a ResourceRequest is > GUARANTEED or OPPORTUNISTIC -- This message was sent by Atlassian JIRA (v6.3.4#6332)
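The pluggable policy proposed above is left abstract in the JIRA; a minimal sketch of what such a policy interface might look like (names and the priority-threshold heuristic are assumptions for illustration, not the actual patch):

```java
// Hypothetical sketch of a pluggable policy that classifies a resource
// request as GUARANTEED or OPPORTUNISTIC, per the YARN-4503 discussion.
enum RequestType { GUARANTEED, OPPORTUNISTIC }

interface RequestClassificationPolicy {
    RequestType classify(int priority);
}

// One possible implementation: requests at or above a configured priority
// threshold run as GUARANTEED (lower number = higher priority in YARN).
class PriorityThresholdPolicy implements RequestClassificationPolicy {
    private final int threshold;

    PriorityThresholdPolicy(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public RequestType classify(int priority) {
        return priority <= threshold ? RequestType.GUARANTEED
                                     : RequestType.OPPORTUNISTIC;
    }
}
```

Other policies (e.g. per-queue quotas on guaranteed requests) would plug in behind the same interface.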
[jira] [Updated] (YARN-2895) Integrate distributed scheduling with capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Suresh updated YARN-2895: -- Parent Issue: YARN-4742 (was: YARN-2877) > Integrate distributed scheduling with capacity scheduler > > > Key: YARN-2895 > URL: https://issues.apache.org/jira/browse/YARN-2895 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > > There are some benefits to integrating the distributed scheduling mechanism (LocalRM) > with the capacity scheduler: > - Resource usage of opportunistic containers can be tracked by the central RM and > capacity could be enforced > - Opportunity to transfer an opportunistic container to a conservative container -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4742) Enhancements to Distributed Scheduling
Arun Suresh created YARN-4742: - Summary: Enhancements to Distributed Scheduling Key: YARN-4742 URL: https://issues.apache.org/jira/browse/YARN-4742 Project: Hadoop YARN Issue Type: Bug Reporter: Arun Suresh Assignee: Arun Suresh This is an Umbrella JIRA to track enhancements / improvements that can be made to the core Distributed Scheduling framework : YARN-2877 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170406#comment-15170406 ] Silnov commented on YARN-4728: -- Varun Saxena, thanks for your response! I have checked MAPREDUCE-6513. The scenario is similar to what you described. I'll learn from it :) > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration largely remains at the defaults. > I run some jobs on the cluster (especially jobs processing a lot of data) > every day. > Sometimes my job stays at the same progress value for a very very long > time, so I have to kill the job manually and re-submit it to the cluster. This > worked before (the re-submitted job ran to the end), but something went wrong > today. > After I re-submitted the same job 3 times, it deadlocked (the progress > doesn't change for a long time, and each attempt stalls at a different > progress value, e.g. 33.01%, 45.8%, 73.21%). > I checked the web UI for Hadoop and found 98 map tasks suspended while the > running reduce tasks had consumed all the available memory. I stopped YARN, > added the configuration below to yarn-site.xml, and restarted YARN. > yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (intending YARN to preempt the reduce tasks' resources to run the suspended > map tasks) > After restarting YARN, I submitted the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1. 
> but the same result happened again!! (my job remained at the same progress > value for a very very long time) > I checked the web UI for Hadoop again, and found that the suspended map tasks > were re-created with the note: "TaskAttempt killed because it ran on > unusable node node02:21349". > Then I checked the resourcemanager's log and found some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good, > which causes the RM not to receive the NM's heartbeat in time. > But I wonder why the YARN framework can't preempt the running reduce tasks' > resources to run the suspended map tasks? (this leaves the job at the same > progress value for a very very long time :( ) > Anyone able to help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
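The two property names quoted in the message above appear to have lost their XML wrapping during extraction; in the poster's yarn-site.xml they would have looked roughly like the fragment below (note that `yarn.app.mapreduce.am.*` properties are read by the MapReduce AM, so mapred-site.xml is their more usual home):

```xml
<configuration>
  <property>
    <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
    <value>0.1</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
    <value>1.0</value>
  </property>
</configuration>
```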
[jira] [Commented] (YARN-4728) MapReduce job doesn't make any progress for a very very long time after one Node become unusable.
[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170403#comment-15170403 ] Silnov commented on YARN-4728: -- zhihai xu, thanks for your response! I will try to make some changes following your advice! > MapReduce job doesn't make any progress for a very very long time after one > Node become unusable. > - > > Key: YARN-4728 > URL: https://issues.apache.org/jira/browse/YARN-4728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, nodemanager, resourcemanager >Affects Versions: 2.6.0 > Environment: hadoop 2.6.0 > yarn >Reporter: Silnov >Priority: Critical > Original Estimate: 24h > Remaining Estimate: 24h > > I have some nodes running hadoop 2.6.0. > The cluster's configuration largely remains at the defaults. > I run some jobs on the cluster (especially jobs processing a lot of data) > every day. > Sometimes my job stays at the same progress value for a very very long > time, so I have to kill the job manually and re-submit it to the cluster. This > worked before (the re-submitted job ran to the end), but something went wrong > today. > After I re-submitted the same job 3 times, it deadlocked (the progress > doesn't change for a long time, and each attempt stalls at a different > progress value, e.g. 33.01%, 45.8%, 73.21%). > I checked the web UI for Hadoop and found 98 map tasks suspended while the > running reduce tasks had consumed all the available memory. I stopped YARN, > added the configuration below to yarn-site.xml, and restarted YARN. > yarn.app.mapreduce.am.job.reduce.rampup.limit > 0.1 > yarn.app.mapreduce.am.job.reduce.preemption.limit > 1.0 > (intending YARN to preempt the reduce tasks' resources to run the suspended > map tasks) > After restarting YARN, I submitted the job with the property > mapreduce.job.reduce.slowstart.completedmaps=1. 
> but the same result happened again!! (my job remained at the same progress > value for a very very long time) > I checked the web UI for Hadoop again, and found that the suspended map tasks > were re-created with the note: "TaskAttempt killed because it ran on > unusable node node02:21349". > Then I checked the resourcemanager's log and found some useful messages below: > **Deactivating Node node02:21349 as it is now LOST. > **node02:21349 Node Transitioned from RUNNING to LOST. > I think this may happen because my network across the cluster is not good, > which causes the RM not to receive the NM's heartbeat in time. > But I wonder why the YARN framework can't preempt the running reduce tasks' > resources to run the suspended map tasks? (this leaves the job at the same > progress value for a very very long time :( ) > Anyone able to help? > Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170397#comment-15170397 ] Varun Saxena commented on YARN-4700: This event can be emitted on recovery for RUNNING apps as well. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
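The description above suspects that the default cluster id embeds the RM start timestamp and that the cluster id is part of the row key. A sketch of why that combination yields one extra record per restart (the key layout and names here are illustrative, not the actual ATS v2 schema):

```java
// Hypothetical sketch of the suspected YARN-4700 behavior: if the default
// cluster id embeds the RM start time, the row key for the SAME application
// differs after every RM restart, so replayed events create a new record.
class TimelineRowKeySketch {
    static String clusterId(long rmStartMillis) {
        return "yarn-cluster_" + rmStartMillis;  // restart => new timestamp
    }

    static String appRowKey(String clusterId, String user,
                            String flow, String appId) {
        return String.join("!", clusterId, user, flow, appId);
    }
}
```

A stable default cluster id (one that does not change across restarts) would make the replayed write land on the existing row instead.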
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170387#comment-15170387 ] Varun Saxena commented on YARN-4700: If we do not fix it in the RM, as was the conclusion in those 2 JIRAs, and we use the current timestamp, we will have to peek into the app table to see whether the app already exists, and if it does, not write an entry to the activity table. The reason, as it seems from the discussion on those JIRAs, is that ATS events are asynchronous (because we use a dispatcher in between), so it is better to replay the events. Maybe we can give priority to app start and finish events and make them synchronous for V2, as the app collector will be running within the RM, but this would block app finishing until the ATS event completes. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170376#comment-15170376 ] Varun Saxena commented on YARN-4700: bq. I think we're using the day timestamp for a reason as this table is supposed to be a flow (daily) activity table. And some considerations are given to long running apps that will cross the day boundaries. Let us assume the RM does not restart. In that case, we will get the start event and the finish event only once each, and the event timestamp will be close to the current timestamp. And if those are the only events we get, the issue with long-running apps (extending over more than 2 days) will be there anyway. For instance, if we get the start event on day 1 and the finish event on day 3, and there is no other app for this flow, this will lead to no activity on day 2 even if we use the current timestamp. There was YARN-4069 which was filed for this issue and it is assigned to me. I was thinking of scheduling a global timer in the RM which can emit ATS events for all the running apps at a certain point of time. This should resolve the long-running app issue. It is not marked for the 1st milestone though, so no progress has been made. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
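The comment above turns on the flow activity table being keyed by a "day timestamp": events are bucketed to the top of their day, so a flow with a start event on day 1 and a finish event on day 3 records nothing for day 2. A sketch of that day-rounding (the name `topOfDay` is illustrative, not the actual utility in the codebase):

```java
// Sketch of top-of-the-UTC-day rounding, as used to bucket events into a
// daily flow activity table: any timestamp within a day maps to the same key.
class DayKey {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    static long topOfDay(long tsMillis) {
        return tsMillis - (tsMillis % DAY_MILLIS);  // midnight UTC of that day
    }
}
```

With this bucketing, only days on which some event actually arrives get a row, which is why the comment proposes a periodic timer emitting events for still-running apps.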
[jira] [Commented] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170335#comment-15170335 ] Rohith Sharma K S commented on YARN-4624: - [~brahmareddy] can you check findbugs warnings? > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: SchedulerUIWithOutLabelMapping.png, YARN-2674-002.patch, > YARN-4624-003.patch, YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
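The stack trace above points at `PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage`; an NPE there is consistent with a boxed `Float` field that is null for partitions without a value and throws on auto-unboxing. A null-safe sketch of that shape (the field and default are assumptions for illustration, not the actual patch):

```java
// Hypothetical sketch of the kind of fix the YARN-4624 stack trace suggests:
// a DAO field held as a boxed Float can be null when a partition has no
// capacity value, and auto-unboxing it in the renderer throws the NPE.
class PartitionCapacitiesSketch {
    private Float maxAMLimitPercentage;  // null when no value is configured

    void setMaxAMLimitPercentage(Float v) {
        maxAMLimitPercentage = v;
    }

    // Null-safe accessor: return a default instead of NPE on unboxing.
    float getMaxAMLimitPercentage() {
        return maxAMLimitPercentage == null ? 0f : maxAMLimitPercentage;
    }
}
```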
[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170307#comment-15170307 ] Rohith Sharma K S commented on YARN-4741: - I am not quite sure whether this is the same as YARN-3990. Based on the affected version, I suspect it might be the same issue. On the other hand, looking at the event type, it may also be a new issue. Anyway, [~sjlee0] can you cross-verify that the fix of YARN-3990 is present in your cluster? > RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async > dispatcher event queue > --- > > Key: YARN-4741 > URL: https://issues.apache.org/jira/browse/YARN-4741 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Sangjin Lee >Priority: Critical > > We had a pretty major incident with the RM where it was continually flooded > with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event > queue. > In our setup, we had the RM HA or stateful restart *disabled*, but NM > work-preserving restart *enabled*. Due to other issues, we did a cluster-wide > NM restart. > Some time during the restart (which took multiple hours), we started seeing > the async dispatcher event queue building. Normally it would log 1,000. In > this case, it climbed all the way up to tens of millions of events. 
> When we looked at the RM log, it was full of the following messages: > {noformat} > 2016-02-18 01:47:29,530 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > {noformat} > And that node in question was restarted a few minutes earlier. > When we inspected the RM heap, it was full of > RMNodeFinishedContainersPulledByAMEvents. > Suspecting the NM work-preserving restart, we disabled it and did another > cluster-wide rolling restart. Initially that seemed to have helped reduce the > queue size, but the queue built back up to several millions and continued for > an extended period. We had to restart the RM to resolve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170234#comment-15170234 ] Hadoop QA commented on YARN-2883: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 16m 19s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 45s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 51s {color} | {color:green} yarn-2877 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 9s {color} | {color:green} yarn-2877 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 17s {color} | {color:green} yarn-2877 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} yarn-2877 passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 41s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common in yarn-2877 has 3 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 14s {color} | {color:green} yarn-2877 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 5s {color} | {color:green} yarn-2877 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 7s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 7s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 23s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 41s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 26 new + 222 unchanged - 1 fixed = 248 total (was 223) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 55 line(s) that end in whitespace. Use git apply --whitespace=fix. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 12s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 26s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 49s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-server-common in the patch passed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 14s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 30s {color} | {color:green}
[jira] [Commented] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170178#comment-15170178 ] Hadoop QA commented on YARN-4731: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 11m 2s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 45s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} 
compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 12s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 26s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 40m 12s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12790248/YARN-4731.002.patch | | JIRA Issue | YARN-4731 | | Optional Tests | asflicense compile cc mvnsite javac unit | | uname | Linux 5c9152fd5ef1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d1d4e16 | | Default Java | 1.7.0_95 | | Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 | | JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/10655/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10655/console | | Powered by | Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > container-executor should not follow symlinks in recursive_unlink_children > -- > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects
[jira] [Updated] (YARN-1011) [Umbrella] Schedule containers based on utilization of currently allocated containers
[ https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantinos Karanasos updated YARN-1011: - Attachment: patch-for-yarn-1011.patch Hi guys, as I mentioned to [~kasha] in an offline discussion we had also with [~asuresh], I recently made some very similar changes to the RM and the CapacityScheduler, which could be of use here. We implemented a system, which we call Yaq (it will be presented at EuroSys in a couple of months, so I will soon be able to share the paper as well), that allows queuing of tasks at the NMs (using some bits from YARN-2883). If you consider queuing as a way of over-committing resources, the setting has many commonalities with the current JIRA. The basic idea is that each NM advertises a number of queue slots (i.e., containers that are allowed to be queued at the NM), in order to have some tasks ready to be started immediately once resources become available. This way we can mask the allocation latency when waiting for the RM to assign and send new tasks to the NM. Then the central scheduler (the CapacityScheduler in our case) performs placement in the following way: # When there are still available resources (I hard-coded it to be <95% utilization; Karthik mentioned 80% above), the scheduling is heartbeat-driven, as it is today. # Above 95% utilization, an additional asynchronous thread orders the nodes (in the current implementation, based on the expected wait time of each node) and then starts filling up the queue slots. The reason for having this thread is that (1) we don't want a container to be given a queue slot if there is a run slot available, and (2) we want a global view of all nodes so that we favor nodes whose queues are less loaded. As you can see, the scheduling logic is very similar to what [~kasha] described above, so I think we can reuse those bits. 
What is also similar to the over-commitment described in this JIRA is that we extended the {{SchedulerNode}} so that it can account for two types of resources, namely run slots (which were already there) and queue slots. Although in the current implementation we do not over-commit resources (queued tasks start only after allocated resources become available), this just has to do with how the NM decides to treat the additional tasks it receives (and it can easily be changed to actually over-commit resources). As far as the RM/scheduler is concerned, the logic is the same: you have the allocated resources (run slots), and then you have some additional resources that are advertised by the nodes and can be used either for queuing containers (in Yaq's case) or for starting additional tasks based on the actual resource utilization (in the present JIRA's case). So, for the RM, the logic should be the same. I am attaching an initial patch that contains the classes that are related to the RM. It is against branch-2.7.1. Since I left out many classes that were specific to Yaq, the patch will not compile, but you should be able to see all the changes I made to the {{SchedulerNode}}, the {{CapacityScheduler}}, the {{*Queue}} classes, etc. I would be happy to share more code if needed. > [Umbrella] Schedule containers based on utilization of currently allocated > containers > - > > Key: YARN-1011 > URL: https://issues.apache.org/jira/browse/YARN-1011 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun C Murthy > Attachments: patch-for-yarn-1011.patch, yarn-1011-design-v0.pdf, > yarn-1011-design-v1.pdf, yarn-1011-design-v2.pdf > > > Currently the RM allocates containers and assumes the resources allocated are > utilized. > The RM can, and should, get to a point where it measures utilization of allocated > containers and, if appropriate, allocates more (speculative?) containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
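The two-mode placement logic described in the comment above can be sketched roughly as follows. This is an illustrative sketch only: the class and member names ({{Node}}, {{estimatedWaitMs}}, {{pickQueueSlot}}) and the 0.95 threshold constant are assumptions for exposition, not the actual Yaq or CapacityScheduler code.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the two scheduling modes described above.
// All names here are hypothetical; this is not the actual Yaq code.
class YaqPlacementSketch {
    static final double QUEUE_THRESHOLD = 0.95; // hard-coded threshold from the comment

    static class Node {
        final String host;
        final long estimatedWaitMs; // expected wait time for this node's queue
        Node(String host, long estimatedWaitMs) {
            this.host = host;
            this.estimatedWaitMs = estimatedWaitMs;
        }
    }

    /**
     * Returns null below the threshold (run slots are still available, so the
     * normal heartbeat-driven path should place the container); above it,
     * picks the node whose queue has the smallest expected wait time, so that
     * less-loaded queues are favored.
     */
    static Node pickQueueSlot(List<Node> nodes, double clusterUtilization) {
        if (clusterUtilization < QUEUE_THRESHOLD) {
            return null; // heartbeat-driven path: never queue when a run slot may be free
        }
        return nodes.stream()
                .min(Comparator.comparingLong(n -> n.estimatedWaitMs))
                .orElse(null);
    }
}
```

The asynchronous thread matters because it sees all nodes at once; a per-heartbeat decision could hand out a queue slot on a heavily loaded node while a lighter queue exists elsewhere.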
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170152#comment-15170152 ] Jun Gong commented on YARN-4720: Thanks [~mingma] for the review, suggestion and commit! > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Fix For: 2.8.0, 2.9.0 > > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch, YARN-4720.04.patch, YARN-4720.05.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
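The core idea of the fix — create the remote writer only when there is something to aggregate this cycle — can be sketched as below. The names are illustrative, not the actual {{AppLogAggregatorImpl}} API; the real change is in the patches attached above. A counter stands in for the NameNode calls so the effect of the early return is visible.

```java
import java.util.Set;

// Illustrative sketch of skipping NN operations when there is nothing to
// aggregate; not the actual AppLogAggregatorImpl code.
class LogAggregationSketch {
    int nnOps = 0; // simulated create/getfileinfo/rename/delete calls to the NN

    void uploadLogsForCycle(Set<String> pendingContainersInThisCycle) {
        // The fix: bail out before touching the remote FS when no container
        // produced new logs this cycle (long-running service case) or all
        // containers were skipped by the aggregation policy.
        if (pendingContainersInThisCycle.isEmpty()) {
            return;
        }
        nnOps++; // create the tmp log file (the LogWriter in the snippet above)
        // ... append one aggregated entry per pending container ...
        nnOps++; // rename the tmp file into place (or delete it)
    }
}
```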
[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170145#comment-15170145 ] Sangjin Lee commented on YARN-4741: --- I do see the node in question trying to get in sync with the RM on the applications it thinks it still owns. The trigger might be related to that. Still, it's not clear why the queue was still flooded with those events even after the *second* restart that disabled the NM work-preserving restart. > RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async > dispatcher event queue > --- > > Key: YARN-4741 > URL: https://issues.apache.org/jira/browse/YARN-4741 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Sangjin Lee >Priority: Critical > > We had a pretty major incident with the RM where it was continually flooded > with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event > queue. > In our setup, we had the RM HA or stateful restart *disabled*, but NM > work-preserving restart *enabled*. Due to other issues, we did a cluster-wide > NM restart. > Some time during the restart (which took multiple hours), we started seeing > the async dispatcher event queue building. Normally it would log 1,000. In > this case, it climbed all the way up to tens of millions of events. 
> When we looked at the RM log, it was full of the following messages: > {noformat} > 2016-02-18 01:47:29,530 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,535 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle > this event at current state > 2016-02-18 01:47:29,538 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid > event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 > {noformat} > And that node in question was restarted a few minutes earlier. > When we inspected the RM heap, it was full of > RMNodeFinishedContainersPulledByAMEvents. > Suspecting the NM work-preserving restart, we disabled it and did another > cluster-wide rolling restart. Initially that seemed to have helped reduce the > queue size, but the queue built back up to several millions and continued for > an extended period. We had to restart the RM to resolve the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170144#comment-15170144 ] Hadoop QA commented on YARN-4062: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m 9s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 7s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 42s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 32s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 59s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 8s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 57s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 57s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 16s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 3 new + 210 unchanged - 1 fixed = 213 total (was 211) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 3s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 14s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 11s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | |
[jira] [Commented] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service
[ https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170135#comment-15170135 ] Hadoop QA commented on YARN-4117: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 45s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 50s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 54s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-server-tests in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 28s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 28s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 1m 35s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 1m 35s {color} | {color:red} hadoop-yarn in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-tests in the patch failed. {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 12s {color} | {color:red} hadoop-yarn-server-tests in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 12s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 4s {color} | {color:red} hadoop-yarn-client in the patch failed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 39s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 14s {color} | {color:red} hadoop-yarn-server-tests in the patch failed with JDK v1.7.0_95.
[jira] [Updated] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4741: --- Priority: Critical (was: Major) > RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async > dispatcher event queue > --- > > Key: YARN-4741 > URL: https://issues.apache.org/jira/browse/YARN-4741 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Sangjin Lee >Priority: Critical
[jira] [Created] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
Sangjin Lee created YARN-4741: - Summary: RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue Key: YARN-4741 URL: https://issues.apache.org/jira/browse/YARN-4741 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sangjin Lee We had a pretty major incident with the RM where it was continually flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue. In our setup, we had the RM HA or stateful restart *disabled*, but NM work-preserving restart *enabled*. Due to other issues, we did a cluster-wide NM restart. Some time during the restart (which took multiple hours), we started seeing the async dispatcher event queue building. Normally it would log 1,000. In this case, it climbed all the way up to tens of millions of events. When we looked at the RM log, it was full of the following messages: {noformat} 2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node worker-node-foo.bar.net:8041 {noformat} And that node in question was restarted a few minutes earlier. When we inspected the RM heap, it was full of RMNodeFinishedContainersPulledByAMEvents. 
Suspecting the NM work-preserving restart, we disabled it and did another cluster-wide rolling restart. Initially that seemed to have helped reduce the queue size, but the queue built back up to several millions and continued for an extended period. We had to restart the RM to resolve the problem.
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170102#comment-15170102 ] Li Lu commented on YARN-4700: - bq. In the code that writes to the flow activity table, can we check the application status and make a decision not to write them? I'm not very familiar with the replayed event sequence, but will we receive an application finished event for each of the finished applications? If so, it will be very difficult to distinguish the real events from the replayed ones. [~Naganarasimha] what's your experience working with the SMP? Thanks! > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id.
[jira] [Commented] (YARN-2883) Queuing of container requests in the NM
[ https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170103#comment-15170103 ] Arun Suresh commented on YARN-2883: --- Thanks for the patch [~kkaranasos]. A couple of comments from my initial fly-by of the patch. * I don't think we should add the additional *newContainerStatus* method in ContainerStatus, since the standard pattern I see is to use the setter. * In {{Context}}, briefly give a comment about what the Key & Value of the maps returned by *getQueuedContainers* and *getKilledQueuedContainers* are. * It may be better to create another class like {{DistributedSchedulingContext}} that exposes both the above methods; then {{Context}} can just expose {{getDistributedSchedulingContext}}, which can return null if DistSched is not enabled. * See the above comment on changes to NMContext (create a DistSchedContext and add both the new maps in there). * In {{ContainerManagerImpl}}, we should register the {{ContainerExecutionEventType}} with the dispatcher in its constructor, where most of the other dispatchers are registered. And I guess it should be there whether or not AMRMProxy is enabled (if it is not enabled, no events will simply be generated). *ContainersMonitorImpl* * Can we replace Logical with Allocated (I am guessing that is what it means)? * Again, let's move all the state related to DistributedScheduling (logicalGuarContainers, logicalOpportContainers, logicalContainersUtilization, queuedGuarRequests, queuedOpportRequests and opportContainersToKill) to a separate {{DistributedSchedulingState}} object. It will be easier to reason about how the existing code has changed. * Maybe we should move all the right shifting ( >> 20) into a utility function inside ProcessTreeInfo? *General Comments* * A couple of classes have unused imports. * We need some unit tests. Will provide more comments shortly.. 
> Queuing of container requests in the NM > --- > > Key: YARN-2883 > URL: https://issues.apache.org/jira/browse/YARN-2883 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos > Attachments: YARN-2883-yarn-2877.001.patch > > > We propose to add a queue in each NM, where queueable container requests can > be held. > Based on the available resources in the node and the containers in the queue, > the NM will decide when to allow the execution of a queued container. > In order to ensure the instantaneous start of a guaranteed-start container, > the NM may decide to pre-empt/kill running queueable containers.
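The refactoring suggested in the review above — grouping the distributed-scheduling maps behind one context object that the NM {{Context}} exposes, returning null when DistSched is disabled — could take roughly the following shape. All names below are hypothetical sketches of the proposal, not actual YARN interfaces.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shape of the proposed refactoring: one object owning the
// queued/killed-queued maps, so callers test a single accessor for null
// instead of checking everywhere whether distributed scheduling is enabled.
class DistributedSchedulingContextSketch {
    // Keyed by container id; Object stands in for the container records.
    private final Map<String, Object> queuedContainers = new ConcurrentHashMap<>();
    private final Map<String, Object> killedQueuedContainers = new ConcurrentHashMap<>();

    Map<String, Object> getQueuedContainers() { return queuedContainers; }
    Map<String, Object> getKilledQueuedContainers() { return killedQueuedContainers; }
}

// The NM Context would then expose a single accessor for the whole bundle.
class ContextSketch {
    private final DistributedSchedulingContextSketch distSchedContext;

    ContextSketch(boolean distSchedEnabled) {
        this.distSchedContext =
            distSchedEnabled ? new DistributedSchedulingContextSketch() : null;
    }

    /** Null when distributed scheduling is disabled. */
    DistributedSchedulingContextSketch getDistributedSchedulingContext() {
        return distSchedContext;
    }
}
```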
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170065#comment-15170065 ] Sangjin Lee commented on YARN-4700: --- Wait, I think we're using the day timestamp for a reason, as this table is supposed to be a flow (daily) activity table. And some considerations are given to long-running apps that will cross the day boundaries. I'd like us to stick with that unless there is a compelling reason not to. In the code that writes to the flow activity table, can we check the application status and make a decision not to write them? cc [~jrottinghuis] > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone
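The duplicate-record behavior under discussion can be illustrated with a simplified row key. If the default cluster id embeds the RM start timestamp, the key for the same flow differs across restarts, so each replayed finished application gets a fresh row. The key layout below is deliberately simplified for illustration; the real flow-activity row key has more components.

```java
// Simplified flow-activity row key: clusterId!dayTimestamp!user!flowName.
// Illustrative only; the actual flow activity table key layout differs.
class FlowActivityKeySketch {
    static String rowKey(String clusterId, long dayTs, String user, String flow) {
        return clusterId + "!" + dayTs + "!" + user + "!" + flow;
    }
}
```

With a default cluster id like {{yarn-cluster_<rmStartTime>}}, the same application written before and after an RM restart yields two distinct keys, hence one extra record per restart; a stable default cluster id makes the two writes land on the same row.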
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170064#comment-15170064 ] Inigo Goiri commented on YARN-2965: --- YARN-3534 already collects the resource utilization in the NM and YARN-4055 sends them to the RM. I would like to use this JIRA to collect disk and network utilization from HADOOP-12210 and HADOOP-12211, add them to {{ResourceUtilization}} as MB/s, and let YARN-4055 take care of sending all the utilization to the RM. Not sure what is the status of the different efforts towards this direction at this point. [~vinodkv], [~kasha] thoughts? > Enhance Node Managers to monitor and report the resource usage on machines > -- > > Key: YARN-2965 > URL: https://issues.apache.org/jira/browse/YARN-2965 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Robert Grandl >Assignee: Robert Grandl > Attachments: ddoc_RT.docx > > > This JIRA is about augmenting Node Managers to monitor the resource usage on > the machine, aggregates these reports and exposes them to the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170052#comment-15170052 ] Sangjin Lee commented on YARN-4711: --- Thanks! > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch came across few NPEs > due to which NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, it was found that there was a delay in processing of events, since after YARN-3367 all the events were being processed by a single thread inside the timeline client. > Additionally, one scenario was found where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Attachment: YARN-4731.002.patch > container-executor should not follow symlinks in recursive_unlink_children > -- > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch, YARN-4731.002.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml* > drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 jobSubmitDir/ > -rwx-- 1 hdfs hadoop 5348 Feb 25 08:25 launch_container.sh* > drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 tmp/ > {noformat} -- This message was sent by Atlassian JIRA
[jira] [Comment Edited] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe edited comment on YARN-4731 at 2/26/16 10:49 PM: -- Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. The issue with patch v1 is a TOCTOU (time of check / time of use) race condition. Here is one example: 1. {{container-executor}} checks /foo to make sure that it's not a symlink; it isn't 2. An attacker moves /foo out of the way and re-creates /foo as a symlink to /etc 3. {{container-executor}} deletes /foo (which is really actually /etc at this point) The v2 version I posted avoids this race condition by using O_NOFOLLOW to open the files in step 3. Also, one note: we should also be using the {{dirfd}} and {{name}}, not {{fullpath}}. "fullpath" is purely provided for debugging and log messages. The directory could be renamed while we're traversing it; we don't want the removal to fail in this case. was (Author: cmccabe): Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. Please hold on for now... 
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Attachment: (was: YARN-4731.001.patch) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Attachment: YARN-4731.001.patch Here is a patch which changes {{recursive_unlink_children}} to skip removing symlinks. It doesn't open up a TOCTOU security issue, since it opens files with {{O_NOFOLLOW}} after doing the symlink check. I added a unit test case for {{recursive_unlink_children}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4731) container-executor should not follow symlinks in recursive_unlink_children
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated YARN-4731: --- Summary: container-executor should not follow symlinks in recursive_unlink_children (was: Linux container executor fails to delete nmlocal folders) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-4062: - Attachment: YARN-4062-YARN-2928.05.patch Uploading YARN-4062-YARN-2928.05.patch after fixing the checkstyle warnings > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.04.patch, > YARN-4062-YARN-2928.05.patch, YARN-4062-YARN-2928.1.patch, > YARN-4062-feature-YARN-2928.01.patch, YARN-4062-feature-YARN-2928.02.patch, > YARN-4062-feature-YARN-2928.03.patch > > > As part of YARN-3901, coprocessor and scanner is being added for storing into > the flow_run table. It also needs a flush & compaction processing in the > coprocessor and perhaps a new scanner to deal with the data during flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169939#comment-15169939 ] Vrushali C commented on YARN-4700: -- This was a good catch, thanks [~gtCarrera], [~varun_saxena] and [~Naganarasimha]! Let me know if I can be of any help. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169930#comment-15169930 ] Varun Saxena commented on YARN-3863: Coming to the review comments, bq. I know this is happening deep inside the method, but it seems like a bit of an anti-pattern that we have to reference whether something is an application v. entity. I think we can simply move all these methods to ApplicationEntityReader as well with only application specific code. bq. This is more of a question. Is a list of multiple equality filters the same as the multi-val equality filter? If not, how are they different? Yes, they are the same. It's just that we will have to peek for the same key again and again while matching. bq. (TimelineMultiValueEqualityFilter.java) The name is a bit confusing (see above) Maybe we can change TimelineEqualityFilter to TimelineKeyValueFilter and this one to TimelineKeyValuesFilter? > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169923#comment-15169923 ] Vrushali C commented on YARN-4700: -- I see. So when events are being replayed, we are making new entries in the flow activity table since we are using the current time. Yes, I think we should use the event timestamp. It should be a simple enough fix: take the event timestamp of the CREATED or FINISHED event and use that instead of null in the HBaseTimelineWriterImpl#storeInFlowActivityTable function. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
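The fix discussed above hinges on deriving the flow activity day bucket from the event timestamp rather than the current wall-clock time. Below is a minimal, hypothetical sketch of that idea (the class and method names are illustrative, not the actual HBaseTimelineWriterImpl code): truncating the event timestamp to the top of its UTC day gives a deterministic bucket, so a CREATED or FINISHED event replayed after an RM restart lands in the same day bucket as the original write.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; names are hypothetical, not the real
// HBaseTimelineWriterImpl API.
public class FlowActivityDaySketch {
    private static final long MILLIS_PER_DAY = TimeUnit.DAYS.toMillis(1);

    // Truncate a millisecond timestamp to the top (start) of its UTC day.
    // Using the event's own timestamp here (instead of
    // System.currentTimeMillis()) makes the result identical on replay.
    static long topOfDay(long eventTsMillis) {
        return eventTsMillis - (eventTsMillis % MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        long eventTs = 1456319010019L;  // event time carried in the replayed event
        System.out.println(topOfDay(eventTs));  // prints 1456272000000
    }
}
```

With System.currentTimeMillis(), a replay on a different day would compute a different bucket and hence a different row key, which is the extra record described in this JIRA.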
[jira] [Comment Edited] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe edited comment on YARN-4731 at 2/26/16 9:40 PM: - Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. Please hold on for now... was (Author: cmccabe): Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25
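The "failed to rmdir job.jar: Not a directory" error above suggests the traversal treated a symlink (job.jar) as a directory. The real container-executor is native C code, but the principle of the fix can be sketched in Java: walk the tree without following symbolic links, so a symlinked entry is unlinked as a link rather than descended into. This is an illustrative sketch under that assumption, not the actual YARN patch.

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

// Illustrative sketch (not the actual container-executor, which is C):
// delete a tree without following symlinks, so an entry like job.jar is
// removed as a link instead of being treated as a directory.
public class SafeDelete {
    static void deleteTree(Path root) throws IOException {
        // walkFileTree does not follow symlinks unless FOLLOW_LINKS is passed,
        // so symlinked entries arrive at visitFile and are unlinked in place.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);   // deletes the link itself, never its target
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) throw exc;
                Files.delete(dir);    // directory is empty by this point
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

In C, the analogous approach is lstat/fstatat with AT_SYMLINK_NOFOLLOW plus unlinkat, which also addresses the path-swap race Colin mentions, since checks and deletes operate on directory file descriptors rather than re-resolved paths.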
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169876#comment-15169876 ] Hadoop QA commented on YARN-4062: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 21s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 2s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 57s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 53s {color} | {color:green} YARN-2928 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s {color} | {color:green} YARN-2928 passed with JDK v1.8.0_72 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 8s {color} | {color:green} YARN-2928 passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 54s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 54s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 17s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 17s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 36s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 22 new + 210 unchanged - 1 fixed = 232 total (was 211) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 14s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 1m 47s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-jdk1.8.0_72 with JDK v1.8.0_72 generated 13 new + 0 unchanged - 0 fixed = 13 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 3s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 13s {color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} |
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169845#comment-15169845 ] Varun Saxena commented on YARN-3863: I missed *TimelineStorageUtils.java*. The changes here are to apply filters locally for the FS implementation. matchRelationFilters and matchEventFilters will be used by the HBase implementation as well. > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169833#comment-15169833 ] Varun Saxena commented on YARN-3863: As the patch is quite large, to aid review I will jot down what has been done in this JIRA. # The intention is to convert filters, which were represented as maps or sets, to TimelineFilterList, which allows complex filters to be supported. For example, take config filters in the format {{cfg1=value1, cfg2=value2, cfg3=value3, cfg4=value4}}, which means all the key-value pairs must match for an entity. With the work in this JIRA, we can support complex filters such as {{(cfg1=value1 OR cfg2=value2) AND (cfg3 \!=value3 AND cfg4\!=value4)}}. # Similarly, current metric filters just check whether a certain metric exists for an entity and do not compare against metric values, for instance {{metric1 >= 40}}. This will be supported now. Now coming to the code, * *TimelineEntityFilters.java* The filter representation has been changed. All filters will now be represented as a TimelineFilterList to support complex filters with ANDs and ORs. More on what kinds of filters each filter list will hold in the next point. * *TimelinexxxFilter.java* I have added 4 new filter classes here. All of these filter classes can be put inside a TimelineFilterList to construct complex filters using ANDs and ORs, and all of them will then be converted to HBase Filters in the HBase implementation. *TimelineEqualityFilter* is meant to match key-value pairs and will be used to represent config and info filters. The key and value can be matched for either equality or inequality. *TimelineMultiValEqualityFilter* matches a key against a list/set of values; the specified values must be a subset of what the entity contains. This is used to match relations (relatesTo and isRelatedTo).
For instance, if we specify entitytype=id1,id2,id3, then for each entity we will check whether id1, id2 and id3 exist among the relations specified for that entity type. It would not matter if other ids (within the scope of the entity type) are specified as relations for the entity. *TimelineCompareFilter* - As the name suggests, it is used for comparison and represents metric filters. All relational operators, i.e. =, !=, >, >=, < and <=, are supported. *TimelineExistsFilter* - This checks whether a value exists. It is used for event filters, to represent whether an event exists, and is transformed into HBase's column qualifier filter. * *xxxEntityReader.java* These classes read from the different tables in the HBase backend and contain the primary changes for the HBase implementation. I have focused on adding ample comments in the code for this part, but as it is important, I will explain it here as well. Basically, we create an HBase filter list based on the fields, configs and metrics to retrieve (done in YARN-3862) and a filter list based on filters, which is done in this JIRA. *TimelineEntityReader* - In this class we introduce a new abstract method {{constructFilterListBasedOnFilters}}, which is implemented by derived classes to create a filter list based on filters. For a single entity read, a filter list based on filters does not make sense, so the filter list created will only be based on fields. For multiple entity reads, though, we create a new filter list containing filters and fields together. *GenericEntityReader* - The changes here are for the entity table, plus some common code used by ApplicationEntityReader as well. In {{constructFilterListBasedOnFilters}}, HBase filter lists are created for the created-time range and for config, info and metric filters. Relation-based filters and event filters cannot be added directly here because of the way events and relations are stored in the backend HBase storage: we cannot apply a SingleColumnValueFilter to filter out rows.
So for them, we add filters to fetch only the columns we require to match these filters locally. This is done only if these fields are not mentioned in fieldsToRetrieve. For example, if event filters come in as (event1, event2) and the fields to retrieve do not mention EVENTS or ALL, I will read all event columns corresponding to event1 and event2 for the filtered rows. This reduces the amount of data retrieved from the backend, especially for events, since the number of events can be quite large. The code for this is in the method {{constructFilterListBasedOnFields}}. Now coming to the new methods which have been added: {{fetchPartialColsFromInfoFamily}} checks whether we need to get some of the columns for relations and events from the backend; this depends on the condition explained above. {{createFilterListForColsOfInfoFamily}} is called if the above condition is true. Here the idea is to add each column under the INFO column family which is in the fields to retrieve or has to be added because we want to match certain relation
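The filter-list design described in this comment can be sketched as a toy model. The code below is illustrative only; the real Hadoop classes (TimelineFilterList and the TimelinexxxFilter family) have richer types, but the core idea is the same: a filter list combines child filters with a single AND or OR operator, and lists can nest to build expressions like (cfg1=value1 OR cfg2=value2) AND (cfg3!=value3 AND cfg4!=value4).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of the TimelineFilterList idea; not the actual Hadoop API.
public class FilterListSketch {
    enum Op { AND, OR }

    // Key-value equality/inequality filter, analogous to TimelineEqualityFilter.
    static Predicate<Map<String, String>> keyValue(String key, String value, boolean equal) {
        return cfg -> equal == value.equals(cfg.get(key));
    }

    // Filter list: combines child filters with one operator; lists can nest.
    @SafeVarargs
    static Predicate<Map<String, String>> list(Op op, Predicate<Map<String, String>>... filters) {
        return cfg -> {
            for (Predicate<Map<String, String>> f : filters) {
                boolean r = f.test(cfg);
                if (op == Op.AND && !r) return false;  // AND short-circuits on failure
                if (op == Op.OR && r) return true;     // OR short-circuits on success
            }
            return op == Op.AND;  // AND: all passed; OR: none passed
        };
    }

    public static void main(String[] args) {
        // (cfg1=value1 OR cfg2=value2) AND (cfg3!=value3 AND cfg4!=value4)
        Predicate<Map<String, String>> expr = list(Op.AND,
            list(Op.OR,  keyValue("cfg1", "value1", true),  keyValue("cfg2", "value2", true)),
            list(Op.AND, keyValue("cfg3", "value3", false), keyValue("cfg4", "value4", false)));

        Map<String, String> cfg = new HashMap<>();
        cfg.put("cfg1", "value1");  // satisfies the OR branch
        cfg.put("cfg3", "other");   // cfg3 != value3
        cfg.put("cfg4", "other");   // cfg4 != value4
        System.out.println(expr.test(cfg));  // true
    }
}
```

In the HBase implementation these predicates are instead translated into HBase Filter/FilterList objects so most of the matching happens server-side; the local evaluation above corresponds to what the FS implementation (and the relation/event matching) does.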
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169832#comment-15169832 ] Hudson commented on YARN-4723: -- FAILURE: Integrated in Hadoop-trunk-Commit #9376 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9376/]) YARN-4723. NodesListManager$UnknownNodeId ClassCastException. (jlowe: rev 6b0f813e898cbd14b2ae52ecfed6d30bce8cb6b7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java * hadoop-yarn-project/CHANGES.txt > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Fix For: 2.7.3 > > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at 
java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4671) There is no need to acquire CS lock when completing a container
[ https://issues.apache.org/jira/browse/YARN-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169785#comment-15169785 ] Hadoop QA commented on YARN-4671: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | 
{color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 1 new + 104 unchanged - 1 fixed = 105 total (was 105) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 22s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 163m 44s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169764#comment-15169764 ] Naganarasimha G R commented on YARN-4700: - Yes, these were the JIRAs where we discussed it, but the [conclusion|https://issues.apache.org/jira/browse/YARN-4392?focusedCommentId=15033961=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15033961] was that we were OK with republishing the events with the exact data rather than not publishing at all, because it is not guaranteed that ATS events for apps in the state store were successfully published. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169756#comment-15169756 ] Naganarasimha G R commented on YARN-4700: - Thanks [~varun_saxena], I was also thinking along the same lines of using the event's timestamp rather than the current timestamp... > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still hold in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169753#comment-15169753 ] Naganarasimha G R commented on YARN-4736: - Thanks for the analysis [~sjlee0], bq. Both the thread dump and the HBase exception log are from the client process (NM side), correct? Yes, both are client-side exceptions (i.e. NM). I am not sure issue 2 is related to issue 1, but based on your explanation it seems to be. bq. some time after that, it looks like you issued a signal to stop the client process (NM)? Yes, the error log came (00:39:03) a significant amount of time after the completion of the job (00:02:28), and it was a significant time after that when I tried to stop the NM @ 01:09:19. Even when I try to stop the NM immediately after job completion, I am able to see this issue (the NM not stopping completely). bq. That's why I thought there seems to be a HBase bug that is causing the flush operation to be wedged in this state. At least that explains why you were not able to shut down the collector (and therefore NM). Yes, maybe; to confirm, are the server-side logs required? I am using *HBase - Version 1.0.2*. I can share any other information required too. cc/ [~te...@apache.org]. > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: hbaseException.log, threaddump.log > > > Faced some issues while running ATSv2 in single node Hadoop cluster and in > the same node had launched Hbase with embedded zookeeper. > # Due to some NPE issues i was able to see NM was trying to shutdown, but the > NM daemon process was not completed due to the locks. 
> # Got some exception related to Hbase after application finished execution > successfully. > will attach logs and the trace for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169734#comment-15169734 ] Varun Saxena commented on YARN-4700: When we recover an app, we can also get its start and finish times from the state store; these are then set in RMAppImpl. So when the event is replayed and reported to ATSv2, the app start event would carry the same time as when the app originally started. Currently, for the flow activity table, the inverted top-of-the-day timestamp is generated based on the current system time. I think we can instead use the timestamp coming in the event. That should resolve this issue. cc [~sjlee0], [~vrushalic]. I do not see any issue in using the event's timestamp. Do you see any? > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
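The fix Varun proposes above can be sketched as follows. This is a hypothetical illustration (the class and method names are invented, not ATSv2's actual helpers): if the inverted top-of-the-day timestamp is derived from the event's own timestamp rather than from System.currentTimeMillis(), a replayed app-start event maps to the same flow activity row-key component as the original write, so no duplicate record is created on RM restart.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch (names invented, not ATSv2's actual helpers) of
// deriving the inverted top-of-the-day timestamp from the event's own
// timestamp instead of the current system time.
public class FlowActivityTimestampSketch {

    private static final long MILLIS_PER_DAY = TimeUnit.DAYS.toMillis(1);

    /** Truncate a millisecond timestamp to the start of its (UTC) day. */
    public static long topOfDay(long ts) {
        return ts - (ts % MILLIS_PER_DAY);
    }

    /** Invert so that more recent days sort first in an HBase row-key scan. */
    public static long invertedTopOfDay(long ts) {
        return Long.MAX_VALUE - topOfDay(ts);
    }

    public static void main(String[] args) {
        long appStartTime = 1456444800123L;   // time carried in the app start event
        long replayedTime = appStartTime;     // replay after RM restart keeps it
        // Same row-key component both times, so no extra record is written.
        System.out.println(invertedTopOfDay(appStartTime)
            == invertedTopOfDay(replayedTime));  // prints "true"
    }
}
```

Had the second computation used the system clock at replay time instead, the two values would generally differ once the restart crosses a day boundary, which is exactly the duplicate-record symptom described in the issue.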
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169733#comment-15169733 ] Jason Lowe commented on YARN-4723: -- +1 committing this. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
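For context, the failing cast in the stack trace above can be reproduced in miniature. The sketch below uses hypothetical, heavily simplified stand-ins for the real classes: a PB-backed record blindly downcasts its abstract field to the protobuf implementation, which throws ClassCastException when handed a plain (non-protobuf) subclass such as NodesListManager$UnknownNodeId.

```java
// Hypothetical, simplified stand-ins for the real YARN classes,
// illustrating why the cast in NodeReportPBImpl.mergeLocalToBuilder fails.
abstract class NodeId {
    abstract String getHost();
}

class NodeIdPBImpl extends NodeId {              // protobuf-backed implementation
    private final String host;
    NodeIdPBImpl(String host) { this.host = host; }
    String getHost() { return host; }
}

class UnknownNodeId extends NodeId {             // plain subclass, not protobuf-backed
    String getHost() { return "unknown"; }
}

class NodeReportPBImpl {
    private NodeId nodeId;
    void setNodeId(NodeId id) { this.nodeId = id; }

    String mergeLocalToBuilder() {
        // The unchecked downcast mirrors the line in the stack trace;
        // it throws ClassCastException for any non-PB NodeId subclass.
        return ((NodeIdPBImpl) nodeId).getHost();
    }
}
```

The sketch shows why the fix belongs on the RM side: any NodeId handed to the PB conversion path must either be a PBImpl or be converted to one first.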
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169721#comment-15169721 ] Varun Saxena commented on YARN-4700: Hmm... I think the JIRAs are YARN-3127 and YARN-4392. Let me see how we can solve this problem. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service
[ https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-4117: --- Attachment: YARN-4117.v1.patch > End to end unit test with mini YARN cluster for AMRMProxy Service > - > > Key: YARN-4117 > URL: https://issues.apache.org/jira/browse/YARN-4117 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: Kishore Chaliparambil >Assignee: Giovanni Matteo Fumarola > Attachments: YARN-4117.v0.patch, YARN-4117.v1.patch > > > YARN-2884 introduces a proxy between AM and RM. This JIRA proposes an end to > end unit test using mini YARN cluster to the AMRMProxy service. This test > will validate register, allocate and finish application and token renewal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169686#comment-15169686 ] Eric Badger commented on YARN-4686: --- TestYarnCLI, TestAMRMClient, TestYarnClient, TestNMClient, and TestGetGroups are failing in multiple recent precommit builds [YARN-4117], [YARN-4630], [YARN-4676]. TestMiniYarnClusterNodeUtilization is tracked by [YARN-4566]. TestContainerManagerSecurity is failing on other recent precommit builds [YARN-4117], [YARN-4566]. The only failure related to this patch is TestApplicationClientProtocolOnHA. > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! > java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169667#comment-15169667 ] Naganarasimha G R commented on YARN-4711: - Hi [~sjlee0], so far I have only experimented with running the sleep job with a varied number of mappers and varied sleep times. I will try to measure the latency in {{TimelineClientImpl.putObjects}} by adding some logging and will share the figures. > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch, I came across a few NPEs > due to which the NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) 
> at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, I found that there was a delay in the processing of events: after > YARN-3367, all the events were getting processed by a single thread inside the > timeline client. > Additionally, I found one scenario where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe edited comment on YARN-4731 at 2/26/16 7:47 PM: - Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. was (Author: cmccabe): -1. This introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> >
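The "failed to rmdir job.jar: Not a directory" error above stems from job.jar being a symlink into the shared filecache: a traversal that follows the link ends up trying to rmdir a path that resolves through a symlink. A hedged Java sketch of the symlink-safe approach (the actual container-executor is C code, and this is an illustration of the principle, not that implementation): delete the link itself and never descend through it, so the filecache target survives.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

// Sketch of deleting a container work directory without following
// symlinks such as job.jar, which points into the shared filecache.
public class SymlinkSafeDelete {

    static void deleteTree(Path root) throws IOException {
        // A symlink (even to a directory) is removed as a link, never
        // descended into; NOFOLLOW_LINKS keeps the directory check honest.
        if (Files.isSymbolicLink(root)
                || !Files.isDirectory(root, LinkOption.NOFOLLOW_LINKS)) {
            Files.deleteIfExists(root);   // removes the link or file itself
            return;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                deleteTree(entry);        // recurse; links are handled above
            }
        }
        Files.delete(root);               // directory is now empty
    }
}
```

The design point is the same one behind Colin's fstat suggestion in the comment above: decide whether an entry is a directory from the entry itself (without following links), not from what its name currently resolves to.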
[jira] [Commented] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169640#comment-15169640 ] Vrushali C commented on YARN-4062: -- Attaching patch YARN-4062-YARN-2928.04.patch rebased to the right head. Thanks Sangjin Lee for the suggestion. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.04.patch, > YARN-4062-YARN-2928.1.patch, YARN-4062-feature-YARN-2928.01.patch, > YARN-4062-feature-YARN-2928.02.patch, YARN-4062-feature-YARN-2928.03.patch > > > As part of YARN-3901, a coprocessor and scanner are being added for storing into > the flow_run table. It also needs flush & compaction processing in the > coprocessor, and perhaps a new scanner to deal with the data during the flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-4062: - Attachment: YARN-4062-YARN-2928.04.patch Attaching patch rebased to the right head. Thanks [~sjlee0] for the suggestion. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.04.patch, > YARN-4062-YARN-2928.1.patch, YARN-4062-feature-YARN-2928.01.patch, > YARN-4062-feature-YARN-2928.02.patch, YARN-4062-feature-YARN-2928.03.patch > > > As part of YARN-3901, a coprocessor and scanner are being added for storing into > the flow_run table. It also needs flush & compaction processing in the > coprocessor, and perhaps a new scanner to deal with the data during the flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169624#comment-15169624 ] Eric Badger commented on YARN-4686: --- I am unable to reproduce any of the JUnit timeout failures (TestYarnCLI, TestAMRMClient, TestYarnClient, TestNMClient) locally via trunk or via trunk with my patch added. TestMiniYarnClusterNodeUtilization fails locally in both trunk and with my patch. TestContainerManagerSecurity passes locally in both trunk and with my patch. TestGetGroups has an Ignore annotation, so we can probably ignore that error. TestApplicationClientProtocolOnHA passes locally on trunk and fails locally with my patch. However, a non-deterministic number of tests fail with the same error in the initialization code. {noformat} java.lang.AssertionError: NMs failed to connect to the RM at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.client.ProtocolHATestBase.verifyConnections(ProtocolHATestBase.java:219) at org.apache.hadoop.yarn.client.ProtocolHATestBase.startHACluster(ProtocolHATestBase.java:284) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.initiate(TestApplicationClientProtocolOnHA.java:54) {noformat} > MiniYARNCluster.start() returns before cluster is completely started > > > Key: YARN-4686 > URL: https://issues.apache.org/jira/browse/YARN-4686 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith Sharma K S >Assignee: Eric Badger > Attachments: MAPREDUCE-6507.001.patch, YARN-4686.001.patch > > > TestRMNMInfo fails intermittently. Below is trace for the failure > {noformat} > testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 > sec <<< FAILURE! 
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but > was:<3> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at > org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169619#comment-15169619 ] Colin Patrick McCabe commented on YARN-4731: -1. This introduces a race condition where the path could change during our traversal. Let me see if I can find a way to do this through fstat or a related call. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml* > drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 jobSubmitDir/ > -rwx-- 1 hdfs hadoop 5348 Feb 25 08:25 launch_container.sh*
[jira] [Commented] (YARN-4117) End to end unit test with mini YARN cluster for AMRMProxy Service
[ https://issues.apache.org/jira/browse/YARN-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169591#comment-15169591 ] Giovanni Matteo Fumarola commented on YARN-4117: [~jianhe], thanks for reviewing the patch. 1. Sure; 2. Sure, I will check it. 3. Because I can customize CustomContainerManagerImpl and AMRMProxyService. MiniYarnCluster allocates the ports for the RM during the start phase, and there was no way to pass that information to the AMRMProxy. For this reason I save the value of the scheduler address, and in the customization I tell the AMRMProxy which address it will connect to. 4. The purpose of testE2ETokenSwap is to verify that the AMRMProxy swapped the security token and that the AM cannot submit the job directly to the RM. I will update the patch according to the feedback. > End to end unit test with mini YARN cluster for AMRMProxy Service > - > > Key: YARN-4117 > URL: https://issues.apache.org/jira/browse/YARN-4117 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: Kishore Chaliparambil >Assignee: Giovanni Matteo Fumarola > Attachments: YARN-4117.v0.patch > > > YARN-2884 introduces a proxy between AM and RM. This JIRA proposes an end-to-end > unit test using a mini YARN cluster for the AMRMProxy service. This test > will validate register, allocate and finish application and token renewal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4481) negative pending resource of queues lead to applications in accepted status inifnitly
[ https://issues.apache.org/jira/browse/YARN-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169550#comment-15169550 ] Eric Payne commented on YARN-4481: -- Not sure if this is related, but we are also seeing similar results in 2.7 for reserved containers: {noformat} "name" : "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=bigmem", ... "ReservedMB" : -6553600, "ReservedVCores" : -8000, "ReservedContainers" : -800, ... {noformat} > negative pending resource of queues lead to applications in accepted status > inifnitly > - > > Key: YARN-4481 > URL: https://issues.apache.org/jira/browse/YARN-4481 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.2 >Reporter: gu-chi >Priority: Critical > Attachments: jmx.txt > > > Met a scenario of negative pending resources with the capacity scheduler; in jmx, > it shows: > {noformat} > "PendingMB" : -4096, > "PendingVCores" : -1, > "PendingContainers" : -1, > {noformat} > full jmx information attached. > This is not just a jmx UI issue; the actual pending resource of the queue is also > negative, as I see the debug log of > bq. DEBUG | ResourceManager Event Processor | Skip this queue=root, because > it doesn't need more resource, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY > node-partition= | ParentQueue.java > This leads to the {{NULL_ASSIGNMENT}}. > The background: hundreds of applications were submitted, consuming all cluster > resources, and reservations happened. While running, network faults were injected by > a tool; the injection types were delay, jitter, > repeat, packet loss and disorder. Then most of the submitted applications > were killed. > Is anyone else also seeing negative pending resources, or does anyone have an idea of how this happens? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4700) ATS storage has one extra record each time the RM got restarted
[ https://issues.apache.org/jira/browse/YARN-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169531#comment-15169531 ] Li Lu commented on YARN-4700: - I think the redundant events are coming from the work-preserving RM restart, where the RM tries to "replay" application lifecycle events from the state store. I don't remember the JIRA number for fixing this for SMP (but I do remember [~Naganarasimha] was involved in the discussion), but it seems the conclusion was to handle this on the SMP/storage side rather than the RM side. For us, most of the tables are fine, but for the flow activity table we need to distinguish a "real" activity from a replayed one. > ATS storage has one extra record each time the RM got restarted > --- > > Key: YARN-4700 > URL: https://issues.apache.org/jira/browse/YARN-4700 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Li Lu >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > When testing the new web UI for ATS v2, I noticed that we're creating one > extra record for each finished application (but still held in the RM state > store) each time the RM got restarted. It's quite possible that we add the > cluster start timestamp into the default cluster id, thus each time we're > creating a new record for one application (cluster id is a part of the row > key). We need to fix this behavior, probably by having a better default > cluster id. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169469#comment-15169469 ] Sangjin Lee commented on YARN-4711: --- What I meant is the actual put call in {{TimelineClientImpl}}. > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch came across few NPEs > due to which NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, we found that there was a delay in the processing of events: after > YARN-3367, all the events were getting processed by a single thread inside the > timeline client. > Additionally, we found one scenario where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
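The stack traces above point at dereferences of a timeline context that has already been removed when a late container event arrives. A minimal sketch of the defensive shape such a fix could take — the class and method names below are illustrative, not the actual NMTimelinePublisher code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TimelineEventGuard {
    // Stand-in for the per-application timeline contexts held by the publisher.
    private final Map<String, Object> appContexts = new ConcurrentHashMap<>();

    public void addApp(String appId) { appContexts.put(appId, new Object()); }
    public void removeApp(String appId) { appContexts.remove(appId); }

    // Returns true when the event was published, false when it was dropped
    // because the application's context is already gone (e.g. the event was
    // delayed behind the single-threaded client and the app has finished).
    public boolean handleContainerEvent(String appId) {
        Object ctx = appContexts.get(appId);
        if (ctx == null) {
            return false; // drop and log instead of NPE-ing the dispatcher
        }
        // ... publish the entity using ctx ...
        return true;
    }

    public static void main(String[] args) {
        TimelineEventGuard g = new TimelineEventGuard();
        g.addApp("application_1");
        System.out.println(g.handleContainerEvent("application_1")); // true
        g.removeApp("application_1");
        System.out.println(g.handleContainerEvent("application_1")); // false
    }
}
```

The key point is that dropping the event keeps the shared AsyncDispatcher thread alive; the real fix would also need to decide whether a dropped event should be retried or surfaced as a metric.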
[jira] [Commented] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client
[ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169462#comment-15169462 ] Sangjin Lee commented on YARN-4711: --- I know it might take some effort to get the data, but do you have some idea what type of latency you're seeing with each put call? > NM is going down with NPE's due to single thread processing of events by > Timeline client > > > Key: YARN-4711 > URL: https://issues.apache.org/jira/browse/YARN-4711 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R >Priority: Critical > Labels: yarn-2928-1st-milestone > > After YARN-3367, while testing the latest 2928 branch came across few NPEs > due to which NM is shutting down. > {code} > 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: > Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63) > at > 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109) > at java.lang.Thread.run(Thread.java:745) > {code} > On analysis, we found that there was a delay in the processing of events: after > YARN-3367, all the events were getting processed by a single thread inside the > timeline client. > Additionally, we found one scenario where there is a possibility of an NPE: > * TimelineEntity.toString() when {{real}} is not null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169452#comment-15169452 ] Sangjin Lee commented on YARN-4712: --- Yes, at least for now we should emit the CPU as a percentage (i.e. multiply by 100). As for the "unavailable", can we not simply skip sending the CPU if the value is equal to unavailable? > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Labels: yarn-2928-1st-milestone > > There are 2 issues with CPU usage collection: > * I was able to observe that many times the CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor does > the calculation, i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered, so proper checks need to be added. > * {{EntityColumnPrefix.METRIC}} always uses a LongConverter, but > ContainerMonitor publishes decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
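The two suggestions above — propagate UNAVAILABLE instead of dividing it, and scale the decimal percentage into a long for the long-only metric converter — can be sketched as a hypothetical helper (not the actual ContainersMonitor/NMTimelinePublisher code):

```java
public class CpuUsageGuard {
    // Mirrors ResourceCalculatorProcessTree.UNAVAILABLE.
    public static final float UNAVAILABLE = -1.0F;

    // Propagate UNAVAILABLE rather than dividing it by the core count,
    // so the downstream UNAVAILABLE check can actually fire and the
    // publisher can skip sending the metric entirely.
    public static float totalCoresPercent(float perCorePercent, int numProcessors) {
        if (perCorePercent == UNAVAILABLE || numProcessors <= 0) {
            return UNAVAILABLE;
        }
        return perCorePercent / numProcessors;
    }

    // Scale the decimal percentage by 100 and round, so the value fits a
    // long-only converter while keeping two decimal digits of precision.
    public static long toLongMetric(float percent) {
        return Math.round(percent * 100.0F);
    }

    public static void main(String[] args) {
        System.out.println(totalCoresPercent(UNAVAILABLE, 8)); // -1.0: caller skips
        System.out.println(totalCoresPercent(400.0F, 8));      // 50.0
        System.out.println(toLongMetric(50.0F));               // 5000
    }
}
```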
[jira] [Updated] (YARN-4671) There is no need to acquire CS lock when completing a container
[ https://issues.apache.org/jira/browse/YARN-4671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-4671: Attachment: YARN-4671.2.patch Rebased against trunk > There is no need to acquire CS lock when completing a container > --- > > Key: YARN-4671 > URL: https://issues.apache.org/jira/browse/YARN-4671 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: MENG DING >Assignee: MENG DING > Attachments: YARN-4671.1.patch, YARN-4671.2.patch > > > In YARN-4519, we discovered that there is no need to acquire the CS lock in > CS#completedContainerInternal, because: > * Access to the critical section is already guarded by the queue lock. > * It is not essential to guard {{schedulerHealth}} with the CS lock in > completedContainerInternal. All maps in schedulerHealth are concurrent maps. > Even if schedulerHealth is not consistent at the moment, it will be > eventually consistent. > With this fix, we can truly claim that CS#allocate doesn't require the CS lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169407#comment-15169407 ] Daniel Templeton commented on YARN-4723: The patch seems fine to me. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
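The root cause in the trace above is a hard cast of every NodeId to NodeIdPBImpl. One common defensive pattern — sketched here with plain stand-in classes, not the real YARN records — is to check the runtime type and rebuild the PB-backed record from the interface's accessors when the implementation is foreign, such as the RM-internal UnknownNodeId:

```java
public class NodeIdMergeSketch {
    // Stand-ins for the YARN record interface and its implementations.
    interface NodeId { String getHost(); int getPort(); }

    static final class NodeIdPBImpl implements NodeId {
        private final String host; private final int port;
        NodeIdPBImpl(String host, int port) { this.host = host; this.port = port; }
        public String getHost() { return host; }
        public int getPort() { return port; }
    }

    static final class UnknownNodeId implements NodeId {
        public String getHost() { return "unknown"; }
        public int getPort() { return -1; }
    }

    // Convert any NodeId implementation to the PB form without a raw cast,
    // so merging a report never throws ClassCastException.
    static NodeIdPBImpl toPBImpl(NodeId id) {
        if (id instanceof NodeIdPBImpl) {
            return (NodeIdPBImpl) id;  // fast path: already PB-backed
        }
        return new NodeIdPBImpl(id.getHost(), id.getPort()); // rebuild safely
    }

    public static void main(String[] args) {
        System.out.println(toPBImpl(new UnknownNodeId()).getHost()); // unknown
    }
}
```

The actual patch may instead change how UnknownNodeId is constructed; this only illustrates the cast-free conversion idea.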
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169328#comment-15169328 ] Sangjin Lee commented on YARN-4736: --- Both the thread dump and the HBase exception log are from the client process (NM side), correct? I believe so because both are showing HBase client stack traces. Could you confirm? Since these are both from the client side, I am not sure what was going on on the HBase server side. Also, I'm not sure why HBase started refusing connections as evidenced by your exception log (that's why I assumed that HBase process might have gone away at that point). Here is how I put together the sequence of events: - at some point the HBase process starts refusing connections (hbaseException.log) - the periodic flush gets trapped in this bad state, finally logging the {{RetriesExhaustedException}} after 36 minutes {noformat} 2016-02-26 00:02:28,270 INFO org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: The collector service for application_1456425026132_0001 was removed 2016-02-26 00:39:03,879 ERROR org.apache.hadoop.hbase.client.AsyncProcess: Failed to get region location org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions: {noformat} - some time after that, it looks like you issued a signal to stop the client process (NM)? {noformat} 2016-02-26 01:09:19,799 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM {noformat} - but the service stop fails to shut down the periodic flush task thread {noformat} 2016-02-26 01:09:50,035 WARN org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: failed to stop the flusher task in time. will still proceed to close the writer. 
2016-02-26 01:09:50,035 INFO org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl: closing the entity table {noformat} - at this point the NM process is hung, because the flush is stuck while holding the {{BufferedMutatorImpl}} lock, and the closing of the entity table needs to acquire that lock. That's why I thought there seems to be an HBase bug that is causing the flush operation to be wedged in this state. At least that explains why you were not able to shut down the collector (and therefore the NM). > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: hbaseException.log, threaddump.log > > > Faced some issues while running ATSv2 in a single-node Hadoop cluster, on which HBase > with an embedded ZooKeeper was also launched. > # Due to some NPE issues I could see the NM trying to shut down, but the > NM daemon process did not exit due to the locks. > # Got some exceptions related to HBase after the application finished execution > successfully. > Will attach the logs and the trace for the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
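The shutdown hang described above is a bounded-wait problem: the collector tries to stop the flusher task, gives up after a timeout, and then blocks on the writer close. A sketch of a bounded stop using a plain ScheduledExecutorService follows; it is illustrative only (the actual TimelineCollectorManager flusher may differ), and note that shutdownNow() cannot unstick a flush that ignores interrupts while holding the BufferedMutator lock — it only keeps the stop path itself from hanging:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FlusherStopSketch {
    // Stop the periodic flush task with a bounded wait, then force shutdown
    // so a wedged flush cannot block the stop path indefinitely. Returns
    // false when the flusher did not stop in time (caller logs a warning
    // and proceeds to close the writer, as in the log above).
    public static boolean stopFlusher(ScheduledExecutorService flusher, long timeoutSec) {
        flusher.shutdown();                  // stop scheduling new flushes
        try {
            if (!flusher.awaitTermination(timeoutSec, TimeUnit.SECONDS)) {
                flusher.shutdownNow();       // interrupt an in-flight flush
                return false;
            }
        } catch (InterruptedException e) {
            flusher.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ScheduledExecutorService es = Executors.newSingleThreadScheduledExecutor();
        System.out.println(stopFlusher(es, 5)); // idle pool stops cleanly: true
    }
}
```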
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169311#comment-15169311 ] Hudson commented on YARN-4720: -- FAILURE: Integrated in Hadoop-trunk-Commit #9374 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9374/]) YARN-4720. Skip unnecessary NN operations in log aggregation. (Jun Gong (mingma: rev 7f3139e54da2c496327446a5eac43f8421fc8839) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/CHANGES.txt > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch, YARN-4720.04.patch, YARN-4720.05.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. 
> if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
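The shape of the fix discussed above — decide whether the cycle needs a remote writer at all before creating it — can be sketched as a hypothetical helper (names are illustrative, not the actual AppLogAggregatorImpl code):

```java
import java.util.Collections;
import java.util.Set;

public class LogAggregationSkipSketch {
    // Check whether this upload cycle has any containers to aggregate
    // before creating the remote writer, avoiding the unnecessary
    // create/getfileinfo/delete NameNode calls for an empty cycle.
    public static boolean shouldUpload(Set<String> pendingContainersThisCycle) {
        return !pendingContainersThisCycle.isEmpty();
    }

    public static void main(String[] args) {
        // Empty cycle (e.g. a long-running service with no new logs,
        // or a policy that skips every container): no writer, no NN calls.
        System.out.println(shouldUpload(Collections.emptySet()));             // false
        // At least one container produced logs: create the writer and upload.
        System.out.println(shouldUpload(Collections.singleton("container_1"))); // true
    }
}
```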
[jira] [Commented] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled
[ https://issues.apache.org/jira/browse/YARN-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169265#comment-15169265 ] Hadoop QA commented on YARN-4465: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 43s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | 
{color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 0 new + 20 unchanged - 1 fixed = 20 total (was 21) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 53s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 37s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 158m 55s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12790128/0007-YARN-4465.patch
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169233#comment-15169233 ] Kuhu Shukla commented on YARN-4723: --- This is for the 2.7 patch. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169183#comment-15169183 ] Kuhu Shukla commented on YARN-4723: --- Findbugs: {code} Bug type SIC_INNER_SHOULD_BE_STATIC (click for details) In class org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKSyncOperationCallback At ZKRMStateStore.java:[lines 118-127] {code} Checkstyle warnings are unrelated. Same thing with asf license warnings. Test failures are known issues. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723-branch-2.7.001.patch, YARN-4723.001.patch, > YARN-4723.002.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4735) Remove stale LogAggregationReport from NM's context
[ https://issues.apache.org/jira/browse/YARN-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169088#comment-15169088 ] Jun Gong commented on YARN-4735: Thanks [~mingma]. I checked the code, and it does not seem to be an issue; I'll verify it again. [~kasha] Maybe the issue you met is not the same? > Remove stale LogAggregationReport from NM's context > --- > > Key: YARN-4735 > URL: https://issues.apache.org/jira/browse/YARN-4735 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jun Gong >Assignee: Jun Gong > > {quote} > All LogAggregationReports (current and previous) are only added to > *context.getLogAggregationStatusForApps* and never removed. > So for a long-running service, the LogAggregationReport list the NM sends to the RM > will grow over time. > {quote} > Per the discussion in YARN-4720, we need to remove stale LogAggregationReports from the > NM's context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
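The cleanup being proposed can be sketched as follows. The names are hypothetical: the real NM context keys reports by ApplicationId, and the removal would be driven by the RM acknowledging receipt of an application's final report:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LogAggregationReportCleanup {
    // Once the RM has received an application's LogAggregationReport, drop
    // it from the NM context map so the per-heartbeat report list stays
    // bounded for long-running services. Returns how many were removed.
    public static int removeReported(Map<String, String> reports,
                                     Iterable<String> ackedApps) {
        int removed = 0;
        for (String appId : ackedApps) {
            if (reports.remove(appId) != null) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, String> reports = new ConcurrentHashMap<>();
        reports.put("app_1", "SUCCEEDED");
        reports.put("app_2", "RUNNING");
        System.out.println(removeReported(reports, java.util.Arrays.asList("app_1"))); // 1
        System.out.println(reports.size()); // 1: only the still-running app remains
    }
}
```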
[jira] [Commented] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169073#comment-15169073 ] Bibin A Chundatt commented on YARN-4731: [~jlowe]/[~vvasudev] Tried the same scenarios on branch 2.7.2; the signal error doesn't exist there. The signal error and the container initialization exception are not causing any task failures. If there is any scope for improvement, we can raise a new JIRA; there is no need to handle it as part of this one. The localization issue is fixed. +1 (non-binding) > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255.
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> >
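The "failed to rmdir job.jar: Not a directory" error above stems from `job.jar` being a symlink to a localized directory: a directory-removal call on the link itself fails. The actual fix is in the C container-executor, but the semantics can be illustrated with a minimal, hypothetical Java sketch (class and method names are illustrative only): deleting a symlink must remove only the link, never recurse into or rmdir its target.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SymlinkDeleteDemo {
    // Delete an entry; if it is a symlink, remove only the link itself.
    static void deleteEntry(Path p) throws IOException {
        if (!Files.isSymbolicLink(p) && Files.isDirectory(p)) {
            // Collect children first, then recurse.
            List<Path> children = new ArrayList<>();
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(p)) {
                for (Path c : ds) {
                    children.add(c);
                }
            }
            for (Path c : children) {
                deleteEntry(c);
            }
        }
        // Files.delete on a symlink removes the link, never its target.
        Files.delete(p);
    }

    public static void main(String[] args) throws IOException {
        Path appDir = Files.createTempDirectory("appcache");
        Path filecache = Files.createDirectory(appDir.resolve("filecache"));
        // job.jar in the container dir is a symlink to a localized directory.
        Path link = Files.createSymbolicLink(appDir.resolve("job.jar"), filecache);
        deleteEntry(link);
        System.out.println("link removed: "
            + Files.notExists(link, LinkOption.NOFOLLOW_LINKS));
        System.out.println("target intact: " + Files.isDirectory(filecache));
        deleteEntry(appDir); // cleanup
    }
}
```

Treating the link as a directory (as the failing `rmdir` path did) hits exactly the errno 20 "Not a directory" class of failure shown in the log.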
[jira] [Commented] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled
[ https://issues.apache.org/jira/browse/YARN-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169036#comment-15169036 ] Bibin A Chundatt commented on YARN-4465: [~leftnoteasy] Thank you for the review comments. {quote} Could you avoid show all cluster labels in exception when label doesn't exist in cluster? {quote} Done. {quote} And could you explain why change of TestKillApplicationWithRMHA is needed? {quote} Sorry, that change was made while debugging the test case; it has been excluded from the latest patch. > SchedulerUtils#validateRequest for Label check should happen only when > nodelabel enabled > > > Key: YARN-4465 > URL: https://issues.apache.org/jira/browse/YARN-4465 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-4465.patch, 0002-YARN-4465.patch, > 0003-YARN-4465.patch, 0004-YARN-4465.patch, 0006-YARN-4465.patch, > 0007-YARN-4465.patch > > > Disable label from rm side yarn.nodelabel.enable=false > Capacity scheduler label configuration for queue is available as below > default label for queue = b1 as 3 and accessible labels as 1,3 > Submit application to queue A . > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): > Invalid resource request, queue=b1 doesn't have permission to access all > labels in resource request. labelExpression of resource request=3. 
Queue > labels=1,3 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:401) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:340) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:602) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:247) > {noformat} > # Ignore default label expression when label is disabled *or* > # NormalizeResourceRequest we can set label expression to > when node label is not enabled *or* > # Improve message -- This message was sent by Atlassian JIRA (v6.3.4#6332)
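The first two options listed in the description amount to the same behavior: when node labels are disabled cluster-wide, normalization should drop the (queue-default) label expression rather than validate it. A minimal, hypothetical sketch of that guard (names are illustrative; the real logic lives in SchedulerUtils#normalizeAndValidateRequest and is driven by YARN configuration):

```java
// Hypothetical sketch of the proposed guard; not the actual SchedulerUtils
// code. When node labels are disabled, the request is treated as unlabeled
// instead of being validated against queue-accessible labels.
public class LabelNormalizationSketch {
    static final String NO_LABEL = "";

    static String normalizeLabelExpression(boolean nodeLabelsEnabled,
                                           String requestedExpression) {
        if (!nodeLabelsEnabled) {
            // Skip the label check entirely: clear the expression, so no
            // InvalidResourceRequestException can be raised for it.
            return NO_LABEL;
        }
        return requestedExpression == null ? NO_LABEL : requestedExpression;
    }

    public static void main(String[] args) {
        // yarn.nodelabel.enable=false with queue default label "3":
        // the expression is simply cleared, no exception.
        System.out.println("'" + normalizeLabelExpression(false, "3") + "'");
        // Labels enabled: the expression passes through to validation.
        System.out.println("'" + normalizeLabelExpression(true, "3") + "'");
    }
}
```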
[jira] [Updated] (YARN-4465) SchedulerUtils#validateRequest for Label check should happen only when nodelabel enabled
[ https://issues.apache.org/jira/browse/YARN-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4465: --- Attachment: 0007-YARN-4465.patch > SchedulerUtils#validateRequest for Label check should happen only when > nodelabel enabled > > > Key: YARN-4465 > URL: https://issues.apache.org/jira/browse/YARN-4465 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-4465.patch, 0002-YARN-4465.patch, > 0003-YARN-4465.patch, 0004-YARN-4465.patch, 0006-YARN-4465.patch, > 0007-YARN-4465.patch > > > Disable label from rm side yarn.nodelabel.enable=false > Capacity scheduler label configuration for queue is available as below > default label for queue = b1 as 3 and accessible labels as 1,3 > Submit application to queue A . > {noformat} > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): > Invalid resource request, queue=b1 doesn't have permission to access all > labels in resource request. labelExpression of resource request=3. 
Queue > labels=1,3 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:401) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:340) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:283) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:602) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:247) > {noformat} > # Ignore default label expression when label is disabled *or* > # NormalizeResourceRequest we can set label expression to > when node label is not enabled *or* > # Improve message -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168928#comment-15168928 ] Hadoop QA commented on YARN-4624: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc 
{color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 17s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 26s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 35s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 159m 18s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | | Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(ResponseInfo, String) At CapacitySchedulerPage.java:then immediately reboxed in
[jira] [Commented] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168814#comment-15168814 ] Varun Vasudev commented on YARN-4731: - The signal container exception and the container initialization error can be ignored. The signal container exception occurs because we call signal container as part of the container cleanup, and the container initialization error is due to the MR AM killing the last reducer. > Linux container executor fails to delete nmlocal folders > > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Bibin A Chundatt >Assignee: Varun Vasudev >Priority: Blocker > Attachments: YARN-4731.001.patch > > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager-local directory are not getting deleted for each > application > {noformat} > total 36 > drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./ > drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../ > -rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/ > lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml* > drwxr-s--- 2
[jira] [Updated] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-4624: --- Attachment: YARN-4624-003.patch > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: SchedulerUIWithOutLabelMapping.png, YARN-2674-002.patch, > YARN-4624-003.patch, YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
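The NPE at PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage is consistent with auto-unboxing a boxed capacity field that is null when a partition has no capacities configured for the queue (the FindBugs "boxed value is unboxed and then immediately reboxed" warning in the QA report above points at the same code path). A standalone, hypothetical sketch of a null-safe accessor; field and method names mirror the report, but this is not the actual patch:

```java
// Hypothetical null-safe accessor sketch, not the actual
// PartitionQueueCapacitiesInfo code: guard the boxed Float before
// unboxing, so rendering a partition with no configured capacities
// yields 0 instead of a NullPointerException.
public class PartitionCapacitySketch {
    private Float maxAMLimitPercentage; // null when not configured

    public float getMaxAMLimitPercentage() {
        // Auto-unboxing a null Float throws NPE; check first.
        return maxAMLimitPercentage == null ? 0f : maxAMLimitPercentage;
    }

    public void setMaxAMLimitPercentage(Float v) {
        this.maxAMLimitPercentage = v;
    }

    public static void main(String[] args) {
        PartitionCapacitySketch info = new PartitionCapacitySketch();
        System.out.println(info.getMaxAMLimitPercentage()); // safe for null field
        info.setMaxAMLimitPercentage(10.0f);
        System.out.println(info.getMaxAMLimitPercentage());
    }
}
```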
[jira] [Updated] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-4624: --- Attachment: (was: YARN-4624-003.patch) > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: SchedulerUIWithOutLabelMapping.png, YARN-2674-002.patch, > YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168659#comment-15168659 ] Naganarasimha G R commented on YARN-4736: - Hi [~sjlee0], hope the attached files map to the issues as follows: *For Issue 1*: threaddump.log *For Issue 2*: hbaseException.log bq. This could be a bug in HBase. It seems like the HBase cluster was already shut down, but the flush operation took a long time to finally error out (36 minutes): This log is for case 2: actually, neither the cluster nor the NM was shut down in this case; the app completed successfully, but this error appeared after about 40 minutes. > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: hbaseException.log, threaddump.log > > > Faced some issues while running ATSv2 in single node Hadoop cluster and in > the same node had launched Hbase with embedded zookeeper. > # Due to some NPE issues i was able to see NM was trying to shutdown, but the > NM daemon process was not completed due to the locks. > # Got some exception related to Hbase after application finished execution > successfully. > will attach logs and the trace for the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168655#comment-15168655 ] Hadoop QA commented on YARN-4740: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 57s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | 
{color:green} mvninstall {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 0 new + 119 unchanged - 2 fixed = 119 total (was 121) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 79m 26s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 14s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 171m 27s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL |