[jira] [Commented] (YARN-3214) Add non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379263#comment-14379263 ] Lohit Vijayarenu commented on YARN-3214:

Thanks [~wangda] for the reply. I feel that treating partitions and constraints as two separate entities will cause more confusion. If allocation is a challenge (as you described in the example with multiple labels), then it is something which should be solved in the scheduler, no? This is the same problem one would have even without labels: for a given node which advertises 10G of memory, and apps/queues X and Y, how would you divide the resource between X and Y?

PS: The Mesos scheduler, for example, uses a concept called constraints which is similar to labels. In that sense I agree with [~vinodkv] that we should probably call this feature partitions or something related.

--
Add non-exclusive node labels

Key: YARN-3214
URL: https://issues.apache.org/jira/browse/YARN-3214
Project: Hadoop YARN
Issue Type: Sub-task
Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
Attachments: Non-exclusive-Node-Partition-Design.pdf

Currently node labels partition the cluster into sub-clusters, so resources cannot be shared between partitions. With the current implementation of node labels we cannot use the cluster optimally, and the throughput of the cluster will suffer. We are proposing adding non-exclusive node labels:

1. Labeled apps get preference on labeled nodes.
2. If there is no ask for labeled resources, we can assign those nodes to non-labeled apps.
3. If there is any future ask for those resources, we will preempt the non-labeled apps and give the nodes back to labeled apps.
[jira] [Commented] (YARN-3214) Add non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378541#comment-14378541 ] Lohit Vijayarenu commented on YARN-3214:

bq. (P0) A node can belong to at most one partition. All nodes belong to a DEFAULT partition unless overridden.

Does this mean on a node we can have only one label? If so, it would become too restrictive. Labels on nodes can be seen in multiple dimensions (from the app's resources, machine resources and also use-case resources, e.g. backfill jobs being placed on a specific set of nodes). In those cases we should have the ability to put multiple labels on a node.

Also, the document mentions apps without any labels being scheduled on labeled nodes if resources are idle. Does that also cover apps which have a label other than A/B being placed on these nodes when free resources are available?
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131732#comment-14131732 ] Lohit Vijayarenu commented on YARN-2314:

We hit the same problem on one of our large clusters with more than 2.5K nodes. As a workaround we ended up increasing the container size for the AM to 6G, and with a vmem-pmem ratio of 2:1 we give away 12G of virtual memory for the AM container. From an initial look at this, there is no way to turn this behavior off via config, other than patching the code, right?

--
ContainerManagementProtocolProxy can create thousands of threads for a large cluster

Key: YARN-2314
URL: https://issues.apache.org/jira/browse/YARN-2314
Project: Hadoop YARN
Issue Type: Bug
Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Priority: Critical
Attachments: nmproxycachefix.prototype.patch

ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However, the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment.
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131920#comment-14131920 ] Lohit Vijayarenu commented on YARN-2314:

Thanks [~jlowe].
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041128#comment-14041128 ] Lohit Vijayarenu commented on YARN-796:

As [~tucu00] mentioned, labels sound closely related to affinity and should be treated less as a resource. They become closely related to resources when it comes to exposing them on scheduler queues and exposing that to users who wish to schedule their jobs on a certain set of labeled nodes. This is definitely a very useful feature to have. Looking forward to the design document.

--
Allow for (admin) labels on nodes and resource-requests

Key: YARN-796
URL: https://issues.apache.org/jira/browse/YARN-796
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
Attachments: YARN-796.patch

It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture, etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels.
[jira] [Commented] (YARN-1692) ConcurrentModificationException in fair scheduler AppSchedulable
[ https://issues.apache.org/jira/browse/YARN-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899506#comment-13899506 ] Lohit Vijayarenu commented on YARN-1692:

+1 on the patch. Can anyone else review this as well?

--
ConcurrentModificationException in fair scheduler AppSchedulable

Key: YARN-1692
URL: https://issues.apache.org/jira/browse/YARN-1692
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sangjin Lee
Assignee: Sangjin Lee
Attachments: yarn-1692.patch

We saw a ConcurrentModificationException thrown in the fair scheduler:

{noformat}
2014-02-07 01:40:01,978 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Exception in fair scheduler UpdateThread
java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
    at java.util.HashMap$ValueIterator.next(HashMap.java:954)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.updateDemand(AppSchedulable.java:85)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.updateDemand(FSLeafQueue.java:125)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.updateDemand(FSParentQueue.java:82)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:217)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:195)
    at java.lang.Thread.run(Thread.java:724)
{noformat}

The map that gets returned by FSSchedulerApp.getResourceRequests() is iterated on without proper synchronization.
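For reference, a minimal sketch of the usual fix pattern for this class of bug — synchronize with the writers and iterate over a snapshot rather than the live map view. This illustrates the general technique, not necessarily the exact change in yarn-1692.patch; the class and field names are invented for the example.

{noformat}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DemandUpdater {
  private final Map<String, Integer> resourceRequests = new HashMap<String, Integer>();

  // Writers take the same lock used when snapshotting below.
  public synchronized void addRequest(String key, int containers) {
    resourceRequests.put(key, containers);
  }

  public int updateDemand() {
    List<Integer> snapshot;
    synchronized (this) {
      // Copy under the lock; the live HashMap is never iterated directly.
      snapshot = new ArrayList<Integer>(resourceRequests.values());
    }
    int demand = 0;
    for (int containers : snapshot) {  // safe: iterating the private copy
      demand += containers;
    }
    return demand;
  }
}
{noformat}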
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13870908#comment-13870908 ] Lohit Vijayarenu commented on YARN-1530:

Yes, a proxy server inside the library, but only in the AM, not in the containers. Containers could make REST calls to the AM. The main advantage is that we would not send timeline data to one single server. For example, we have seen cases where our history files could grow up to 700MB for large jobs. Hundreds of such jobs would easily become a bottleneck for a single REST endpoint; distributing the work to each job's own AM would help.

--
[Umbrella] Store, manage and serve per-framework application-timeline data

Key: YARN-1530
URL: https://issues.apache.org/jira/browse/YARN-1530
Project: Hadoop YARN
Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Attachments: application timeline design-20140108.pdf

This is a sibling JIRA for YARN-321. Today, each application/framework has to store and serve per-framework data all by itself, as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t. interpretation and serving.
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868242#comment-13868242 ] Lohit Vijayarenu commented on YARN-1530:

We might also have to think about the data transfer rate to the REST endpoint from all AMs/containers if this is hosted by the ResourceManager. One idea could be to have the REST endpoint be a library which any AM can inherit. When the AM initializes, this library can bring up a REST endpoint which can then push events to pluggable storage (HDFS/Kafka, ...). This would be similar to how the AM writes history events to HDFS today, and should give good scalability without changing much from an API perspective.
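A minimal sketch of the library idea being floated here. The TimelineSink interface and the class names are hypothetical, invented purely for illustration; they are not an existing YARN API.

{noformat}
import java.io.Closeable;
import java.io.IOException;

// Hypothetical AM-side library: accepts events (e.g. from containers via
// REST) and pushes them to pluggable storage, mirroring how the MR AM
// writes history events to HDFS today.
interface TimelineSink extends Closeable {
  void putEvent(String entityId, String eventJson) throws IOException;
}

// One possible backend for local testing; an HDFS- or Kafka-backed sink
// would implement the same interface.
class LoggingSink implements TimelineSink {
  @Override
  public void putEvent(String entityId, String eventJson) {
    System.out.println(entityId + " -> " + eventJson);
  }

  @Override
  public void close() {
    // nothing to release for the logging backend
  }
}
{noformat}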
[jira] [Commented] (YARN-85) Allow per job log aggregation configuration
[ https://issues.apache.org/jira/browse/YARN-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822192#comment-13822192 ] Lohit Vijayarenu commented on YARN-85:

Patch looks good to me. Can anyone else also take a look at the patch?

--
Allow per job log aggregation configuration

Key: YARN-85
URL: https://issues.apache.org/jira/browse/YARN-85
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager
Reporter: Siddharth Seth
Assignee: Chris Trezzo
Priority: Critical

Currently, if log aggregation is enabled for a cluster, logs for all jobs will be aggregated, leading to a whole bunch of files on HDFS which users may not want. Users should be able to control this along with the aggregation policy - failed only, all, etc.
[jira] [Resolved] (YARN-546) mapred.fairscheduler.eventlog.enabled removed from Hadoop 2.0
[ https://issues.apache.org/jira/browse/YARN-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu resolved YARN-546.

Resolution: Duplicate

Resolving as a duplicate of YARN-1383.

--
mapred.fairscheduler.eventlog.enabled removed from Hadoop 2.0

Key: YARN-546
URL: https://issues.apache.org/jira/browse/YARN-546
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Attachments: YARN-546.1.patch

Hadoop 1.0 supported an option to turn on/off FairScheduler event logging using mapred.fairscheduler.eventlog.enabled. In Hadoop 2.0, it looks like this option has been removed (or not ported?), which causes event logging to be enabled by default with no way to turn it off.
[jira] [Commented] (YARN-1383) Remove node updates from the Fair Scheduler event log
[ https://issues.apache.org/jira/browse/YARN-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822210#comment-13822210 ] Lohit Vijayarenu commented on YARN-1383:

On big clusters, logging for each heartbeat is too much. To debug whether NodeManagers are heartbeating, we could use other methods like network connections, stack traces and such. +1 on removing this line.

--
Remove node updates from the Fair Scheduler event log

Key: YARN-1383
URL: https://issues.apache.org/jira/browse/YARN-1383
Project: Hadoop YARN
Issue Type: Improvement
Components: scheduler
Affects Versions: 2.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Attachments: YARN-1383.patch

Writing out a line whenever a node heartbeats is not useful and just too much.
[jira] [Resolved] (YARN-290) Wrong cluster metrics on RM page with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu resolved YARN-290.

Resolution: Duplicate

Closing as a duplicate of YARN-282.

--
Wrong cluster metrics on RM page with FairScheduler

Key: YARN-290
URL: https://issues.apache.org/jira/browse/YARN-290
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

The ResourceManager always seems to show a few (1-3) applications in pending state on the ResourceManager webpage under the Cluster Metrics tab, while there are no pending applications. It is very easy to replicate: start the RM, submit one job, and you will see 2 pending applications, which is incorrect.
[jira] [Commented] (YARN-1206) Container logs link is broken on RM web UI after application finished
[ https://issues.apache.org/jira/browse/YARN-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769199#comment-13769199 ] Lohit Vijayarenu commented on YARN-1206:

We have seen a case where, upon enabling log aggregation, container links are broken because the logs are aggregated to HDFS. If the link is not updated to point to the history server, it will be broken; that is, the NodeManager will not be able to display the logs (since they have been aggregated to HDFS).

One way to reproduce this is to run an application to completion, then click the application link, something like http://hadoop-rm-host:port/cluster/app/application_1379365648572_0001, and then click the 'logs' link next to the ApplicationMaster attempt. This will point to a page on the NM displaying the message below:

Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.

--
Container logs link is broken on RM web UI after application finished

Key: YARN-1206
URL: https://issues.apache.org/jira/browse/YARN-1206
Project: Hadoop YARN
Issue Type: Bug
Reporter: Jian He
Priority: Blocker
Labels: 2.1.1-beta

When the container is running, its logs link works properly, but after the application is finished, the link shows 'Container does not exist.'
[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM
[ https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-305:

Attachment: YARN-305.2.patch

Sorry, somehow missed the review comments in email. These are the only log messages which seem to fill up the RM output as of now.

--
Too many 'Node offerred to app:... messages in RM

Key: YARN-305
URL: https://issues.apache.org/jira/browse/YARN-305
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Lohit Vijayarenu
Assignee: Lohit Vijayarenu
Priority: Minor
Attachments: YARN-305.1.patch, YARN-305.2.patch

Running the fair scheduler, YARN shows that the RM logs many messages like the one below.

{noformat}
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: Node offered to app: application_1357147147433_0002 reserved: false
{noformat}

They don't tell much, and the same line is dumped many times in the RM log. It would be good to improve the message with node information or move it to some other logging level with enough debug information.
[jira] [Created] (YARN-1122) FairScheduler user-as-default-queue always defaults to 'default'
Lohit Vijayarenu created YARN-1122:

Summary: FairScheduler user-as-default-queue always defaults to 'default'
Key: YARN-1122
URL: https://issues.apache.org/jira/browse/YARN-1122
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Lohit Vijayarenu

By default the YARN FairScheduler should use the user name as the queue name, but we see that in our clusters all jobs were ending up in the default queue. Even after picking up YARN-333, which is part of trunk, the behavior remains the same. Jobs do end up in the right queue, but from the UI perspective they are shown as running under the default queue. It looks like there is a small bug with

{noformat}
RMApp rmApp = rmContext.getRMApps().get(applicationAttemptId);
{noformat}

which should actually be

{noformat}
RMApp rmApp = rmContext.getRMApps().get(applicationAttemptId.getApplicationId());
{noformat}

There is also a simple JS change needed for the filtering of jobs on the FairScheduler UI page.
[jira] [Updated] (YARN-1122) FairScheduler user-as-default-queue always defaults to 'default'
[ https://issues.apache.org/jira/browse/YARN-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-1122:

Attachment: YARN-1122.1.patch

Simple patch to fix this.
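For context on why the broken lookup compiles at all: java.util.Map#get takes Object, so passing an ApplicationAttemptId to a map keyed by ApplicationId is legal Java but always misses. A minimal sketch of the corrected lookup, assuming the map shape implied by the description above:

{noformat}
// Map#get(Object) accepts any type, so the ApplicationAttemptId lookup
// compiled but always returned null; key by the ApplicationId instead.
ApplicationId appId = applicationAttemptId.getApplicationId();
RMApp rmApp = rmContext.getRMApps().get(appId);
{noformat}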
[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM
[ https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-305:

Attachment: YARN-305.1.patch

Simple patch to change the log level to debug and add node information. I also saw a similar case while offering a node to a queue, so added node information there as well. Could not think of a test case, as this only changes the log level.
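A minimal sketch of the change being described — the offer message moved to guarded DEBUG level with node information added — assuming the commons-logging style used across the Hadoop 2.x schedulers; the class and method names here are illustrative, not the attached patch:

{noformat}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class AppSchedulableLogging {
  private static final Log LOG = LogFactory.getLog(AppSchedulableLogging.class);

  // Before: an unconditional INFO line on every node heartbeat.
  // After: DEBUG level, guarded, and carrying the node so it is actionable.
  void logOffer(String appId, String nodeName, boolean reserved) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Node " + nodeName + " offered to app: " + appId
          + " reserved: " + reserved);
    }
  }
}
{noformat}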
[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM
[ https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-305:

Attachment: (was: YARN-305.1.patch)
[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM
[ https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-305:

Attachment: YARN-305.1.patch

Had generated the diff from an old branch. Reattaching the diff.
[jira] [Created] (YARN-1032) NPE in RackResolve
Lohit Vijayarenu created YARN-1032:

Summary: NPE in RackResolve
Key: YARN-1032
URL: https://issues.apache.org/jira/browse/YARN-1032
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.0.5-alpha
Environment: linux
Reporter: Lohit Vijayarenu
Priority: Minor

We found a case where our rack resolution script was not returning a rack due to a problem resolving the host address. This surfaced in RackResolver.java as an NPE, ultimately caught in RMContainerAllocator:

{noformat}
2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99)
    at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243)
    at java.lang.Thread.run(Thread.java:722)
{noformat}
[jira] [Commented] (YARN-1032) NPE in RackResolve
[ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729977#comment-13729977 ] Lohit Vijayarenu commented on YARN-1032:

Once we hit the exception in RackResolver, since it is not caught and a default rack is not returned, we end up not releasing containers which could not be assigned in RMContainerAllocator.java:

{noformat}
assignContainers(allocatedContainers);

// release container if we could not assign it
it = allocatedContainers.iterator();
while (it.hasNext()) {
  Container allocated = it.next();
  LOG.info("Releasing unassigned and invalid container " + allocated
      + ". RM may have assignment issues");
  containerNotAssigned(allocated);
}
{noformat}

The AM would no longer ask for new containers, since it thinks the containers are assigned, and the RM assumes the containers are allocated to the AM. The job ends up hanging forever without making any progress. Fixing the container release might be part of another JIRA; at the minimum we need to catch the exception and return the default rack in case of failure.
[jira] [Updated] (YARN-1032) NPE in RackResolve
[ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-1032:

Attachment: YARN-1032.1.patch

Simple patch to catch the NPE and return the default rack. Since it just catches an NPE, I did not try to come up with a test case. Let me know if this looks good.
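A minimal sketch of the guard being described — fall back to the default rack when the topology mapping fails to resolve a host, instead of letting the NPE escape into the allocator. This assumes the rough shape of RackResolver#coreResolve and is illustrative, not necessarily the attached patch:

{noformat}
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

public class SafeRackResolve {
  // If the mapping returns null (or a null entry), use /default-rack
  // rather than dereferencing the missing result.
  static Node resolveWithFallback(DNSToSwitchMapping mapping, String hostName) {
    List<String> rNameList = mapping.resolve(Collections.singletonList(hostName));
    String rName = (rNameList == null || rNameList.get(0) == null)
        ? NetworkTopology.DEFAULT_RACK   // resolution failed
        : rNameList.get(0);
    return new NodeBase(hostName, rName);
  }
}
{noformat}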
[jira] [Commented] (YARN-1032) NPE in RackResolve
[ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730167#comment-13730167 ] Lohit Vijayarenu commented on YARN-1032:

[~zjshen] Yes, the documentation does not mention returning null from resolve(), but if you look into RawScriptBasedMapping::resolve(), a failure to resolve the rack can return null in at least two places, hence the null check. Thanks for pointing out TestRackResolver; I will try to add a test case.
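A sketch of the kind of test case mentioned here, assuming the Hadoop 2.x DNSToSwitchMapping interface and JUnit 4 setup used by TestRackResolver; the test class name and the NullMapping helper are invented for illustration:

{noformat}
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;
import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.yarn.util.RackResolver;
import org.junit.Assert;
import org.junit.Test;

public class TestRackResolverNullMapping {

  /** Behaves like a failing topology script: resolve() returns null. */
  public static final class NullMapping implements DNSToSwitchMapping {
    @Override
    public List<String> resolve(List<String> names) { return null; }
    public void reloadCachedMappings() { }
    public void reloadCachedMappings(List<String> names) { }
  }

  @Test
  public void testNullMappingFallsBackToDefaultRack() {
    Configuration conf = new Configuration();
    conf.setClass(
        CommonConfigurationKeysPublic.NET_TOPOLOGY_NODE_SWITCH_MAPPING_IMPL_KEY,
        NullMapping.class, DNSToSwitchMapping.class);
    RackResolver.init(conf);
    // With the guard in place, an unresolvable host lands on /default-rack.
    Assert.assertEquals(NetworkTopology.DEFAULT_RACK,
        RackResolver.resolve("host1").getNetworkLocation());
  }
}
{noformat}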
[jira] [Commented] (YARN-666) [Umbrella] Support rolling upgrades in YARN
[ https://issues.apache.org/jira/browse/YARN-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13659686#comment-13659686 ] Lohit Vijayarenu commented on YARN-666:

This looks good. A few minor points: JIRAs for metrics, reporting and UI page updates when daemons run different YARN versions should also be included. As Karthik already mentioned, it would be very useful if this followed HDFS-2983; that would help people who manage clusters and do rolling upgrades.

Another question regarding draining a NodeManager: do we have a concept of blacklisting a NodeManager today? The reason I ask is that if we know we can afford to kill the running apps on a NodeManager but do not want new jobs submitted to it, one could potentially use blacklisting.

--
[Umbrella] Support rolling upgrades in YARN

Key: YARN-666
URL: https://issues.apache.org/jira/browse/YARN-666
Project: Hadoop YARN
Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Siddharth Seth
Attachments: YARN_Rolling_Upgrades.pdf, YARN_Rolling_Upgrades_v2.pdf

Jira to track changes required in YARN to allow rolling upgrades, including documentation and possible upgrade routes.
[jira] [Resolved] (YARN-356) Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env
[ https://issues.apache.org/jira/browse/YARN-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu resolved YARN-356.

Resolution: Invalid

--
Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env

Key: YARN-356
URL: https://issues.apache.org/jira/browse/YARN-356
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager, resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Lohit Vijayarenu

At present it is difficult to set different Xmx values for the RM and NM without having different yarn-env.sh files. Like HDFS, it would be good to have YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS.
[jira] [Resolved] (YARN-307) NodeManager should log container launch command.
[ https://issues.apache.org/jira/browse/YARN-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu resolved YARN-307.

Resolution: Invalid

Resolving as invalid.

--
NodeManager should log container launch command.

Key: YARN-307
URL: https://issues.apache.org/jira/browse/YARN-307
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Labels: usability

The NodeManager's DefaultContainerExecutor seems to log only the path of the container executor script instead of the contents of the script. It would be good to log the execution command so that one could see what is being launched.
[jira] [Updated] (YARN-546) mapred.fairscheduler.eventlog.enabled removed from Hadoop 2.0
[ https://issues.apache.org/jira/browse/YARN-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-546:

Attachment: YARN-546.1.patch

It looks like not much is logged in the event log, and the majority seems to be just node updates. If no one votes against having this removed, then here is a patch for review.
[jira] [Commented] (YARN-451) Add more metrics to RM page
[ https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13627203#comment-13627203 ] Lohit Vijayarenu commented on YARN-451:

Tried to see if adding the total number of containers was a trivial change, but it looks like there is no notion of an application's maximum resource usage available to the ResourceManager. This might be the reason why the RM page does not have the information. Looking into RMAppImpl shows that this kind of information is not passed from either the Client or the AM to the RM during application initialization. The closest thing to a notion of job weight I could see was resource demand, but that seems to change based on how an application requests containers; for example, FairScheduler seems to recalculate fair share based on how much resource demand is passed by applications.

One option I can think of is to add an additional protobuf field which specifies the total number of containers/resources an application might use. This would be an optional field, used only by MapReduce for now, and the client could set this value based on the number of mappers/reducers. I am not sure if this is the right approach; can anyone suggest a simpler idea? (A rough sketch follows the issue summary below.)

--
Add more metrics to RM page

Key: YARN-451
URL: https://issues.apache.org/jira/browse/YARN-451
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

The ResourceManager web UI shows the list of RUNNING applications, but it does not tell which applications are requesting more resources compared to others. With a cluster running hundreds of applications at once, it would be useful to have some kind of metric to show high-resource-usage applications vs. low-resource-usage ones. At the minimum, showing the number of containers is a good option.
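To make the proposal concrete, a hypothetical sketch of the client side. Nothing here is an existing YARN API: the SubmissionContextWithHint interface and setTotalContainersHint stand in for the proposed optional protobuf field.

{noformat}
// Hypothetical illustration only; this interface models the *proposed*
// optional hint, not an existing YARN type.
interface SubmissionContextWithHint {
  // Optional: total containers the app expects to use over its lifetime.
  void setTotalContainersHint(int totalContainers);
}

class MapReduceClient {
  // An MR client could derive the hint from its task counts (+1 for the AM).
  void applyHint(SubmissionContextWithHint ctx, int numMaps, int numReduces) {
    ctx.setTotalContainersHint(numMaps + numReduces + 1);
  }
}
{noformat}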
[jira] [Created] (YARN-502) RM crash with NPE on NODE_REMOVED event
Lohit Vijayarenu created YARN-502:

Summary: RM crash with NPE on NODE_REMOVED event
Key: YARN-502
URL: https://issues.apache.org/jira/browse/YARN-502
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu

While running some tests and adding/removing nodes, we saw the RM crash with the exception below. We are testing with the fair scheduler and running hadoop-2.0.3-alpha.

{noformat}
2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node :55680 as it is now LOST
2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: :55680 Node Transitioned from UNHEALTHY to LOST
2013-03-22 18:54:27,015 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:619)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:856)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
    at java.lang.Thread.run(Thread.java:662)
2013-03-22 18:54:27,016 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2013-03-22 18:54:27,020 INFO org.mortbay.log: Stopped SelectChannelConnector@:50030
{noformat}
[jira] [Created] (YARN-451) Add more metrics to RM page
Lohit Vijayarenu created YARN-451:

Summary: Add more metrics to RM page
Key: YARN-451
URL: https://issues.apache.org/jira/browse/YARN-451
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor
[jira] [Created] (YARN-402) Dispatcher warn message is too late
Lohit Vijayarenu created YARN-402:

Summary: Dispatcher warn message is too late
Key: YARN-402
URL: https://issues.apache.org/jira/browse/YARN-402
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Lohit Vijayarenu
Priority: Minor

AsyncDispatcher warns when the remaining capacity is less than 1000:

{noformat}
if (remCapacity < 1000) {
  LOG.warn("Very low remaining capacity in the event-queue: " + remCapacity);
}
{noformat}

What would be useful is to warn much before that, maybe when the queue is half full, instead of when it is nearly completely full. The eventQueue capacity is an int value, so if we warn only when the queue has just 1000 slots of capacity left, the service definitely already has a serious problem.
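A minimal sketch of the earlier-warning idea, not the AsyncDispatcher code itself: warn when the queue crosses a fraction of its capacity rather than a fixed 1000-slot floor. The capacity constant, threshold and class name are illustrative.

{noformat}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventQueueMonitor {
  private static final int CAPACITY = 1_000_000;
  private final BlockingQueue<Object> eventQueue =
      new LinkedBlockingQueue<Object>(CAPACITY);

  void checkCapacity() {
    int remCapacity = eventQueue.remainingCapacity();
    if (remCapacity < CAPACITY / 2) {  // half full, not almost completely full
      System.err.println("Event queue is over half full; remaining capacity: "
          + remCapacity);
    }
  }
}
{noformat}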
[jira] [Created] (YARN-356) Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env
Lohit Vijayarenu created YARN-356:

Summary: Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env
Key: YARN-356
URL: https://issues.apache.org/jira/browse/YARN-356
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager, resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Lohit Vijayarenu
[jira] [Created] (YARN-351) ResourceManager NPE during allocateNodeLocal
Lohit Vijayarenu created YARN-351:

Summary: ResourceManager NPE during allocateNodeLocal
Key: YARN-351
URL: https://issues.apache.org/jira/browse/YARN-351
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Lohit Vijayarenu
Priority: Critical

The ResourceManager seems to die due to the NPE shown below in FairScheduler. This is easily reproduced on a cluster with multiple racks and nodes within each rack; a simple job with multiple tasks on each node triggers the NPE in the RM. Without understanding the actual workings, I tried a null check, which looked like it solved the problem, but I am not sure if that is the right behavior yet. I feel this is serious enough to be marked as a blocker; what do you guys think?

{noformat}
2013-01-22 20:07:45,073 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: allocate: applicationId=application_1358885180585_0001 container=container_1358885180585_0001_01_000830 host=x.x.x.x:36186
2013-01-22 20:07:45,074 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:259)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:220)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp.allocate(FSSchedulerApp.java:544)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:250)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:318)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:180)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:796)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:859)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
    at java.lang.Thread.run(Thread.java:662)
2013-01-22 20:07:45,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{noformat}
[jira] [Resolved] (YARN-351) ResourceManager NPE during allocateNodeLocal
[ https://issues.apache.org/jira/browse/YARN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu resolved YARN-351.

Resolution: Duplicate

Thanks [~sandyr]. It does look like it is solved by YARN-335. I was running a build from one or two days before your fix.
[jira] [Resolved] (YARN-287) NodeManager logs incorrect physical/virtual memory values
[ https://issues.apache.org/jira/browse/YARN-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu resolved YARN-287.

Resolution: Invalid

Thanks for the explanation. Closing as invalid.

--
NodeManager logs incorrect physical/virtual memory values

Key: YARN-287
URL: https://issues.apache.org/jira/browse/YARN-287
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

The NodeManager does not log the correct configured physical or virtual memory values while killing containers.
[jira] [Updated] (YARN-324) Provide way to preserve container directories
[ https://issues.apache.org/jira/browse/YARN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-324:

Summary: Provide way to preserve container directories (was: Provide way to preserve )

--
Provide way to preserve container directories

Key: YARN-324
URL: https://issues.apache.org/jira/browse/YARN-324
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager, resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu

There should be a way to preserve container directories (along with filecache/appcache) for offline debugging. As of today, when a container completes (either success or failure) it gets cleaned up. In case of failure it becomes very hard to find out what the cause of the failure was. Having the ability to preserve container directories would let one log into the machine and debug the failure further.
[jira] [Created] (YARN-307) NodeManager should log container launch command.
Lohit Vijayarenu created YARN-307:

Summary: NodeManager should log container launch command.
Key: YARN-307
URL: https://issues.apache.org/jira/browse/YARN-307
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
[jira] [Commented] (YARN-307) NodeManager should log container launch command.
[ https://issues.apache.org/jira/browse/YARN-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13542616#comment-13542616 ] Lohit Vijayarenu commented on YARN-307:

For example, I am seeing a container launch failure without any useful message, like this:

{noformat}
2013-01-03 00:33:49,045 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's health-status : true,
2013-01-03 00:33:49,090 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from task is : 1
2013-01-03 00:33:49,090 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
{noformat}

The script seems to exit with an exit code of 1. To debug further, I wanted to see the command being executed, but in the logs I can see only the line shown below:

{noformat}
2013-01-03 00:33:46,591 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /data/disk2/yarn/local/usercache/hadoop/appcache/application_1357147147433_0011/container_1357147147433_0011_01_01/default_container_executor.sh]
{noformat}

Once the task fails, this directory is cleaned up, and there seems to be no easy way to find out why the container is failing. It would be good to log the contents of default_container_executor.sh along with the path.
[jira] [Created] (YARN-290) Wrong cluster metrics on RM page
Lohit Vijayarenu created YARN-290:

Summary: Wrong cluster metrics on RM page
Key: YARN-290
URL: https://issues.apache.org/jira/browse/YARN-290
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor