[jira] [Commented] (YARN-3214) Add non-exclusive node labels

2015-03-24 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379263#comment-14379263
 ] 

Lohit Vijayarenu commented on YARN-3214:


Thanks [~wangda] for the reply. I feel that having partitions and constraints as 
two separate entities will cause more confusion. If allocation is the challenge 
(as you described in the example with multiple labels), isn't that something 
that should be solved in the scheduler? It is the same problem one would have 
even without labels: for a given node which advertises 10G of memory and 
apps/queues X and Y, how would you divide the resource between X and Y? 
PS: The Mesos scheduler, for example, uses a term called constraints, which is 
similar to labels. In that sense I agree with [~vinodkv] that we should probably 
call this feature partitions or something related. 

 Add non-exclusive node labels 
 --

 Key: YARN-3214
 URL: https://issues.apache.org/jira/browse/YARN-3214
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: Non-exclusive-Node-Partition-Design.pdf


 Currently node labels partition the cluster into sub-clusters, so resources 
 cannot be shared between partitions. 
 With the current implementation of node labels we cannot use the cluster 
 optimally, and the throughput of the cluster will suffer.
 We are proposing adding non-exclusive node labels:
 1. Labeled apps get preference on labeled nodes. 
 2. If there is no ask for labeled resources, we can assign those nodes to 
 non-labeled apps.
 3. If there is a future ask for those resources, we will preempt the 
 non-labeled apps and give the resources back to labeled apps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3214) Add non-exclusive node labels

2015-03-24 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378541#comment-14378541
 ] 

Lohit Vijayarenu commented on YARN-3214:


bq. (P0) A node can belong to at most one partition. All nodes belong to a 
DEFAULT partition unless overridden.

Does this mean that on a node we can have only one label? If so, it would become 
too restrictive. Labels on nodes can be seen in multiple dimensions (app 
resources, machine resources, and also use-case resources, e.g. backfill jobs 
being placed on a specific set of nodes). In those cases we should have the 
ability to put multiple labels on a node. 

Also, the documents mention apps without any labels being scheduled on labeled 
nodes if resources are idle. Does that also cover apps which have a label other 
than A/B still being placed on these nodes when free resources are available?

 Add non-exclusive node labels 
 --

 Key: YARN-3214
 URL: https://issues.apache.org/jira/browse/YARN-3214
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: Non-exclusive-Node-Partition-Design.pdf


 Currently node labels partition the cluster into sub-clusters, so resources 
 cannot be shared between partitions. 
 With the current implementation of node labels we cannot use the cluster 
 optimally, and the throughput of the cluster will suffer.
 We are proposing adding non-exclusive node labels:
 1. Labeled apps get preference on labeled nodes. 
 2. If there is no ask for labeled resources, we can assign those nodes to 
 non-labeled apps.
 3. If there is a future ask for those resources, we will preempt the 
 non-labeled apps and give the resources back to labeled apps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-09-12 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131732#comment-14131732
 ] 

Lohit Vijayarenu commented on YARN-2314:


We hit the same problem on one of our large clusters with more than 2.5K nodes. 
As a workaround we ended up increasing the AM container size to 6G, and with a 
vmem-to-pmem ratio of 2:1 we give away 12G of virtual memory for the AM 
container. From an initial look at this, there is no way to turn this behavior 
off via config, other than patching the code, right?

 ContainerManagementProtocolProxy can create thousands of threads for a large 
 cluster
 

 Key: YARN-2314
 URL: https://issues.apache.org/jira/browse/YARN-2314
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Priority: Critical
 Attachments: nmproxycachefix.prototype.patch


 ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
 this cache is configurable.  However the cache can grow far beyond the 
 configured size when running on a large cluster and blow AM address/container 
 limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-09-12 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131920#comment-14131920
 ] 

Lohit Vijayarenu commented on YARN-2314:


Thanks [~jlowe]

 ContainerManagementProtocolProxy can create thousands of threads for a large 
 cluster
 

 Key: YARN-2314
 URL: https://issues.apache.org/jira/browse/YARN-2314
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Priority: Critical
 Attachments: disable-cm-proxy-cache.patch, 
 nmproxycachefix.prototype.patch


 ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
 this cache is configurable.  However the cache can grow far beyond the 
 configured size when running on a large cluster and blow AM address/container 
 limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-06-23 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041128#comment-14041128
 ] 

Lohit Vijayarenu commented on YARN-796:
---

As [~tucu00] mentioned, a label sounds closely related to affinity and should be 
treated less as a resource. It becomes closely related to resources when it 
comes to exposing labels on scheduler queues and to users who wish to schedule 
their jobs on a certain set of labeled nodes. This is definitely a very useful 
feature to have. Looking forward to the design document. 

 Allow for (admin) labels on nodes and resource-requests
 ---

 Key: YARN-796
 URL: https://issues.apache.org/jira/browse/YARN-796
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Wangda Tan
 Attachments: YARN-796.patch


 It will be useful for admins to specify labels for nodes. Examples of labels 
 are OS, processor architecture etc.
 We should expose these labels and allow applications to specify labels on 
 resource-requests.
 Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1692) ConcurrentModificationException in fair scheduler AppSchedulable

2014-02-12 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899506#comment-13899506
 ] 

Lohit Vijayarenu commented on YARN-1692:


+1 on the patch. Can anyone else review this as well?

 ConcurrentModificationException in fair scheduler AppSchedulable
 

 Key: YARN-1692
 URL: https://issues.apache.org/jira/browse/YARN-1692
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: yarn-1692.patch


 We saw a ConcurrentModificationException thrown in the fair scheduler:
 {noformat}
 2014-02-07 01:40:01,978 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
 Exception in fair scheduler UpdateThread
 java.util.ConcurrentModificationException
 at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
 at java.util.HashMap$ValueIterator.next(HashMap.java:954)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.updateDemand(AppSchedulable.java:85)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.updateDemand(FSLeafQueue.java:125)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.updateDemand(FSParentQueue.java:82)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:217)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:195)
 at java.lang.Thread.run(Thread.java:724)
 {noformat}
 The map that gets returned by FSSchedulerApp.getResourceRequests() is 
 iterated on without proper synchronization.
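 As a rough sketch of the kind of guard this needs (illustrative only, not the 
 committed patch), copying the values under a lock before iterating avoids the 
 ConcurrentModificationException, as long as writers synchronize on the same lock:
 {noformat}
 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;

 /** Illustrative guard against the race above; not the actual YARN-1692 patch. */
 public class SnapshotIteration {
   private final Map<String, Integer> resourceRequests = new HashMap<String, Integer>();

   public int updateDemand() {
     List<Integer> snapshot;
     // Copy the values under the map's lock so concurrent put/remove calls
     // cannot invalidate the iterator used below.
     synchronized (resourceRequests) {
       snapshot = new ArrayList<Integer>(resourceRequests.values());
     }
     int demand = 0;
     for (int r : snapshot) {
       demand += r;
     }
     return demand;
   }
 }
 {noformat}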



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-01-14 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13870908#comment-13870908
 ] 

Lohit Vijayarenu commented on YARN-1530:


Yes, a proxy server inside the library, but only in the AM, not in the 
containers. Containers could make REST calls to the AM. The main advantage is 
that we would not send timeline data to one single server. For example, we have 
seen cases where our history files grow up to 700MB for large jobs. With 
hundreds of such jobs, a single REST endpoint would easily become a bottleneck, 
whereas distributing the work to each job's own AM would help. 

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
 Attachments: application timeline design-20140108.pdf


 This is a sibling JIRA for YARN-321.
 Today, each application/framework has to do store, and serve per-framework 
 data all by itself as YARN doesn't have a common solution. This JIRA attempts 
 to solve the storage, management and serving of per-framework data from 
 various applications, both running and finished. The aim is to change YARN to 
 collect and store data in a generic manner with plugin points for frameworks 
 to do their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-01-10 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868242#comment-13868242
 ] 

Lohit Vijayarenu commented on YARN-1530:


We might also have to think about the data transfer rate to the REST endpoint 
from all AMs/containers if it is hosted by the ResourceManager. One idea could 
be to make the REST endpoint a library which any AM can inherit. When the AM 
initializes, this library can start the REST endpoint, which can then push 
events to pluggable storage (HDFS/Kafka ...). This might be similar to how the 
AM writes history events to HDFS today. This should give good scalability 
without changing much from an API perspective. 
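Roughly, the library the AM inherits could expose a sink abstraction like the 
one below, with HDFS- or Kafka-backed implementations plugged in at init time 
(purely illustrative; none of these names exist in YARN):
{noformat}
import java.io.Closeable;
import java.io.IOException;

/** Rough sketch of the pluggable-storage idea; not an actual YARN interface. */
public interface TimelineEventSink extends Closeable {
  /** Persist one framework event, e.g. to HDFS or Kafka. */
  void putEvent(String entityId, String eventType, byte[] payload) throws IOException;

  /** Flush and release any underlying writers/producers. */
  void close() throws IOException;
}
{noformat}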

 [Umbrella] Store, manage and serve per-framework application-timeline data
 --

 Key: YARN-1530
 URL: https://issues.apache.org/jira/browse/YARN-1530
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
 Attachments: application timeline design-20140108.pdf


 This is a sibling JIRA for YARN-321.
 Today, each application/framework has to do store, and serve per-framework 
 data all by itself as YARN doesn't have a common solution. This JIRA attempts 
 to solve the storage, management and serving of per-framework data from 
 various applications, both running and finished. The aim is to change YARN to 
 collect and store data in a generic manner with plugin points for frameworks 
 to do their own thing w.r.t interpretation and serving.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-85) Allow per job log aggregation configuration

2013-11-13 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822192#comment-13822192
 ] 

Lohit Vijayarenu commented on YARN-85:
--

Patch looks good to me. Can anyone else also take a look at the patch?

 Allow per job log aggregation configuration
 ---

 Key: YARN-85
 URL: https://issues.apache.org/jira/browse/YARN-85
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Siddharth Seth
Assignee: Chris Trezzo
Priority: Critical

 Currently, if log aggregation is enabled for a cluster - logs for all jobs 
 will be aggregated - leading to a whole bunch of files on hdfs which users 
 may not want.
 Users should be able to control this along with the aggregation policy - 
 failed only, all, etc.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (YARN-546) mapred.fairscheduler.eventlog.enabled removed from Hadoop 2.0

2013-11-13 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu resolved YARN-546.
---

Resolution: Duplicate

Resolving as a duplicate of YARN-1383.

 mapred.fairscheduler.eventlog.enabled removed from Hadoop 2.0
 -

 Key: YARN-546
 URL: https://issues.apache.org/jira/browse/YARN-546
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
 Attachments: YARN-546.1.patch


 Hadoop 1.0 supported an option to turn on/off FairScheduler event logging 
 using mapred.fairscheduler.eventlog.enabled. In Hadoop 2.0, it looks like 
 this option has been removed (or not ported?) which causes event logging to 
 be enabled by default and there is no way to turn it off.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1383) Remove node updates from the Fair Scheduler event log

2013-11-13 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822210#comment-13822210
 ] 

Lohit Vijayarenu commented on YARN-1383:


On big clusters, logging each heartbeat is too much. To debug whether 
NodeManagers are heartbeating, we could use other methods such as network 
connections, stack traces, and so on. +1 on removing this line.

 Remove node updates from the Fair Scheduler event log
 -

 Key: YARN-1383
 URL: https://issues.apache.org/jira/browse/YARN-1383
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-1383.patch


 Writing out a line whenever a node heartbeats is not useful and just too much.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (YARN-290) Wrong cluster metrics on RM page with FairScheduler

2013-09-17 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu resolved YARN-290.
---

Resolution: Duplicate

Closing as a duplicate of YARN-282.

 Wrong cluster metrics on RM page with FairScheduler
 ---

 Key: YARN-290
 URL: https://issues.apache.org/jira/browse/YARN-290
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

 ResourceManager seems to always show a few (1-3) applications in the pending 
 state on the ResourceManager webpage under the Cluster Metrics tab, while 
 there are no pending applications. It is very easy to replicate: start the RM, 
 submit one job, and you will see 2 pending applications, which is incorrect.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1206) Container logs link is broken on RM web UI after application finished

2013-09-16 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769199#comment-13769199
 ] 

Lohit Vijayarenu commented on YARN-1206:


We have seen a case where, upon enabling log aggregation, container links are 
broken because the logs are aggregated to HDFS. If the link is not updated to 
point to the history server, it will be a broken link. Broken here means that 
the NodeManager will not be able to display the logs (since they have been 
aggregated to HDFS).

One way to reproduce this is to run an application to completion, then click 
the application link, something like
http://hadoop-rm-host:port/cluster/app/application_1379365648572_0001

Then click on the 'logs' link next to the ApplicationMaster attempt. This will 
point to a page on the NM displaying the messages below:
Failed while trying to construct the redirect url to the log server. Log 
Server url may not be configured
Unknown container. Container either has not started or has already completed or 
doesn't belong to this node at all.


 Container logs link is broken on RM web UI after application finished
 -

 Key: YARN-1206
 URL: https://issues.apache.org/jira/browse/YARN-1206
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Priority: Blocker
  Labels: 2.1.1-beta

 When container is running, its logs link works properly, but after the 
 application is finished, the link shows 'Container does not exist.'

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM

2013-09-09 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-305:
--

Attachment: YARN-305.2.patch

Sorry, I somehow missed the review comments in email. These are the only log 
messages which seem to fill up the RM output as of now.

 Too many 'Node offerred to app:... messages in RM
 --

 Key: YARN-305
 URL: https://issues.apache.org/jira/browse/YARN-305
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Lohit Vijayarenu
Assignee: Lohit Vijayarenu
Priority: Minor
 Attachments: YARN-305.1.patch, YARN-305.2.patch


 Running YARN with the FairScheduler shows that the RM has lots of messages like the one below.
 {noformat}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: 
 Node offered to app: application_1357147147433_0002 reserved: false
 {noformat}
 They don't seem to tell much, and the same line is dumped many times in the RM 
 log. It would be good to improve the message with node information or move it 
 to another logging level with enough debug information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1122) FairScheduler user-as-default-queue always defaults to 'default'

2013-08-29 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-1122:
--

 Summary: FairScheduler user-as-default-queue always defaults to 
'default'
 Key: YARN-1122
 URL: https://issues.apache.org/jira/browse/YARN-1122
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Lohit Vijayarenu


By default the YARN FairScheduler should use the user name as the queue name, 
but we see that in our clusters all jobs were ending up in the default queue. 
Even after picking up YARN-333, which is part of trunk, the behavior remains 
the same. Jobs do end up in the right queue, but from the UI perspective they 
are shown as running under the default queue. It looks like there is a small 
bug with

{noformat}
RMApp rmApp = rmContext.getRMApps().get(applicationAttemptId);
{noformat}

which should actually be
{noformat}
RMApp rmApp = 
rmContext.getRMApps().get(applicationAttemptId.getApplicationId());
{noformat}

since getRMApps() is keyed by ApplicationId, so a lookup with an 
ApplicationAttemptId never finds the app.

There is also a simple JS change needed for filtering of jobs on the 
FairScheduler UI page.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1122) FairScheduler user-as-default-queue always defaults to 'default'

2013-08-29 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-1122:
---

Attachment: YARN-1122.1.patch

Simple patch to fix this. 

 FairScheduler user-as-default-queue always defaults to 'default'
 

 Key: YARN-1122
 URL: https://issues.apache.org/jira/browse/YARN-1122
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Lohit Vijayarenu
 Attachments: YARN-1122.1.patch


 By default the YARN FairScheduler should use the user name as the queue name, 
 but we see that in our clusters all jobs were ending up in the default queue. 
 Even after picking up YARN-333, which is part of trunk, the behavior remains 
 the same. Jobs do end up in the right queue, but from the UI perspective they 
 are shown as running under the default queue. It looks like there is a small 
 bug with
 {noformat}
 RMApp rmApp = rmContext.getRMApps().get(applicationAttemptId);
 {noformat}
 which should actually be
 {noformat}
 RMApp rmApp = 
 rmContext.getRMApps().get(applicationAttemptId.getApplicationId());
 {noformat}
 There is also a simple JS change needed for filtering of jobs on the 
 FairScheduler UI page.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM

2013-08-14 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-305:
--

Attachment: YARN-305.1.patch

Simple patch to change the log level to debug and add node information. I also 
saw a similar case while offering a node to a queue, so I added node information 
there as well. Could not think of a test case as this only changes the log level.
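For illustration only (not the exact patch), the reworked message would look 
roughly like this, demoted to debug and carrying the node name:
{noformat}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/** Rough sketch of the logging change; the class and method names are made up. */
public class OfferLogging {
  private static final Log LOG = LogFactory.getLog(OfferLogging.class);

  void logOffer(String appId, String nodeName, boolean reserved) {
    // Demote to debug and include the node so the line is useful when enabled.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Node " + nodeName + " offered to app: " + appId
          + " reserved: " + reserved);
    }
  }
}
{noformat}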

 Too many 'Node offerred to app:... messages in RM
 --

 Key: YARN-305
 URL: https://issues.apache.org/jira/browse/YARN-305
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Lohit Vijayarenu
Priority: Minor
 Attachments: YARN-305.1.patch


 Running YARN with the FairScheduler shows that the RM has lots of messages like the one below.
 {noformat}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: 
 Node offered to app: application_1357147147433_0002 reserved: false
 {noformat}
 They don't seem to tell much, and the same line is dumped many times in the RM 
 log. It would be good to improve the message with node information or move it 
 to another logging level with enough debug information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM

2013-08-14 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-305:
--

Attachment: (was: YARN-305.1.patch)

 Too many 'Node offerred to app:... messages in RM
 --

 Key: YARN-305
 URL: https://issues.apache.org/jira/browse/YARN-305
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Lohit Vijayarenu
Priority: Minor

 Running YARN with the FairScheduler shows that the RM has lots of messages like the one below.
 {noformat}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: 
 Node offered to app: application_1357147147433_0002 reserved: false
 {noformat}
 They don't seem to tell much, and the same line is dumped many times in the RM 
 log. It would be good to improve the message with node information or move it 
 to another logging level with enough debug information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-305) Too many 'Node offerred to app:... messages in RM

2013-08-14 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-305:
--

Attachment: YARN-305.1.patch

I had generated the diff from an old branch. Reattaching the diff.

 Too many 'Node offerred to app:... messages in RM
 --

 Key: YARN-305
 URL: https://issues.apache.org/jira/browse/YARN-305
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Lohit Vijayarenu
Priority: Minor
 Attachments: YARN-305.1.patch


 Running YARN with the FairScheduler shows that the RM has lots of messages like the one below.
 {noformat}
 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: 
 Node offered to app: application_1357147147433_0002 reserved: false
 {noformat}
 They don't seem to tell much, and the same line is dumped many times in the RM 
 log. It would be good to improve the message with node information or move it 
 to another logging level with enough debug information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1032) NPE in RackResolve

2013-08-05 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-1032:
--

 Summary: NPE in RackResolve
 Key: YARN-1032
 URL: https://issues.apache.org/jira/browse/YARN-1032
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.5-alpha
 Environment: linux
Reporter: Lohit Vijayarenu
Priority: Minor


We found a case where our rack resolve script was not returning a rack due to a 
problem with resolving the host address. The exception was seen in 
RackResolver.java as an NPE, ultimately caught in RMContainerAllocator. 

{noformat}
2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING 
RM. 
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99)
at 
org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243)
at java.lang.Thread.run(Thread.java:722)

{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1032) NPE in RackResolve

2013-08-05 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729977#comment-13729977
 ] 

Lohit Vijayarenu commented on YARN-1032:


Once we hit the exception in RackResolver, since it is not caught and a default 
rack is not returned, we end up not releasing the containers which could not be 
assigned in RMContainerAllocator.java:

{noformat}
assignContainers(allocatedContainers);

// release container if we could not assign it
it = allocatedContainers.iterator();
while (it.hasNext()) {
  Container allocated = it.next();
  LOG.info("Releasing unassigned and invalid container "
      + allocated + ". RM may have assignment issues");
  containerNotAssigned(allocated);
}
{noformat}

The AM would no longer ask for new containers since it thinks the containers 
are assigned, and the RM assumes the containers are allocated to the AM. The 
job ends up hanging forever without making any progress. Fixing the container 
release might be part of another JIRA; at a minimum we need to catch the 
exception and return the default rack in case of failure. 
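As a rough sketch (not the attached patch), the fallback could look something 
like this, assuming the DNS-to-switch mapping may return null or an empty list:
{noformat}
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;
import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

/** Illustrative fallback only; not the attached YARN-1032 patch. */
public class SafeRackResolve {
  public static Node resolve(DNSToSwitchMapping mapping, String hostName) {
    List<String> racks = mapping.resolve(Collections.singletonList(hostName));
    String rack;
    if (racks == null || racks.isEmpty() || racks.get(0) == null) {
      // Fall back to the default rack instead of letting an NPE propagate.
      rack = NetworkTopology.DEFAULT_RACK;
    } else {
      rack = racks.get(0);
    }
    return new NodeBase(hostName, rack);
  }
}
{noformat}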

 NPE in RackResolve
 --

 Key: YARN-1032
 URL: https://issues.apache.org/jira/browse/YARN-1032
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.5-alpha
 Environment: linux
Reporter: Lohit Vijayarenu
Priority: Minor

 We found a case where our rack resolve script was not returning a rack due to a 
 problem with resolving the host address. The exception was seen in 
 RackResolver.java as an NPE, ultimately caught in RMContainerAllocator. 
 {noformat}
 2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
 CONTACTING RM. 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99)
   at 
 org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243)
   at java.lang.Thread.run(Thread.java:722)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1032) NPE in RackResolve

2013-08-05 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-1032:
---

Attachment: YARN-1032.1.patch

Simple patch to catch the NPE and return the default rack. Since it just 
catches an NPE, I did not try to come up with a test case. Let me know if this 
looks good.

 NPE in RackResolve
 --

 Key: YARN-1032
 URL: https://issues.apache.org/jira/browse/YARN-1032
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.5-alpha
 Environment: linux
Reporter: Lohit Vijayarenu
Priority: Minor
 Attachments: YARN-1032.1.patch


 We found a case where our rack resolve script was not returning a rack due to a 
 problem with resolving the host address. The exception was seen in 
 RackResolver.java as an NPE, ultimately caught in RMContainerAllocator. 
 {noformat}
 2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
 CONTACTING RM. 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99)
   at 
 org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243)
   at java.lang.Thread.run(Thread.java:722)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1032) NPE in RackResolve

2013-08-05 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730167#comment-13730167
 ] 

Lohit Vijayarenu commented on YARN-1032:


[~zjshen] Yes, the documentation does not mention returning null from resolve(), 
but if you look into RawScriptBasedMapping::resolve(), a failure to resolve the 
rack can return null in at least two places, hence the null check. Thanks for 
pointing out TestRackResolver; I will try to add a test case.

 NPE in RackResolve
 --

 Key: YARN-1032
 URL: https://issues.apache.org/jira/browse/YARN-1032
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.5-alpha
 Environment: linux
Reporter: Lohit Vijayarenu
Priority: Minor
 Attachments: YARN-1032.1.patch


 We found a case where our rack resolve script was not returning a rack due to a 
 problem with resolving the host address. The exception was seen in 
 RackResolver.java as an NPE, ultimately caught in RMContainerAllocator. 
 {noformat}
 2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
 CONTACTING RM. 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99)
   at 
 org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
   at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243)
   at java.lang.Thread.run(Thread.java:722)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-666) [Umbrella] Support rolling upgrades in YARN

2013-05-16 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13659686#comment-13659686
 ] 

Lohit Vijayarenu commented on YARN-666:
---

This looks good. A few minor points/JIRAs should also be included: updating 
metrics, reporting, and UI pages when YARN daemons run different versions. As 
Karthik already mentioned, it would be very useful if this followed HDFS-2983. 
This will be very useful for people who manage clusters and do rolling upgrades.

Another question regarding draining a NodeManager: do we have a concept of 
blacklisting a NodeManager today? The reason I ask is that if we know we can 
afford to kill running apps on a NodeManager but do not want new jobs to be 
submitted to it, one could potentially use blacklisting.

 [Umbrella] Support rolling upgrades in YARN
 ---

 Key: YARN-666
 URL: https://issues.apache.org/jira/browse/YARN-666
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Siddharth Seth
 Attachments: YARN_Rolling_Upgrades.pdf, YARN_Rolling_Upgrades_v2.pdf


 Jira to track changes required in YARN to allow rolling upgrades, including 
 documentation and possible upgrade routes. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-356) Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env

2013-05-13 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu resolved YARN-356.
---

Resolution: Invalid

 Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env
 ---

 Key: YARN-356
 URL: https://issues.apache.org/jira/browse/YARN-356
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Lohit Vijayarenu

 At present it is difficult to set different Xmx values for RM and NM without 
 having different yarn-env.sh. Like HDFS, it would be good to have 
 YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-307) NodeManager should log container launch command.

2013-05-12 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu resolved YARN-307.
---

Resolution: Invalid

Resolving as invalid.

 NodeManager should log container launch command.
 

 Key: YARN-307
 URL: https://issues.apache.org/jira/browse/YARN-307
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
  Labels: usability

 NodeManager's DefaultContainerExecutor seems to log only the path of the 
 default container executor script instead of the contents of the script. It 
 would be good to log the execution command so that one can see what is being 
 launched.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-546) mapred.fairscheduler.eventlog.enabled removed from Hadoop 2.0

2013-04-09 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-546:
--

Attachment: YARN-546.1.patch

It looks like there is not much logged in the event log, and the majority of it 
seems to be just node updates. If no one votes against having this removed, 
here is a patch for review.

 mapred.fairscheduler.eventlog.enabled removed from Hadoop 2.0
 -

 Key: YARN-546
 URL: https://issues.apache.org/jira/browse/YARN-546
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
 Attachments: YARN-546.1.patch


 Hadoop 1.0 supported an option to turn on/off FairScheduler event logging 
 using mapred.fairscheduler.eventlog.enabled. In Hadoop 2.0, it looks like 
 this option has been removed (or not ported?) which causes event logging to 
 be enabled by default and there is no way to turn it off.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-451) Add more metrics to RM page

2013-04-09 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13627203#comment-13627203
 ] 

Lohit Vijayarenu commented on YARN-451:
---

I tried to see if adding the total number of containers was a trivial change, 
but it looks like there is no notion of an application's maximum resource 
available to the ResourceManager. This might be the reason the RM page does not 
have the information. Looking into RMAppImpl shows that this kind of information 
is not passed from the Client/AM to the RM during application initialization 
either. 

Something close to the notion of job weight that I could see was resource 
demand, but that seems to change based on how an application requests 
containers. For example, the FairScheduler seems to recalculate fair share 
based on how much resource demand is reported by applications. 

One option I can think of is to add an additional protobuf field which specifies 
the total number of containers/resources an application might use. This would be 
an optional field, used only by MapReduce for now, and the Client could set this 
value based on the number of mappers/reducers. I am not sure if this is the 
right approach; are there any other, simpler ideas people can suggest?

 Add more metrics to RM page
 ---

 Key: YARN-451
 URL: https://issues.apache.org/jira/browse/YARN-451
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

 ResourceManager webUI shows a list of RUNNING applications, but it does not 
 tell which applications are requesting more resources compared to others. With 
 a cluster running hundreds of applications at once, it would be useful to have 
 some kind of metric to show high-resource-usage applications vs 
 low-resource-usage ones. At the minimum, showing the number of containers is a 
 good option.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-502) RM crash with NPE on NODE_REMOVED event

2013-03-22 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-502:
-

 Summary: RM crash with NPE on NODE_REMOVED event
 Key: YARN-502
 URL: https://issues.apache.org/jira/browse/YARN-502
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu


While running some tests and adding/removing nodes, we saw the RM crash with the 
exception below. We are testing with the FairScheduler and running 
hadoop-2.0.3-alpha.

{noformat}
2013-03-22 18:54:27,015 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating 
Node :55680 as it is now LOST
2013-03-22 18:54:27,015 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: :55680 
Node Transitioned from UNHEALTHY to LOST
2013-03-22 18:54:27,015 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:619)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:856)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
at java.lang.Thread.run(Thread.java:662)
2013-03-22 18:54:27,016 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2013-03-22 18:54:27,020 INFO org.mortbay.log: Stopped 
SelectChannelConnector@:50030
{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-451) Add more metrics to RM page

2013-03-05 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-451:
-

 Summary: Add more metrics to RM page
 Key: YARN-451
 URL: https://issues.apache.org/jira/browse/YARN-451
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor


ResourceManager webUI shows a list of RUNNING applications, but it does not tell 
which applications are requesting more resources compared to others. With a 
cluster running hundreds of applications at once, it would be useful to have 
some kind of metric to show high-resource-usage applications vs 
low-resource-usage ones. At the minimum, showing the number of containers is a 
good option.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-402) Dispatcher warn message is too late

2013-02-13 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-402:
-

 Summary: Dispatcher warn message is too late
 Key: YARN-402
 URL: https://issues.apache.org/jira/browse/YARN-402
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Lohit Vijayarenu
Priority: Minor


AsyncDispatcher logs a warning when the remaining capacity of the event queue is 
less than 1000:
{noformat}
if (remCapacity < 1000) {
  LOG.warn("Very low remaining capacity in the event-queue: "
      + remCapacity);
}
{noformat}

What would be useful is to warn much earlier than that, maybe when the queue is 
half full rather than when it is nearly exhausted. I see that the eventQueue 
capacity is an int value, so if we only warn when the queue has 1000 slots of 
capacity left, the service definitely already has a serious problem.
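For illustration, a half-full check could look roughly like the sketch below 
(the bounded capacity here is a made-up constant for the example, not the real 
AsyncDispatcher queue size):
{noformat}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Rough sketch of an earlier warning threshold; not the actual dispatcher code. */
public class EarlyQueueWarning {
  // Hypothetical bounded capacity purely for the example.
  private static final int CAPACITY = 1000000;
  private final BlockingQueue<Object> eventQueue =
      new LinkedBlockingQueue<Object>(CAPACITY);

  boolean shouldWarn() {
    // Warn once the queue is half full rather than when only 1000 slots remain.
    return eventQueue.remainingCapacity() < CAPACITY / 2;
  }
}
{noformat}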

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-356) Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS to yarn.env

2013-01-24 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-356:
-

 Summary: Add YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS 
to yarn.env
 Key: YARN-356
 URL: https://issues.apache.org/jira/browse/YARN-356
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Lohit Vijayarenu


At present it is difficult to set different Xmx values for RM and NM without 
having different yarn-env.sh. Like HDFS, it would be good to have 
YARN_NODEMANAGER_OPTS and YARN_RESOURCEMANAGER_OPTS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-351) ResourceManager NPE during allocateNodeLocal

2013-01-22 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-351:
-

 Summary: ResourceManager NPE during allocateNodeLocal
 Key: YARN-351
 URL: https://issues.apache.org/jira/browse/YARN-351
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Lohit Vijayarenu
Priority: Critical


The ResourceManager seems to die due to the NPE shown below in the 
FairScheduler. This is easily reproduced on a cluster with multiple racks and 
nodes within each rack: a simple job with multiple tasks on each node triggers 
the NPE in the RM.

Without understanding the actual workings, I tried adding a null check, which 
looked like it solved the problem, but I am not sure yet if that is the right 
behavior.

I feel this is serious enough to be marked as a blocker; what do you guys think?

{noformat}
2013-01-22 20:07:45,073 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
allocate: applicationId=application_1358885180585_0001 
container=container_1358885180585_0001_01_000830 host=x.x.x.x:36186
2013-01-22 20:07:45,074 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:259)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:220)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp.allocate(FSSchedulerApp.java:544)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:250)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:318)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:180)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:796)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:859)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
at java.lang.Thread.run(Thread.java:662)
2013-01-22 20:07:45,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-351) ResourceManager NPE during allocateNodeLocal

2013-01-22 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu resolved YARN-351.
---

Resolution: Duplicate

Thanks [~sandyr]. It does look like it is solved in YARN-335. I was running a 
build one or two days older than your fix.

 ResourceManager NPE during allocateNodeLocal
 

 Key: YARN-351
 URL: https://issues.apache.org/jira/browse/YARN-351
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Lohit Vijayarenu
Priority: Critical

 The ResourceManager seems to die due to the NPE shown below in the 
 FairScheduler. This is easily reproduced on a cluster with multiple racks and 
 nodes within each rack: a simple job with multiple tasks on each node triggers 
 the NPE in the RM.
 Without understanding the actual workings, I tried adding a null check, which 
 looked like it solved the problem, but I am not sure yet if that is the right 
 behavior.
 I feel this is serious enough to be marked as a blocker; what do you guys think?
 {noformat}
 2013-01-22 20:07:45,073 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
 allocate: applicationId=application_1358885180585_0001 
 container=container_1358885180585_0001_01_000830 host=x.x.x.x:36186
 2013-01-22 20:07:45,074 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
 handling event type NODE_UPDATE to the scheduler
 java.lang.NullPointerException
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:259)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:220)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp.allocate(FSSchedulerApp.java:544)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:250)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:318)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:180)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:796)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:859)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375)
 at java.lang.Thread.run(Thread.java:662)
 2013-01-22 20:07:45,075 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-287) NodeManager logs incorrect physical/virtual memory values

2013-01-15 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu resolved YARN-287.
---

Resolution: Invalid

Thanks for the explanation. Closing as invalid.

 NodeManager logs incorrect physical/virtual memory values
 -

 Key: YARN-287
 URL: https://issues.apache.org/jira/browse/YARN-287
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor

 NodeManager does not log the correct configured physical memory or virtual 
 memory while killing containers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-324) Provide way to preserve container directories

2013-01-08 Thread Lohit Vijayarenu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lohit Vijayarenu updated YARN-324:
--

Summary: Provide way to preserve container directories  (was: Provide way 
to preserve )

 Provide way to preserve container directories
 -

 Key: YARN-324
 URL: https://issues.apache.org/jira/browse/YARN-324
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu

 There should be a way to preserve container directories (along with 
 filecache/appcache) for offline debugging. As of today, when a container 
 completes (either successfully or with a failure), it gets cleaned up. In case 
 of failure it becomes very hard to debug and find out what the cause of the 
 failure is. Having the ability to preserve container directories will enable 
 one to log into the machine and debug failures further. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-307) NodeManager should log container launch command.

2013-01-02 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-307:
-

 Summary: NodeManager should log container launch command.
 Key: YARN-307
 URL: https://issues.apache.org/jira/browse/YARN-307
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu


NodeManager's DefaultContainerExecutor seems to log only the path of the default 
container executor script instead of the contents of the script. It would be 
good to log the execution command so that one can see what is being launched.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-307) NodeManager should log container launch command.

2013-01-02 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13542616#comment-13542616
 ] 

Lohit Vijayarenu commented on YARN-307:
---

For example, I am seeing a container launch failure without any useful message, 
like this:
{noformat}
2013-01-03 00:33:49,045 DEBUG 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's 
health-status : true,
2013-01-03 00:33:49,090 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from task is : 1
2013-01-03 00:33:49,090 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
{noformat}

The script seems to exit with an exit code of 1. To debug further, I wanted to 
see the command being executed, but in the logs I can see only the line shown 
below:

{noformat}
2013-01-03 00:33:46,591 INFO 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: 
launchContainer: [bash, 
/data/disk2/yarn/local/usercache/hadoop/appcache/application_1357147147433_0011/container_1357147147433_0011_01_01/default_container_executor.sh]
{noformat}

Once the task fails, this directory is cleaned up. There seems to be no easy way 
to find out why the container is failing. It would be good to log the contents 
of default_container_executor.sh along with the path.
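As a rough illustration of that idea (not the actual DefaultContainerExecutor 
change), the launcher could read the generated script back and log its body 
next to the path:
{noformat}
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

/** Illustrative only; class and method names are made up for this sketch. */
public class LaunchScriptLogger {
  public static String describe(File script) throws IOException {
    // Return the script path plus its contents, so the launch command is
    // visible in the NM log even after the container directory is cleaned up.
    String body =
        new String(Files.readAllBytes(script.toPath()), Charset.forName("UTF-8"));
    return "launchContainer script " + script.getAbsolutePath() + ":\n" + body;
  }
}
{noformat}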

 NodeManager should log container launch command.
 

 Key: YARN-307
 URL: https://issues.apache.org/jira/browse/YARN-307
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu

 NodeManager's DefaultContainerExecutor seems to log only the path of the 
 default container executor script instead of the contents of the script. It 
 would be good to log the execution command so that one can see what is being 
 launched.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-290) Wrong cluster metrics on RM page

2012-12-21 Thread Lohit Vijayarenu (JIRA)
Lohit Vijayarenu created YARN-290:
-

 Summary: Wrong cluster metrics on RM page
 Key: YARN-290
 URL: https://issues.apache.org/jira/browse/YARN-290
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Lohit Vijayarenu
Priority: Minor


ResourceManager seems to always show a few (1-3) applications in the pending 
state on the ResourceManager webpage under the Cluster Metrics tab, while there 
are no pending applications. It is very easy to replicate: start the RM, submit 
one job, and you will see 2 pending applications, which is incorrect.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira