[jira] [Commented] (YARN-2362) Capacity Scheduler: apps with requests that exceed current capacity can starve pending apps

2014-07-28 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075925#comment-14075925
 ] 

Sunil G commented on YARN-2362:
---

Possible duplicate of YARN-1631

 Capacity Scheduler: apps with requests that exceed current capacity can 
 starve pending apps
 ---

 Key: YARN-2362
 URL: https://issues.apache.org/jira/browse/YARN-2362
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.4.1
Reporter: Ram Venkatesh

 Cluster configuration:
 Total memory: 8GB
 yarn.scheduler.minimum-allocation-mb 256
 yarn.scheduler.capacity.maximum-am-resource-percent 1 (100%, test only config)
 App 1 makes a request for 4.6 GB, succeeds, app transitions to RUNNING state. 
 It subsequently makes a request for 4.6 GB, which cannot be granted and it 
 waits.
 App 2 makes a request for 1 GB, never receives it, and so stays in the 
 ACCEPTED state forever.
 I think this can happen in leaf queues that are near capacity.
 The fix is likely in LeafQueue.java assignContainers near line 861, where it 
 returns if the assignment would exceed queue capacity, instead of checking if 
 requests for other active applications can be met.
 {code:title=LeafQueue.java|borderStyle=solid}
// Check queue max-capacity limit
if (!assignToQueue(clusterResource, required)) {
 -return NULL_ASSIGNMENT;
 +break;
}
 {code}
 With this change, the scenario above allows App 2 to start and finish while 
 App 1 continues to wait.
 I have a patch available, but wondering if the current behavior is by design.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart

2014-07-28 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2209:
--

Attachment: YARN-2209.5.patch

 Replace allocate#resync command with ApplicationMasterNotRegisteredException 
 to indicate AM to re-register on RM restart
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart

2014-07-28 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075930#comment-14075930
 ] 

Jian He commented on YARN-2209:
---

bq. there could possibly be a case of losing the HeartbeatThread if 
responseQueue.add(response) hits an InterruptedException again. Can it be in a 
while loop?
I see, added the while loop back.
bq. can add a note to AM_SHUTDOWN providing a link to 
ApplicationNotFoundException.
Added a description too.
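
For reference, a minimal sketch of the retry loop being discussed, assuming a 
{{BlockingQueue}} named {{responseQueue}} inside the heartbeat thread (names 
are illustrative, not necessarily the exact patch):
{code:title=retry-on-interrupt sketch|borderStyle=solid}
// Keep retrying the enqueue so the heartbeat thread never silently drops a
// response when it is interrupted while waiting on the queue.
while (true) {
  try {
    responseQueue.put(response);   // blocking put; may throw InterruptedException
    break;                         // enqueued successfully, stop retrying
  } catch (InterruptedException e) {
    LOG.info("Interrupted while waiting to put on response queue", e);
  }
}
{code}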

 Replace allocate#resync command with ApplicationMasterNotRegisteredException 
 to indicate AM to re-register on RM restart
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1631) Container allocation issue in Leafqueue assignContainers()

2014-07-28 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-1631:
-

Assignee: Sunil G

 Container allocation issue in Leafqueue assignContainers()
 --

 Key: YARN-1631
 URL: https://issues.apache.org/jira/browse/YARN-1631
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: SuSe 11 Linux 
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-1631.1.patch, Yarn-1631.2.patch


 Application1 has a demand of 8GB [map task size of 8GB], which is more than 
 Node_1 can handle.
 Node_1 has a size of 8GB, and 2GB is used by Application1's AM.
 Hence Application1 reserved the remaining 6GB on Node_1.
 A new job is submitted with a 2GB AM size and a 2GB task size, with only 2 
 maps to run.
 Node_2 also has 8GB capability.
 But Application2's AM cannot be launched on Node_2, and Application2 waits 
 longer as only 2 nodes are available in the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2209:
--

Summary: Replace AM resync/shutdown command with corresponding exceptions  
(was: Replace allocate#resync command with 
ApplicationMasterNotRegisteredException to indicate AM to re-register on RM 
restart)

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2362) Capacity Scheduler: apps with requests that exceed current capacity can starve pending apps

2014-07-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075970#comment-14075970
 ] 

Wangda Tan commented on YARN-2362:
--

I think we should fix this,
{code}
   if (!assignToQueue(clusterResource, required)) {
-return NULL_ASSIGNMENT;
+break;
   }
{code}
The {{return NULL_ASSIGNMENT}} statement means that if an app submitted earlier 
cannot allocate resources in the queue, none of the remaining apps in the queue 
get a chance to allocate resources either.

The {{break}} looks better to me (see the sketch below).
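
For context, a simplified sketch of where this check sits in 
{{LeafQueue#assignContainers}} (illustrative only, not the actual code), 
showing why {{break}} lets later applications be considered while {{return 
NULL_ASSIGNMENT}} gives up on the node for everyone:
{code:title=assignContainers sketch (simplified)|borderStyle=solid}
for (FiCaSchedulerApp application : activeApplications) {
  for (Priority priority : application.getPriorities()) {
    Resource required = ...; // capability requested at this priority (elided)

    // Check queue max-capacity limit
    if (!assignToQueue(clusterResource, required)) {
      // return NULL_ASSIGNMENT;  // current: stops allocation for ALL later apps
      break;                      // proposed: skip only this app's oversized ask
    }
    // ... user-limit checks, then allocate or reserve a container ...
  }
}
{code}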

And I agree this should be a duplicate of YARN-1631

 Capacity Scheduler: apps with requests that exceed current capacity can 
 starve pending apps
 ---

 Key: YARN-2362
 URL: https://issues.apache.org/jira/browse/YARN-2362
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.4.1
Reporter: Ram Venkatesh

 Cluster configuration:
 Total memory: 8GB
 yarn.scheduler.minimum-allocation-mb 256
 yarn.scheduler.capacity.maximum-am-resource-percent 1 (100%, test only config)
 App 1 makes a request for 4.6 GB, succeeds, app transitions to RUNNING state. 
 It subsequently makes a request for 4.6 GB, which cannot be granted and it 
 waits.
 App 2 makes a request for 1 GB, never receives it, and so stays in the 
 ACCEPTED state forever.
 I think this can happen in leaf queues that are near capacity.
 The fix is likely in LeafQueue.java assignContainers near line 861, where it 
 returns if the assignment would exceed queue capacity, instead of checking if 
 requests for other active applications can be met.
 {code:title=LeafQueue.java|borderStyle=solid}
// Check queue max-capacity limit
if (!assignToQueue(clusterResource, required)) {
 -return NULL_ASSIGNMENT;
 +break;
}
 {code}
 With this change, the scenario above allows App 2 to start and finish while 
 App 1 continues to wait.
 I have a patch available, but wondering if the current behavior is by design.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075973#comment-14075973
 ] 

Rohith commented on YARN-2209:
--

+1 patch looks good to me

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075984#comment-14075984
 ] 

Hadoop QA commented on YARN-2209:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658088/YARN-2209.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4454//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4454//console

This message is automatically generated.

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1631) Container allocation issue in Leafqueue assignContainers()

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076006#comment-14076006
 ] 

Hadoop QA commented on YARN-1631:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12625843/Yarn-1631.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4455//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4455//console

This message is automatically generated.

 Container allocation issue in Leafqueue assignContainers()
 --

 Key: YARN-1631
 URL: https://issues.apache.org/jira/browse/YARN-1631
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: SuSe 11 Linux 
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-1631.1.patch, Yarn-1631.2.patch


 Application1 has a demand of 8GB [map task size of 8GB], which is more than 
 Node_1 can handle.
 Node_1 has a size of 8GB, and 2GB is used by Application1's AM.
 Hence Application1 reserved the remaining 6GB on Node_1.
 A new job is submitted with a 2GB AM size and a 2GB task size, with only 2 
 maps to run.
 Node_2 also has 8GB capability.
 But Application2's AM cannot be launched on Node_2, and Application2 waits 
 longer as only 2 nodes are available in the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076042#comment-14076042
 ] 

Wangda Tan commented on YARN-1707:
--

Thanks for uploading the patch [~curino], [~subru]. These are great additions 
to the current CapacityScheduler. I took a look at your patch.

*First, I have a couple of questions about its background, especially 
{{PlanQueue}}/{{ReservationQueue}} in this patch. I think understanding the 
background is important for me to get the whole picture of this patch. What I 
understand is:*
# {{PlanQueue}} can have a normal {{ParentQueue}} as its parent, but the 
children of a {{PlanQueue}} can only be {{ReservationQueue}}s. Is it possible 
for multiple {{PlanQueue}}s to exist in the cluster?
# {{PlanQueue}} is initially set up in configuration; like {{ParentQueue}}, it 
has an absolute capacity, etc. But unlike {{ParentQueue}}, it also has 
user-limit/user-limit-factor, etc.
# {{ReservationQueue}} is dynamically initialized by the {{PlanFollower}}: when 
a new reservationId is acquired, it creates a new {{ReservationQueue}} 
accordingly.
# {{PlanFollower}} can dynamically adjust the queue size of 
{{ReservationQueue}}s so that resource reservations can be satisfied.
# Is it possible that the sum of reserved resources exceeds the limit of the 
{{PlanQueue}}/{{ReservationQueue}} and preemption is triggered?
# How do we deal with RM restart? The RM may restart during a resource 
reservation, so we may need to consider how to persist such queues.

I hope you can share your thoughts on these.

*For the requirements of this ticket (copied from the JIRA):*
# create queues dynamically
# destroy queues dynamically
# dynamically change queue parameters (e.g., capacity)
# modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
instead of == 100%
# move apps across queues

I found that #1-#3 are used only by {{PlanQueue}}/{{ReservationQueue}}. IMHO, 
it would be better to add them to the CapacityScheduler without coupling them 
to the ReservationSystem, but I cannot think of other solid scenarios that 
could leverage them. I hope to get feedback from the community before we couple 
them with the ReservationSystem. And as mentioned by [~acmurthy], can we merge 
adding a queue into the existing add-new-queue mechanism?
#4 should only be valid in {{PlanQueue}}. If we change this behavior in 
{{ParentQueue}}, a careless admin could mis-set the capacities of queues under 
a parent queue; if the sum of their capacities does not equal 100%, some 
resources may never be usable by applications.

*Some other comments (mostly about moving apps, because we may need to consider 
the scope of create/destroy queues first):*
1) I think we need to consider how moving apps across queues works with 
YARN-1368. We can change the queue of containers from queueA to queueB, but 
with YARN-1368, during RM restart a container will report that it is in queueA 
(we don't sync the move to the NM when doing the moveApp operation). I hope 
[~jianhe] could share some thoughts about this as well.
2) Moving an application in the CapacityScheduler needs to call 
finishApplication on the source queue and submitApplication on the target queue 
to keep QueueMetrics correct. submitApplication will check the ACL of the 
target queue as well (see the sketch after these comments).
3) Should we respect MaxApplicationsPerUser in the target queue when trying to 
move an app? IMHO, we can stop the move if MaxApplicationsPerUser has been 
reached in the target queue.
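
A rough sketch of the move flow described in 2), with hypothetical method names 
(not the actual patch or the current CapacityScheduler API):
{code:title=move-app sketch (hypothetical)|borderStyle=solid}
// Move an app from its source leaf queue to a target leaf queue while keeping
// QueueMetrics consistent: register with the target first (which also runs the
// ACL and max-applications checks), then release the app from the source.
void moveApplication(ApplicationId appId, String user,
    LeafQueue source, LeafQueue target) throws AccessControlException {
  // hypothetical: rejects the move if the user lacks SUBMIT_APPLICATIONS on
  // the target queue or its max-applications(-per-user) limit is reached
  target.submitApplication(appId, user, target.getQueueName());

  // hypothetical: decrements the source queue's application count and metrics
  source.finishApplication(appId, user);
}
{code}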

Thanks,
Wangda

 Making the CapacityScheduler more dynamic
 -

 Key: YARN-1707
 URL: https://issues.apache.org/jira/browse/YARN-1707
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler
 Attachments: YARN-1707.patch


 The CapacityScheduler is rather static at the moment, and refreshqueue 
 provides a rather heavy-handed way to reconfigure it. Moving towards 
 long-running services (tracked in YARN-896) and to enable more advanced 
 admission control and resource parcelling, we need to make the 
 CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA 
 YARN-1051.
 Concretely this requires the following changes:
 * create queues dynamically
 * destroy queues dynamically
 * dynamically change queue parameters (e.g., capacity)
 * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
 instead of == 100%
 We limit this to LeafQueues.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076048#comment-14076048
 ] 

Junping Du commented on YARN-2209:
--

Thanks [~jianhe] for the patch and [~rohithsharma] for the review! I think this 
is a reasonable change, and the patch itself looks good to me. However, I have 
a concern that it could break existing YARN applications that run with the old 
version of ApplicationMasterProtocol, which expects a RESYNC command rather 
than an exception in the response. More discussion with the broader community 
is needed, I think.

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2287) Add audit log levels for NM and RM

2014-07-28 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2287:
---

Attachment: YARN-2287-patch-1.patch

 Add audit log levels for NM and RM
 --

 Key: YARN-2287
 URL: https://issues.apache.org/jira/browse/YARN-2287
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Affects Versions: 2.4.1
Reporter: Varun Saxena
 Attachments: YARN-2287-patch-1.patch, YARN-2287.patch


 NM and RM audit logging can be done based on log level, as some of the audit 
 logs, especially the container audit logs, appear too many times. By 
 introducing log levels, certain audit logs can be suppressed if they are not 
 required in a deployment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1291) RM INFO logs limit scheduling speed

2014-07-28 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076059#comment-14076059
 ] 

Varun Saxena commented on YARN-1291:


Hi [~sandyr], I had raised YARN-2287, which is also about too many RM audit 
logs being printed in the critical flow. In that patch, I added support for 
printing audit logs at different log levels and changed the container logs in 
the RM and NM to DEBUG. I didn't remove the audit logs, as I wasn't sure 
whether these audit logs are really required or not.
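
For illustration, a minimal sketch of the kind of guard described above 
(hypothetical method and message format; the actual YARN-2287 patch may differ):
{code:title=audit log level sketch (hypothetical)|borderStyle=solid}
// Uses org.apache.commons.logging.{Log, LogFactory}.
// Gate the high-volume container audit entries behind DEBUG so deployments can
// suppress them via log4j configuration without losing the other audit logs.
private static final Log AUDITLOG = LogFactory.getLog(RMAuditLogger.class);

static void logContainerSuccess(String user, String operation, String target) {
  if (AUDITLOG.isDebugEnabled()) {
    AUDITLOG.debug("USER=" + user + "\tOPERATION=" + operation
        + "\tTARGET=" + target + "\tRESULT=SUCCESS");
  }
}
{code}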

 RM INFO logs limit scheduling speed
 ---

 Key: YARN-1291
 URL: https://issues.apache.org/jira/browse/YARN-1291
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 I've been running some microbenchmarks to see how fast the Fair Scheduler can 
 fill up a cluster and found its performance is significantly hampered by 
 logging.
 I tested with 500 (mock) nodes, and found that:
 * Taking out fair scheduler INFO logs on the critical path brought down the 
 latency from 14000 ms to 6000 ms
 * Taking out the INFO that RMContainerImpl logs when a container transitions 
 brought it down from 6000 ms to 4000 ms
 * Taking out RMAuditLogger logs brought it down from 4000 ms to 1700 ms



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure

2014-07-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076062#comment-14076062
 ] 

Wangda Tan commented on YARN-2008:
--

Hi Craig,
As we discussed in YARN-1198, I think we should consider the resources used by 
a queue's siblings when computing headroom. I took a look at your patch again; 
some comments:

We first need to think about how to calculate headroom in general. I think the 
headroom is (concluded from the sub-JIRAs of YARN-1198):
{code}
queue_available = min(clusterResource - used_by_sibling_of_parents - used_by_this_queue,
                      queue_max_resource)
headroom = min(queue_available - available_resource_in_blacklisted_nodes,
               user_limit)
{code}
So I think this JIRA focuses on computing {{used_by_sibling_of_parents}}, is 
that right?

The general approach looks good to me, except for a couple of points in 
CSQueueUtils.java (I will include a review of the tests in the next iteration):
1)
{code}
  // sibling used is parent used - my used...
  float siblingUsedCapacity = Resources.ratio(
      resourceCalculator,
      Resources.subtract(parent.getUsedResources(), queue.getUsedResources()),
      parentResource);
{code}
It seems to me this computation is not robust enough when the parent resource 
is empty, whether because it is a zero-capacity queue or because its siblings 
have used 100% of the cluster. It's better to add an edge-case test to guard 
against such zero-division as well.

2)
It's better to explicitly cap the returned {{absoluteMaxAvail}} to the range 
\[0~1\] to guard against floating-point errors (see the sketch below).
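
A minimal sketch of the two guards suggested above, reusing the existing 
{{Resources}}/{{ResourceCalculator}} utilities (a hypothetical helper, not the 
actual patch):
{code:title=guarded ratio sketch (hypothetical)|borderStyle=solid}
// Assumes org.apache.hadoop.yarn.api.records.Resource and
// org.apache.hadoop.yarn.util.resource.{Resources, ResourceCalculator}.
// Fraction of the parent's resource used by the queue's siblings, guarding
// against an empty parent resource and clamping the result to [0, 1].
static float siblingUsedCapacity(ResourceCalculator rc, Resource parentUsed,
    Resource queueUsed, Resource parentResource) {
  if (parentResource == null || parentResource.getMemory() <= 0) {
    return 0f;  // zero-capacity parent: nothing meaningful to divide by
  }
  float ratio = Resources.ratio(rc,
      Resources.subtract(parentUsed, queueUsed), parentResource);
  return Math.max(0f, Math.min(1f, ratio));  // cap away float drift
}
{code}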

Thanks,
Wangda

 CapacityScheduler may report incorrect queueMaxCap if there is hierarchy 
 queue structure 
 -

 Key: YARN-2008
 URL: https://issues.apache.org/jira/browse/YARN-2008
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.3.0
Reporter: Chen He
Assignee: Craig Welch
 Attachments: YARN-2008.1.patch, YARN-2008.2.patch


 Suppose there are two queues, both allowed to use 100% of the actual resources 
 in the cluster. Q1 and Q2 each currently use 50% of the cluster's resources, 
 so there is no actual space available. With the current method of computing 
 headroom, the CapacityScheduler thinks there are still resources available for 
 users in Q1, but they have already been used by Q2.
 If the CapacityScheduler has a hierarchical queue structure, it may report an 
 incorrect queueMaxCap. Here is an example:
 rootQueue
   |- L1ParentQueue1 (allowed to use up to 80% of its parent)
   |    |- L2LeafQueue1 (50% of its parent)
   |    |- L2LeafQueue2 (50% of its parent in minimum)
   |- L1ParentQueue2 (allowed to use 20% in minimum of its parent)
 When we calculate the headroom of a user in L2LeafQueue2, the current method 
 thinks L2LeafQueue2 can use 40% (80%*50%) of the actual rootQueue resources. 
 However, without also checking L1ParentQueue2, we cannot be sure. It is 
 possible that L1ParentQueue2 has used 40% of the rootQueue resources right 
 now; in that case, L2LeafQueue2 can actually only use 30% (60%*50%).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-321) Generic application history service

2014-07-28 Thread Patrick Morton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076080#comment-14076080
 ] 

Patrick Morton commented on YARN-321:
-

Compared to wrists well is less available status amphetamines but higher 
investigations of withdrawal. 
adderall 20 mg 
http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787428-29851520-stopadd3.html
 
Areas also document any reasons they have surprisingly been using in the 
information.

 Generic application history service
 ---

 Key: YARN-321
 URL: https://issues.apache.org/jira/browse/YARN-321
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Luke Lu
 Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
 Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java


 The mapreduce job history server currently needs to be deployed as a trusted 
 server in sync with the mapreduce runtime. Every new application would need a 
 similar application history server. Having to deploy O(T*V) (where T is the 
 number of application types and V is the number of application versions) 
 trusted servers is clearly not scalable.
 Job history storage handling itself is pretty generic: move the logs and 
 history data into a particular directory for later serving. Job history data 
 is already stored as json (or binary avro). I propose that we create only one 
 trusted application history server, which can have a generic UI (display json 
 as a tree of strings) as well. Specific application/version can deploy 
 untrusted webapps (a la AMs) to query the application history server and 
 interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens

2014-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076142#comment-14076142
 ] 

Hudson commented on YARN-2247:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #626 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/626/])
YARN-2247. Made RM web services authenticate users via kerberos and delegation 
token. Contributed by Varun Vasudev. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613821)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMAuthenticationHandler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebappAuthentication.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm


 Allow RM web services users to authenticate using delegation tokens
 ---

 Key: YARN-2247
 URL: https://issues.apache.org/jira/browse/YARN-2247
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Priority: Blocker
 Fix For: 2.5.0

 Attachments: YARN-2247.6.patch, apache-yarn-2247.0.patch, 
 apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, 
 apache-yarn-2247.4.patch, apache-yarn-2247.5.patch


 The RM webapp should allow users to authenticate using delegation tokens to 
 maintain parity with RPC.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1994:
--

Attachment: YARN-1994.11.patch

Fixed a bug: YarnConfiguration.getSocketAddr checks, in HA cases, which RM it 
is on; this was no longer active in earlier versions of the patch. Simplified 
the logic, removed many unnecessary changes from earlier patch versions, and 
added some tests. With this patch the behavior should be: in the absence of any 
bind-host, act as before; in the presence of a bind-host, and only for the 
listening process, the port is retrieved from the configured address and used 
together with the bind-host to bind. All other address/configuration paths 
should be unchanged by the patch.
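
A minimal sketch of the described resolution order (illustrative only; the 
bind-host key name is an assumption about what this patch introduces):
{code:title=bind-host sketch (illustrative)|borderStyle=solid}
// Uses org.apache.hadoop.conf.Configuration, org.apache.hadoop.yarn.conf.YarnConfiguration
// and java.net.InetSocketAddress. The advertised address keeps its configured
// host and port; the listening socket binds to bind-host (if set) with the
// port taken from that address.
Configuration conf = new YarnConfiguration();
InetSocketAddress advertised = conf.getSocketAddr(
    YarnConfiguration.RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_ADDRESS,
    YarnConfiguration.DEFAULT_RM_PORT);

String bindHost = conf.getTrimmed("yarn.resourcemanager.bind-host");  // assumed key
InetSocketAddress bindTo = (bindHost == null || bindHost.isEmpty())
    ? advertised                                              // no bind-host: as before
    : new InetSocketAddress(bindHost, advertised.getPort());  // bind-host + existing port
{code}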

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, 
 YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens

2014-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076206#comment-14076206
 ] 

Hudson commented on YARN-2247:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1845 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1845/])
YARN-2247. Made RM web services authenticate users via kerberos and delegation 
token. Contributed by Varun Vasudev. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613821)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMAuthenticationHandler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebappAuthentication.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm


 Allow RM web services users to authenticate using delegation tokens
 ---

 Key: YARN-2247
 URL: https://issues.apache.org/jira/browse/YARN-2247
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Priority: Blocker
 Fix For: 2.5.0

 Attachments: YARN-2247.6.patch, apache-yarn-2247.0.patch, 
 apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, 
 apache-yarn-2247.4.patch, apache-yarn-2247.5.patch


 The RM webapp should allow users to authenticate using delegation tokens to 
 maintain parity with RPC.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076215#comment-14076215
 ] 

Hadoop QA commented on YARN-1994:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658123/YARN-1994.11.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4456//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4456//console

This message is automatically generated.

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, 
 YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens

2014-07-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076226#comment-14076226
 ] 

Hudson commented on YARN-2247:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1818 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1818/])
YARN-2247. Made RM web services authenticate users via kerberos and delegation 
token. Contributed by Varun Vasudev. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613821)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMAuthenticationHandler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebappAuthentication.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm


 Allow RM web services users to authenticate using delegation tokens
 ---

 Key: YARN-2247
 URL: https://issues.apache.org/jira/browse/YARN-2247
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Priority: Blocker
 Fix For: 2.5.0

 Attachments: YARN-2247.6.patch, apache-yarn-2247.0.patch, 
 apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, 
 apache-yarn-2247.4.patch, apache-yarn-2247.5.patch


 The RM webapp should allow users to authenticate using delegation tokens to 
 maintain parity with RPC.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (YARN-321) Generic application history service

2014-07-28 Thread Jake Farrell (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Farrell updated YARN-321:
--

Comment: was deleted

(was: Compared to wrists well is less available status amphetamines but higher 
investigations of withdrawal. 
adderall 20 mg 
http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787428-29851520-stopadd3.html
 
Areas also document any reasons they have surprisingly been using in the 
information.)

 Generic application history service
 ---

 Key: YARN-321
 URL: https://issues.apache.org/jira/browse/YARN-321
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Luke Lu
 Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
 Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java


 The mapreduce job history server currently needs to be deployed as a trusted 
 server in sync with the mapreduce runtime. Every new application would need a 
 similar application history server. Having to deploy O(T*V) (where T is the 
 number of application types and V is the number of application versions) 
 trusted servers is clearly not scalable.
 Job history storage handling itself is pretty generic: move the logs and 
 history data into a particular directory for later serving. Job history data 
 is already stored as json (or binary avro). I propose that we create only one 
 trusted application history server, which can have a generic UI (display json 
 as a tree of strings) as well. Specific application/version can deploy 
 untrusted webapps (a la AMs) to query the application history server and 
 interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2363) Submitted applications occasionally lack a tracking URL

2014-07-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-2363:


 Summary: Submitted applications occasionally lack a tracking URL
 Key: YARN-2363
 URL: https://issues.apache.org/jira/browse/YARN-2363
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe


Sometimes when an application is submitted the client receives no tracking URL. 
 More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL

2014-07-28 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076292#comment-14076292
 ] 

Jason Lowe commented on YARN-2363:
--

Most application submits result in a proxy tracking URL, but occasionally the 
client sees a transient N/A URL.  Here's a snippet of Pig client output where 
a MapReduce job was submitted with no tracking URL received:

{noformat}
2014-07-23 19:19:16,658 [JobControl] INFO 
org.apache.hadoop.mapred.ResourceMgrDelegate - Submitted application 
application_1403199204249_357708 to ResourceManager at
xx/xx:xx
2014-07-23 19:19:16,660 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - 
The url to track the job: N/A
{noformat}

I believe this can occur if the client tries to get an application report just 
as the app is submitted.  YarnClientImpl.submitApplication won't return until 
the app is past the NEW_SAVING state, but if the client slips in while the app 
is in the SUBMITTED state then I think we could end up with no tracking URL due 
to the lack of a current attempt.  From RMAppImpl.createAndGetApplicationReport:

{code}
  String trackingUrl = UNAVAILABLE;
  String host = UNAVAILABLE;
  String origTrackingUrl = UNAVAILABLE;
[...]
  if (allowAccess) {
if (this.currentAttempt != null) {
  currentApplicationAttemptId = this.currentAttempt.getAppAttemptId();
  trackingUrl = this.currentAttempt.getTrackingUrl();
  origTrackingUrl = this.currentAttempt.getOriginalTrackingUrl();
{code}

So if we don't have a current attempt, we'll return N/A as the tracking URL.  
Arguably we should return the proxied URL, which redirects to the RM app page 
when no tracking URL has been set yet, so that the client/user at least has a 
URL that can be used to track the application.
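
A minimal sketch of that fallback (illustrative only, not the actual fix; the 
proxy host/port lookup is assumed):
{code:title=tracking URL fallback sketch (illustrative)|borderStyle=solid}
// In createAndGetApplicationReport: if no attempt exists yet, hand out the
// proxy URL for the app instead of "N/A"; the proxy redirects to the RM app
// page until the AM registers a real tracking URL.
if (this.currentAttempt != null) {
  currentApplicationAttemptId = this.currentAttempt.getAppAttemptId();
  trackingUrl = this.currentAttempt.getTrackingUrl();
  origTrackingUrl = this.currentAttempt.getOriginalTrackingUrl();
} else {
  // proxyHostAndPort: assumed lookup of the web proxy's host:port from config
  trackingUrl = "http://" + proxyHostAndPort + "/proxy/"
      + this.applicationId + "/";
}
{code}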


 Submitted applications occasionally lack a tracking URL
 ---

 Key: YARN-2363
 URL: https://issues.apache.org/jira/browse/YARN-2363
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe

 Sometimes when an application is submitted the client receives no tracking 
 URL.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart

2014-07-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076296#comment-14076296
 ] 

Junping Du commented on YARN-1354:
--

Thanks [~jlowe] for updating the patch! A few quick comments so far:
{code}
+    try {
+      this.context.getNMStateStore().finishApplication(appID);
+    } catch (IOException e) {
+      LOG.error("Unable to update application state in store", e);
+    }
{code}
It looks like we only log when the persistence attempt fails, as we did for 
other components before. In this case, what would happen if storeApplication(), 
finishApplication(), or removeApplication() failed and the application-related 
information became inconsistent after a restart?

In ContainerManagerImpl.java
{code}
+  private void recoverApplication(ContainerManagerApplicationProto p)
+      throws IOException {
+    ApplicationId appId = new ApplicationIdPBImpl(p.getId());
+    Credentials creds = new Credentials();
+    creds.readTokenStorageStream(
+        new DataInputStream(p.getCredentials().newInput()));
  ...
{code}
Do we need a special warning if deserializing the credentials fails here, i.e. 
something that mentions a possible version mismatch, etc.? That could happen if 
the Credentials object, which is a Writable, changes in the future.

More comments will come later.
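
For illustration, a sketch of the kind of wrapper being suggested (not part of 
the patch; the message wording is an assumption):
{code:title=credential-recovery sketch (illustrative)|borderStyle=solid}
// Wrap the deserialization so a corrupt or incompatible state-store entry
// produces an actionable message instead of a bare parse error.
Credentials creds = new Credentials();
try {
  creds.readTokenStorageStream(
      new DataInputStream(p.getCredentials().newInput()));
} catch (IOException e) {
  throw new IOException("Unable to parse credentials for application " + appId
      + "; the recovered state may come from an incompatible NM version", e);
}
{code}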

 Recover applications upon nodemanager restart
 -

 Key: YARN-1354
 URL: https://issues.apache.org/jira/browse/YARN-1354
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1354-v1.patch, 
 YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, 
 YARN-1354-v4.patch, YARN-1354-v5.patch


 The set of active applications in the nodemanager context needs to be 
 recovered for work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-1994:
--

Attachment: YARN-1994.11.patch

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, 
 YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076319#comment-14076319
 ] 

Craig Welch commented on YARN-1994:
---

TestAMRestart passes on my box; reattached the patch to try again on Jenkins.

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, 
 YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076369#comment-14076369
 ] 

Jian He commented on YARN-2209:
---

Hi [~djp], thanks for the comment. I think users are expected to handle two 
types of exceptions, YarnException and IOException. In that sense, this is 
equivalent to throwing a new type of exception, which should be fine?

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2

2014-07-28 Thread Matt Foley (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated YARN-2357:
-

Target Version/s: 2.6.0

 Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 
 changes to branch-2
 --

 Key: YARN-2357
 URL: https://issues.apache.org/jira/browse/YARN-2357
 Project: Hadoop YARN
  Issue Type: Task
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Remus Rusanu
Assignee: Remus Rusanu
Priority: Critical
  Labels: security, windows
 Attachments: YARN-2357.1.patch


 As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to 
 trunk, they need to be backported to branch-2



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076405#comment-14076405
 ] 

Hadoop QA commented on YARN-1994:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658153/YARN-1994.11.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4457//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4457//console

This message is automatically generated.

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, 
 YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts

2014-07-28 Thread Li Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076439#comment-14076439
 ] 

Li Lu commented on YARN-2354:
-

Same error message as YARN-2295, and I could not reproduce it locally. It seems 
to be connected with the network settings of the server, causing the following 
lines to fail:
{code}
  if (appReport.getHost().startsWith(hostName)
      && appReport.getRpcPort() == -1) {
    verified = true;
  }
{code}
If this check fails, verified will never be set to true, hence the test will 
fail. This failure appears to be unrelated to the problem fixed by this patch. 

 DistributedShell may allocate more containers than client specified after it 
 restarts
 -

 Key: YARN-2354
 URL: https://issues.apache.org/jira/browse/YARN-2354
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Li Lu
 Attachments: YARN-2354-072514.patch


 To reproduce, run distributed shell with the -num_containers option.
 In ApplicationMaster.java, the following code has an issue:
 {code}
   int numTotalContainersToRequest =
 numTotalContainers - previousAMRunningContainers.size();
 for (int i = 0; i < numTotalContainersToRequest; ++i) {
   ContainerRequest containerAsk = setupContainerAskForRM();
   amRMClient.addContainerRequest(containerAsk);
 }
 numRequestedContainers.set(numTotalContainersToRequest);
 {code}
 numRequestedContainers doesn't account for the previous AM's requested 
 containers, so numRequestedContainers should be set to numTotalContainers.
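A minimal sketch of the counter fix described above; the class and method names are illustrative placeholders, not the actual DistributedShell code, and only the counter handling mirrors the snippet:
{code}
import java.util.concurrent.atomic.AtomicInteger;

class ContainerAccountingSketch {
  private final AtomicInteger numRequestedContainers = new AtomicInteger();

  // Ask only for the delta, but record the full total so later release and
  // re-request logic also counts containers inherited from the previous
  // AM attempt.
  int containersToRequest(int numTotalContainers, int previousAMRunningContainers) {
    int numTotalContainersToRequest =
        numTotalContainers - previousAMRunningContainers;
    numRequestedContainers.set(numTotalContainers);
    return numTotalContainersToRequest;
  }
}
{code}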



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-07-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1769:
--

Attachment: YARN-1769.patch

Reduce log output when LeafQueue needs to unreserve resources frequently:

if (needToUnreserve) {
+  if (LOG.isDebugEnabled()) {
    LOG.info("we needed to unreserve to be able to allocate");
+  }
  return false;
}



 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact that there might not currently be enough 
 space available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Anytime it hits the limit on the number reserved, it stops 
 looking at any other nodes. This results in potentially missing nodes that 
 have enough space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations, you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076498#comment-14076498
 ] 

Zhijie Shen commented on YARN-2209:
---

[~jianhe], thanks for the patch. Below are some meta comments on this issue.

Why is it necessary to use the exception instead of the flag to indicate RM 
restarting? In general, I'm afraid the changes here break the backward 
compatibility between YARN and MR in both directions. On one side, any YARN 
application that used to have logic to deal with RM restarting needs to be 
updated after this patch. For example, MR of prior versions will no longer work 
properly with a YARN cluster during RM restarting after this patch. The MR job 
won't recognize the not-found exception and take the necessary restarting 
treatment, but will just record the error and move on.

On the other side, if we assume it is possible that a new-version MR job built 
after this patch is going to run on an old YARN cluster, the MR job will then 
not recognize the old flag-style restarting signal, and thus will not execute 
the MR-side logic to deal with RM restarting. IMHO, at least, the switch block 
that checks the AMCommand cannot be removed; it should be deprecated instead, 
for compatibility considerations.

In case we want to proceed with this change, here are some comments on the patch:

1. The MR-side change is not trivial. Following our earlier convention, shall we 
split the patch into two pieces, one for YARN and the other for MR, so that we 
can easily track the changes for the different projects?

2. Why not throw ApplicationAttemptNotFoundException instead? It sounds more 
reasonable here, doesn't it?

3. Deprecate the enum type instead of each enum value?
{code}
 @Public
 @Unstable
 public enum AMCommand {
{code}

4. The description doesn't sound accurate enough. It doesn't just request 
containers. "App Master heartbeat"?
{code}
+public static final String AM_ALLOCATE = "App Master request containers";
{code}

5. No need to break it into two lines, right?
{code}
 AllocateResponse allocateResponse;
…
+allocateResponse = scheduler.allocate(allocateRequest);
{code}

6.  Is this change necessary?
{code}
-return allocate(progressIndicator);
+allocateResponse = allocate(progressIndicator);
+return allocateResponse;
{code}

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-07-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1769:
--

Attachment: (was: YARN-1769.patch)

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact that there might not currently be enough 
 space available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Anytime it hits the limit on the number reserved, it stops 
 looking at any other nodes. This results in potentially missing nodes that 
 have enough space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations, you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-28 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-415:


Attachment: YARN-415.201407281816.txt

[~leftnoteasy]
Thanks for all of your help. 
How were you thinking an end-to-end test would work in the UT environment? In 
order to set a baseline and test that the containers ran for some predetermined 
and expected amount of time, wouldn't I need to somehow control the clock? Do 
you have any ideas on how to implement that?
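For reference, one common approach (a generic sketch here, not a specific YARN test utility) is to inject a manually advanced clock into the code under test so that container lifetimes become deterministic:
{code}
// Hypothetical manual clock for deterministic timing in unit tests.
class ManualClock {
  private long timeMs;

  ManualClock(long startMs) {
    this.timeMs = startMs;
  }

  long getTime() {
    return timeMs;
  }

  // Tests call this to simulate the passage of time.
  void advance(long deltaMs) {
    timeMs += deltaMs;
  }
}
{code}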

In the meantime, I have made the additional changes you suggested. Please see 
below:

{quote}
bq. I was able to remove the rmApps variable, but I had to leave the check for 
app != null because if I try to take that out, several unit tests would fail 
with NullPointerException. Even with removing the rmApps variable, I needed to 
change TestRMContainerImpl.java to mock rmContext.getRMApps().

I would like to suggest to fix such UTs instead of inserting some kernel code 
to make UT pass. I'm not sure about the effort of doing this, if the effort is 
still reasonable, we should do it.
{quote}
After some spy and mock magic, I was able to fix the unit tests so that the 
checks for app != null were not necessary.

{quote}
{code}
 ApplicationCLI.java
+  appReportStr.print("\tResources used : ");
{code}
We need change it to "Resource Utilization" as well?
{quote}
Yes. I changed it to that.


 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.
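A minimal sketch of the MB-seconds aggregation described above (the ContainerUsage type and field names are illustrative, not YARN APIs):
{code}
import java.util.List;

class ChargebackSketch {
  static class ContainerUsage {
    final long reservedMemoryMb;
    final long lifetimeSeconds;

    ContainerUsage(long reservedMemoryMb, long lifetimeSeconds) {
      this.reservedMemoryMb = reservedMemoryMb;
      this.lifetimeSeconds = lifetimeSeconds;
    }
  }

  // Sum of (reserved RAM * lifetime) over all containers of the application.
  static long memoryMbSeconds(List<ContainerUsage> containers) {
    long total = 0;
    for (ContainerUsage c : containers) {
      total += c.reservedMemoryMb * c.lifetimeSeconds;
    }
    return total;
  }
}
{code}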



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-07-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-1769:
--

Attachment: YARN-1769.patch

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact that there might not currently be enough 
 space available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Anytime it hits the limit on the number reserved, it stops 
 looking at any other nodes. This results in potentially missing nodes that 
 have enough space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations, you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076607#comment-14076607
 ] 

Hadoop QA commented on YARN-1769:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658198/YARN-1769.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4458//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4458//console

This message is automatically generated.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact that there might not currently be enough 
 space available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Anytime it hits the limit on the number reserved, it stops 
 looking at any other nodes. This results in potentially missing nodes that 
 have enough space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations, you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076679#comment-14076679
 ] 

Hadoop QA commented on YARN-415:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12658211/YARN-415.201407281816.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
  org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA
  org.apache.hadoop.yarn.client.TestRMFailover
  org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
  org.apache.hadoop.yarn.client.api.impl.TestNMClient
  org.apache.hadoop.yarn.client.TestGetGroups
  
org.apache.hadoop.yarn.client.TestResourceManagerAdministrationProtocolPBClientImpl
  
org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
  org.apache.hadoop.yarn.client.cli.TestYarnCLI
  org.apache.hadoop.yarn.client.api.impl.TestYarnClient
  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication
  
org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs
  
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
  
org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
  org.apache.hadoop.yarn.server.resourcemanager.TestRMHA
  
org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4459//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4459//console

This message is automatically generated.

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if 

[jira] [Created] (YARN-2364) TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy

2014-07-28 Thread Mit Desai (JIRA)
Mit Desai created YARN-2364:
---

 Summary: TestRMRestart#testRMRestartWaitForPreviousAMToFinish is 
racy
 Key: YARN-2364
 URL: https://issues.apache.org/jira/browse/YARN-2364
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Mit Desai


TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy. It fails 
intermittently on branch-2 with the following errors.

Fails with any of these
{noformat}
Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 26.836 sec <<< 
FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
  Time elapsed: 26.687 sec  <<< FAILURE!
java.lang.AssertionError: expected:<4> but was:<3>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:557)
{noformat}

or

{noformat}
Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 51.326 sec <<< 
FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
  Time elapsed: 51.055 sec  <<< FAILURE!
java.lang.AssertionError: AppAttempt state is not correct (timedout) 
expected:<ALLOCATED> but was:<SCHEDULED>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:949)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:519)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076769#comment-14076769
 ] 

Hadoop QA commented on YARN-1769:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658215/YARN-1769.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4460//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4460//console

This message is automatically generated.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact that there might not currently be enough 
 space available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required, and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers). Anytime it hits the limit on the number reserved, it stops 
 looking at any other nodes. This results in potentially missing nodes that 
 have enough space to fulfill the request.
 The other place for improvement is that currently reservations count against 
 your queue capacity. If you have reservations, you could hit the various 
 limits, which would then stop you from looking further at that node.
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2365) TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry fails on branch-2

2014-07-28 Thread Mit Desai (JIRA)
Mit Desai created YARN-2365:
---

 Summary: TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry 
fails on branch-2
 Key: YARN-2365
 URL: https://issues.apache.org/jira/browse/YARN-2365
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Mit Desai


TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails on branch-2 with 
the following error:
{noformat}
Running 
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 46.471 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart)
  Time elapsed: 46.354 sec  <<< FAILURE!
java.lang.AssertionError: AppAttempt state is not correct (timedout) 
expected:<ALLOCATED> but was:<SCHEDULED>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:569)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:576)
at 
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:389)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2366) Speed up history server startup time

2014-07-28 Thread Siqi Li (JIRA)
Siqi Li created YARN-2366:
-

 Summary: Speed up history server startup time
 Key: YARN-2366
 URL: https://issues.apache.org/jira/browse/YARN-2366
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Siqi Li






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2366) Speed up history server startup time

2014-07-28 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-2366:
--

Description: When the history server starts up, it scans every history 
directory and puts all history files into a cache, but this cache only 
stores the 20K most recent history files. Therefore, it wastes a large portion 
of time loading old history files into the cache, and the startup time will 
keep increasing if we don't trim the number of history files. For example, when 
the history server starts up with 2.5M history files in HDFS, it takes ~5 minutes.
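One possible direction, shown as an illustrative sketch rather than what the attached patch necessarily does, is to cap the startup scan to the most recently modified files so only entries that can fit in the cache are loaded:
{code}
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class HistoryScanSketch {
  // Return at most cacheSize of the most recently modified files under dir,
  // so startup does not parse history files that would never fit in the cache.
  static FileStatus[] newestHistoryFiles(FileSystem fs, Path dir, int cacheSize)
      throws IOException {
    FileStatus[] all = fs.listStatus(dir);
    Arrays.sort(all, new Comparator<FileStatus>() {
      @Override
      public int compare(FileStatus a, FileStatus b) {
        return Long.compare(b.getModificationTime(), a.getModificationTime());
      }
    });
    return Arrays.copyOf(all, Math.min(cacheSize, all.length));
  }
}
{code}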

 Speed up history server startup time
 

 Key: YARN-2366
 URL: https://issues.apache.org/jira/browse/YARN-2366
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Siqi Li

 When the history server starts up, it scans every history directory and puts 
 all history files into a cache, but this cache only stores the 20K most recent 
 history files. Therefore, it wastes a large portion of time loading old history 
 files into the cache, and the startup time will keep increasing if we don't 
 trim the number of history files. For example, when the history server starts 
 up with 2.5M history files in HDFS, it takes ~5 minutes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2366) Speed up history server startup time

2014-07-28 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li reassigned YARN-2366:
-

Assignee: Siqi Li

 Speed up history server startup time
 

 Key: YARN-2366
 URL: https://issues.apache.org/jira/browse/YARN-2366
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Siqi Li
Assignee: Siqi Li
 Attachments: YARN-2366.v1.patch


 When the history server starts up, it scans every history directory and puts 
 all history files into a cache, but this cache only stores the 20K most recent 
 history files. Therefore, it wastes a large portion of time loading old history 
 files into the cache, and the startup time will keep increasing if we don't 
 trim the number of history files. For example, when the history server starts 
 up with 2.5M history files in HDFS, it takes ~5 minutes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2366) Speed up history server startup time

2014-07-28 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-2366:
--

Attachment: YARN-2366.v1.patch

 Speed up history server startup time
 

 Key: YARN-2366
 URL: https://issues.apache.org/jira/browse/YARN-2366
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Siqi Li
Assignee: Siqi Li
 Attachments: YARN-2366.v1.patch


 When the history server starts up, it scans every history directory and puts 
 all history files into a cache, but this cache only stores the 20K most recent 
 history files. Therefore, it wastes a large portion of time loading old history 
 files into the cache, and the startup time will keep increasing if we don't 
 trim the number of history files. For example, when the history server starts 
 up with 2.5M history files in HDFS, it takes ~5 minutes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2366) Speed up history server startup time

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076933#comment-14076933
 ] 

Hadoop QA commented on YARN-2366:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658247/YARN-2366.v1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4461//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4461//console

This message is automatically generated.

 Speed up history server startup time
 

 Key: YARN-2366
 URL: https://issues.apache.org/jira/browse/YARN-2366
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Siqi Li
Assignee: Siqi Li
 Attachments: YARN-2366.v1.patch


 When the history server starts up, it scans every history directory and puts 
 all history files into a cache, but this cache only stores the 20K most recent 
 history files. Therefore, it wastes a large portion of time loading old history 
 files into the cache, and the startup time will keep increasing if we don't 
 trim the number of history files. For example, when the history server starts 
 up with 2.5M history files in HDFS, it takes ~5 minutes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2363) Submitted applications occasionally lack a tracking URL

2014-07-28 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-2363:
-

Attachment: YARN-2363.patch

Quick patch that generates a default proxy URL if the user has access to the 
app but there isn't a current attempt.

 Submitted applications occasionally lack a tracking URL
 ---

 Key: YARN-2363
 URL: https://issues.apache.org/jira/browse/YARN-2363
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe
 Attachments: YARN-2363.patch


 Sometimes when an application is submitted the client receives no tracking 
 URL.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL

2014-07-28 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076984#comment-14076984
 ] 

Mit Desai commented on YARN-2363:
-

patch looks good to me.
+1 (non-binding)

 Submitted applications occasionally lack a tracking URL
 ---

 Key: YARN-2363
 URL: https://issues.apache.org/jira/browse/YARN-2363
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-2363.patch


 Sometimes when an application is submitted the client receives no tracking 
 URL.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077038#comment-14077038
 ] 

Jian He commented on YARN-2209:
---

Hi Zhijie, thanks for the review.  Here are some responses:
bq. Why is it necessary to use the exception instead of the flag to indicate 
the RM restarting? I
Because, as you can see, it is not just the allocate API; unregisterResponse 
would also need to carry an AMCommand otherwise. Basically, every AMS API other 
than register would require adding a new field. Throwing an exception is a much 
cleaner way.
bq. For example, MR of prior versions will no longer work properly with a YARN 
cluster after this patch during RM restarting.
No matter how the application reacts to the shutdown command, the NM will shoot 
down the AM container during RM restart. So prior applications (including MR) 
should still work. Even an earlier MR AM container is possibly killed by the NM 
before it actually performs any shutdown logic.
bq. Deprecate the enum type instead of each enum value?
Maybe we should not deprecate AMCommand, as we may add other commands later on 
as needed.
bq. Why not throwing ApplicationAttemptNotFoundException instead? It sounds 
more reasonable here, doesn’t it?
Do you mean creating a new ApplicationAttemptNotFoundException? I think it's 
fine to just reuse ApplicationNotFoundException, as they are quite similar. The 
internal exception message shows the attemptId.
bq. Is this change necessary?
It is, because the finally block (i.e. the if (allocateResponse == null) check) 
will be executed otherwise.
bq. shall we split the patch into two pieces: one for YARN and the other for MR,
Will split once the review is done. I think it'll be easier to review with both 
sides' changes together for more context.
bq. No need to break it into two lines, right?
Will fix it.
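For illustration, a hedged sketch of the AM-side handling under discussion: react to the new ApplicationMasterNotRegisteredException while still honoring the old RESYNC command from RMs that predate this change. allocate(), registerWithRM() and buildRequest() are placeholders, not actual YARN client methods:
{code}
import java.io.IOException;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.AMCommand;
import org.apache.hadoop.yarn.exceptions.ApplicationMasterNotRegisteredException;
import org.apache.hadoop.yarn.exceptions.YarnException;

abstract class ResyncAwareHeartbeatSketch {
  abstract AllocateResponse allocate(AllocateRequest req)
      throws YarnException, IOException;
  abstract void registerWithRM() throws YarnException, IOException;
  abstract AllocateRequest buildRequest();

  AllocateResponse heartbeatOnce() throws YarnException, IOException {
    try {
      AllocateResponse response = allocate(buildRequest());
      // Older RMs still signal restart via the resync command in the response.
      if (response.getAMCommand() == AMCommand.AM_RESYNC) {
        registerWithRM();
        return allocate(buildRequest());
      }
      return response;
    } catch (ApplicationMasterNotRegisteredException e) {
      // Newer RMs throw this exception instead of returning a command.
      registerWithRM();
      return allocate(buildRequest());
    }
  }
}
{code}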

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-07-28 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar reassigned YARN-2360:


Assignee: Ashwin Shankar

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar

 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077087#comment-14077087
 ] 

Hadoop QA commented on YARN-2363:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658270/YARN-2363.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4462//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4462//console

This message is automatically generated.

 Submitted applications occasionally lack a tracking URL
 ---

 Key: YARN-2363
 URL: https://issues.apache.org/jira/browse/YARN-2363
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-2363.patch


 Sometimes when an application is submitted the client receives no tracking 
 URL.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-07-28 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar updated YARN-2360:
-

Attachment: YARN-2360-v1.txt

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 YARN-2360-v1.txt


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-07-28 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar updated YARN-2360:
-

Attachment: Screen Shot 2014-07-28 at 1.12.19 PM.png

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 YARN-2360-v1.txt


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts

2014-07-28 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2354:


Attachment: YARN-2354-072814.patch

New patch, added log information. 

 DistributedShell may allocate more containers than client specified after it 
 restarts
 -

 Key: YARN-2354
 URL: https://issues.apache.org/jira/browse/YARN-2354
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Li Lu
 Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch


 To reproduce, run distributed shell with the -num_containers option.
 In ApplicationMaster.java, the following code has an issue:
 {code}
   int numTotalContainersToRequest =
 numTotalContainers - previousAMRunningContainers.size();
 for (int i = 0; i < numTotalContainersToRequest; ++i) {
   ContainerRequest containerAsk = setupContainerAskForRM();
   amRMClient.addContainerRequest(containerAsk);
 }
 numRequestedContainers.set(numTotalContainersToRequest);
 {code}
 numRequestedContainers doesn't account for the previous AM's requested 
 containers, so numRequestedContainers should be set to numTotalContainers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-07-28 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077125#comment-14077125
 ] 

Ashwin Shankar commented on YARN-2360:
--

Attached a screenshot and patch for the UI changes to display dynamic fair share.
Some comments on the UI changes:
1. I'm calling dynamic fair share "Current Fair Share" and static fair share 
"Guaranteed Fair Share".
2. Since dynamic fair share is a temporary fair share, I've represented it with 
a dashed border.
3. Changed the static fair share border to be solid rather than dashed.
4. Added Dynamic Fair Share/Current Fair Share to the tooltip.
5. Usage changes to orange when it goes above the dynamic/current fair share 
rather than the static fair share.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 YARN-2360-v1.txt


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077144#comment-14077144
 ] 

Hadoop QA commented on YARN-2360:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658291/YARN-2360-v1.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4463//console

This message is automatically generated.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 YARN-2360-v1.txt


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart

2014-07-28 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077178#comment-14077178
 ] 

Jason Lowe commented on YARN-1354:
--

Thanks for taking a look, Junping!

bq. what would happen if storeApplication(), finishApplication(), 
removeApplication() failed with application related information get 
inconsistent after restart?

If storeApplication fails then it will throw an IOException which will bubble 
up and fail the container start request on the client.  As long as we're unable 
to store a new application, containers for that application will not start, 
which I believe is the desired behavior.  That prevents the state store from 
being inconsistent in this particular scenario.

If finishApplication fails then the NM will proceed as if it did succeed but 
the state store will still have the application present.  This should be 
corrected when the NM restarts and registers with the RM with those 
applications still running.  The RM should correct the situation by telling the 
NM that the application has finished (see YARN-1885), and the NM will proceed 
to perform application finish processing (e.g.: log aggregation, etc.).  I 
think worst-case it will upload all of the app container logs again, but when 
it goes to rename to the final destination name that will fail because the name 
already exists.  Thus there could be some wasted work, but it should sort 
itself out and not do something catastrophic.

If removeApplication fails then the NM will proceed as if it did succeed but 
the state store will still have the application present.  This should be 
corrected when the NM finishes application processing (per above or if it was 
already recorded as finished) and it will again try to remove it from the state 
store.  As above I think there could be some unnecessary work performed, but I 
think in the end the application should eventually be removed from the NM on 
restart.  It could still remain in the state store if the second removal also 
fails, but a subsequent restart should behave the same.

bq. Do we need special warning if get failed on deserializing credential here?

I'm not sure how credential processing is fundamentally all that different from 
protocol buffer parsing which could also fail.  If the credentials can't be 
read then we can't recover the application.  Currently recovery errors are 
fatal to NM startup.  Do you have something specific in mind for handling the 
credentials if the writable changes (e.g.: some pseudo code to show the 
approach)?
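A small sketch of the failure-handling policy described above, written against a hypothetical state-store interface rather than the actual NM state-store API: store failures abort the container start, while removal failures are logged and reconciled later.
{code}
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class RecoveryPolicySketch {
  interface AppStateStore {
    void storeApplication(String appId) throws IOException;
    void removeApplication(String appId) throws IOException;
  }

  private static final Log LOG = LogFactory.getLog(RecoveryPolicySketch.class);
  private final AppStateStore store;

  RecoveryPolicySketch(AppStateStore store) {
    this.store = store;
  }

  void onStartContainer(String appId) throws IOException {
    // Propagate the error: the container must not start if the new
    // application cannot be persisted.
    store.storeApplication(appId);
  }

  void onApplicationRemoved(String appId) {
    try {
      store.removeApplication(appId);
    } catch (IOException e) {
      // Proceed anyway; the leftover entry is reconciled on a later pass
      // or after the next NM restart.
      LOG.warn("Failed to remove " + appId + " from the state store", e);
    }
  }
}
{code}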

 Recover applications upon nodemanager restart
 -

 Key: YARN-1354
 URL: https://issues.apache.org/jira/browse/YARN-1354
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1354-v1.patch, 
 YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, 
 YARN-1354-v4.patch, YARN-1354-v5.patch


 The set of active applications in the nodemanager context need to be 
 recovered for work-preserving nodemanager restart



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077192#comment-14077192
 ] 

Junping Du commented on YARN-2209:
--

bq. I think users are expected to handle two types of exceptions YarnException 
and IOException. In that sense, this is equivalent to throwing a new type of 
exception which should be fine?
No. The customized AM code could previously get RESYNC from the response (like 
what we originally do in AMRMClient) to handle the AM re-registering case. Now, 
it cannot get this RESYNC, so it could fail to re-register with the restarted 
RM. Do I miss anything here?

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 that the application should re-register on RM restart. We should do the same 
 for the AMS#allocate call as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common

2014-07-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077198#comment-14077198
 ] 

Junping Du commented on YARN-2347:
--

[~zjshen], can you help to review it again? Thx!

 Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in 
 yarn-server-common
 

 Key: YARN-2347
 URL: https://issues.apache.org/jira/browse/YARN-2347
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Junping Du
Assignee: Junping Du
 Attachments: YARN-2347-v2.patch, YARN-2347-v3.patch, 
 YARN-2347-v4.patch, YARN-2347-v5.patch, YARN-2347.patch


 We have similar things for version state for RM, NM, TS (TimelineServer), 
 etc. I think we should consolidate them into a common object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077204#comment-14077204
 ] 

Jian He commented on YARN-2209:
---

bq. The customized AM code could get RESYNC from response previously (like what 
we original do in AMRMClient) to handle AM re-registering case.
Previously, the AM didn't re-register. Re-registering on RM restart is a new 
requirement coming out of YARN-556.

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 application to re-register on RM restart. we should do the same for 
 AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-07-28 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077211#comment-14077211
 ] 

Ashwin Shankar commented on YARN-2360:
--

Expected -1 from Jenkins since patch depends on unresolved YARN-2026.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 YARN-2360-v1.txt


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-07-28 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077218#comment-14077218
 ] 

Ashwin Shankar commented on YARN-2026:
--

[~kasha], [~sandyr], did you have any comments on the latest patch?
I also made UI changes and attached a screenshot in YARN-2360 showing the 
static/dynamic fair share.
Can you please take a look at that as well?

 Fair scheduler : Fair share for inactive queues causes unfair allocation in 
 some scenarios
 --

 Key: YARN-2026
 URL: https://issues.apache.org/jira/browse/YARN-2026
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt


 Problem 1 - When using hierarchical queues in the fair scheduler, there are 
 scenarios where a leaf queue with the least fair share can take the majority 
 of the cluster and starve a sibling parent queue that has a greater 
 weight/fair share, and preemption doesn't kick in to reclaim resources.
 The root cause seems to be that the fair share of a parent queue is 
 distributed to all its children irrespective of whether a child is active or 
 inactive (no apps running). Preemption based on fair share kicks in only if 
 the usage of a queue is less than 50% of its fair share and it has demands 
 greater than that. When there are many queues under a parent queue (with a 
 high fair share), each child queue's fair share becomes really low. As a 
 result, when only a few of these child queues have apps running, they reach 
 their *tiny* fair share quickly and preemption doesn't happen even if other 
 (non-sibling) leaf queues are hogging the cluster.
 This can be solved by dividing the parent queue's fair share only among 
 active child queues.
 Here is an example describing the problem and proposed solution:
 root.lowPriorityQueue is a leaf queue with weight 2
 root.HighPriorityQueue is parent queue with weight 8
 root.HighPriorityQueue has 10 child leaf queues : 
 root.HighPriorityQueue.childQ(1..10)
 The above config results in root.HighPriorityQueue having an 80% fair share, 
 and each of its ten child queues would have an 8% fair share. Preemption 
 would kick in for a child queue only if its usage fell below 4% (0.5*8=4). 
 Let's say that at the moment no apps are running in any of the 
 root.HighPriorityQueue.childQ(1..10) queues and a few apps are running in 
 root.lowPriorityQueue, which is taking up 95% of the cluster.
 Up to this point, the behavior of FS is correct.
 Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 
 30% of the cluster. It would get only the available 5% of the cluster, and 
 preemption wouldn't kick in since it is above 4% (half its fair share). This 
 is bad considering childQ1 is under a high-priority parent queue which has an 
 *80% fair share*.
 Until root.lowPriorityQueue starts relinquishing containers, we would see the 
 following allocation on the scheduler page:
 *root.lowPriorityQueue = 95%*
 *root.HighPriorityQueue.childQ1=5%*
 This can be solved by distributing a parent’s fair share only to active 
 queues.
 So in the example above,since childQ1 is the only active queue
 under root.HighPriorityQueue, it would get all its parent’s fair share i.e. 
 80%.
 This would cause preemption to reclaim the 30% needed by childQ1 from 
 root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
 Problem 2 - Also note that a similar situation can happen between 
 root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 
 hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be 
 stuck at 5% until childQ2 starts relinquishing containers. We would like each 
 of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, 
 i.e. 40%, which would ensure childQ1 gets up to 40% of resources if needed 
 through preemption.
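
(A minimal sketch, not from any attached patch, of the active-queues-only 
distribution described above; the queue records here are hypothetical.)
{code:title=Hedged sketch of distributing a parent's fair share over active children|borderStyle=solid}
import java.util.List;

public class ActiveFairShare {
  // Hypothetical minimal queue record: weight plus whether apps are running.
  static class Queue {
    final double weight;
    final boolean active;
    double fairShare;
    Queue(double weight, boolean active) {
      this.weight = weight;
      this.active = active;
    }
  }

  // Distribute parentShare only across children that currently have apps running.
  static void distribute(double parentShare, List<Queue> children) {
    double activeWeight = 0;
    for (Queue q : children) {
      if (q.active) {
        activeWeight += q.weight;
      }
    }
    for (Queue q : children) {
      q.fairShare = (q.active && activeWeight > 0)
          ? parentShare * q.weight / activeWeight : 0;
    }
  }
}
{code}
With ten equal-weight children and only childQ1 active, childQ1 would receive 
the full 80% share from the example above.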



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions

2014-07-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077235#comment-14077235
 ] 

Junping Du commented on YARN-2209:
--

bq. Previously, the AM did not re-register at all. Re-registering on RM restart 
is a new requirement coming out of YARN-556.
Was RESYNC also added in YARN-556? If so, I think this is a reasonable change, 
and I suggest removing RESYNC completely (not just deprecating it) before this 
feature gets released. 

 Replace AM resync/shutdown command with corresponding exceptions
 

 Key: YARN-2209
 URL: https://issues.apache.org/jira/browse/YARN-2209
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
 YARN-2209.4.patch, YARN-2209.5.patch


 YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
 application to re-register on RM restart. we should do the same for 
 AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-28 Thread Subramaniam Venkatraman Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077242#comment-14077242
 ] 

Subramaniam Venkatraman Krishnan commented on YARN-1707:


[~wangda] Thanks for the very detailed comments. I agree that understanding the 
context is essential, and I'm glad to help with that. Overall your understanding 
is spot on; please find answers to your questions below: 

1) Yes, it is possible to have multiple PlanQueues (e.g., if two organizations 
want to dynamically allocate their resources but not share them). This also 
makes it easy to try reservations on a small scale and slowly ramp up at each 
org's pace.
2) The extra confs are needed to automate the initialization of key parameters 
of the dynamic ReservationQueues (without requiring full specification of each 
of those).
3) Correct
4) Correct
5) First: the Plan guarantees that the sum of reservations never exceed 
available resources (replanning if needed to maintain this invariant to handle 
failures). On the other hand, like it happens for normal scheduler we can 
leverage overcapacity to guarantee high cluster utilization. More precisely, 
depending on the configuration (or dynamically on whether reservations have 
gang semantics or not) we can allow resources allocated to PlanQueue and 
ReservationQueue to exceed their guaranteed capacity (i.e., set the dynamic 
max-capacity above the guaranteed one). In this case preemption might kick in 
if other apps with more rights to resources have pending asks. Part of the 
changes in YARN-1957 were driven by this.
6) To limit the scope of changes, we agreed to have a follow-up JIRA to address 
HA. The intuition we have is that it is sufficient to persist the Plan alone. 
During recovery, the _Plan Follower_ will resync the Plan with the scheduler by 
creating the dynamic queues for currently active reservations. We will be happy 
to have your input when we work on the HA JIRA.

[~curino] will answer your questions specific to this JIRA.

 Making the CapacityScheduler more dynamic
 -

 Key: YARN-1707
 URL: https://issues.apache.org/jira/browse/YARN-1707
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler
 Attachments: YARN-1707.patch


 The CapacityScheduler is rather static at the moment, and refreshqueue 
 provides a rather heavy-handed way to reconfigure it. Moving towards 
 long-running services (tracked in YARN-896) and to enable more advanced 
 admission control and resource parcelling, we need to make the 
 CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA 
 YARN-1051.
 Concretely this requires the following changes:
 * create queues dynamically
 * destroy queues dynamically
 * dynamically change queue parameters (e.g., capacity) 
 * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
 instead of == 100%
 We limit this to LeafQueues. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077254#comment-14077254
 ] 

Wangda Tan commented on YARN-1707:
--

Hi [~subru], 
Thanks for your elaboration, it is very helpful for me to understand the 
background.

Regards,
Wangda


 Making the CapacityScheduler more dynamic
 -

 Key: YARN-1707
 URL: https://issues.apache.org/jira/browse/YARN-1707
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler
 Attachments: YARN-1707.patch


 The CapacityScheduler is rather static at the moment, and refreshqueue 
 provides a rather heavy-handed way to reconfigure it. Moving towards 
 long-running services (tracked in YARN-896) and to enable more advanced 
 admission control and resource parcelling, we need to make the 
 CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA 
 YARN-1051.
 Concretely this requires the following changes:
 * create queues dynamically
 * destroy queues dynamically
 * dynamically change queue parameters (e.g., capacity) 
 * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
 instead of == 100%
 We limit this to LeafQueues. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-28 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077256#comment-14077256
 ] 

Carlo Curino commented on YARN-1707:


Thanks again for the fast and insightful feedback. 

*Regarding how the patch matches the JIRA:*
Our initial implementation was indeed making the changes (i.e., the dynamic 
behaviors) in ParentQueue and LeafQueue themselves. Previous feedback pushed us 
to introduce subclasses so as to, in a sense, isolate the dynamic changes. I 
think we can go back to the version modifying directly ParentQueue and 
LeafQueue if there is consensus. #4 is required because we cannot 
transactionally “add Q1, resize Q2” so that the invariant “size of children is 
== 100%” is maintained. As a consequence we must relax the constraints (either 
in ParentQueue if we remove the hierarchy, or as it is today in PlanQueue).  
The good news is that the percentages from the configuration are not 
interpreted as actual percentages, but rather used as relative weights 
(ranking queues by used_resources / guaranteed_resources). This means that even 
a careless admin will not leave resources unused. For example, if we set two 
queues to 10,40 (i.e., something that doesn't add up to 100), the behavior is 
equivalent to setting them to 20,80 (as they are used only for relative ranking 
of siblings). I think this is also ok for hierarchies (worth double checking 
this part).
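
To make the relative-weight point concrete, a quick illustration (plain 
normalization, not the actual scheduler code):
{code:title=Relative-weight illustration|borderStyle=solid}
public class WeightExample {
  public static void main(String[] args) {
    double[] configured = {10, 40};   // capacities that do not add up to 100
    double sum = 0;
    for (double c : configured) {
      sum += c;
    }
    for (int i = 0; i < configured.length; i++) {
      // the effective share is the configured value relative to its siblings
      System.out.printf("queue%d -> %.0f%%%n", i + 1, 100 * configured[i] / sum);
    }
    // prints queue1 -> 20% and queue2 -> 80%, matching the 10,40 vs 20,80 example
  }
}
{code}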

So all in all we can pull up to {{ParentQueue}} and {{LeafQueue}} all the 
dynamic behavior if there is consensus that this is the right path.

*Regarding move:*
1) Good catch... We will wait for feedback from Jian on this.
2) I think we had that at some point and it did not work correctly. We will try 
again.
3) There are a few invariants we do not check. {{MaxApplicationsPerUser}} is one 
of them, but there is also how many applications can be active in the target 
queue, etc. As I was mentioning in my previous comment, this is likely fine for 
the limited usage we will make of this from {{ReservationSystem}}, but it is 
worth expanding the checks we make (see 
{{FairScheduler.verifyMoveDoesNotViolateConstraints(..)}}) before exposing move 
to users via the CLI.


 Making the CapacityScheduler more dynamic
 -

 Key: YARN-1707
 URL: https://issues.apache.org/jira/browse/YARN-1707
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler
 Attachments: YARN-1707.patch


 The CapacityScheduler is rather static at the moment, and refreshqueue 
 provides a rather heavy-handed way to reconfigure it. Moving towards 
 long-running services (tracked in YARN-896) and to enable more advanced 
 admission control and resource parcelling, we need to make the 
 CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA 
 YARN-1051.
 Concretely this requires the following changes:
 * create queues dynamically
 * destroy queues dynamically
 * dynamically change queue parameters (e.g., capacity) 
 * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
 instead of == 100%
 We limit this to LeafQueues. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback

2014-07-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077262#comment-14077262
 ] 

Wangda Tan commented on YARN-415:
-

Hi [~eepayne],
Thanks for updating your patch.
For the e2e test, I think we can do it this way: you can refer to the tests in 
TestRMRestart.
Using MockRM/MockAM can cover such a test; even though it's not a complete e2e 
test, most of the logic is included in it. I suggest we cover the following 
cases:
{code}
* Create an app; before the AM is submitted, resource utilization should be 0
* Submit the AM; while the AM is running, we can get its resource utilization > 0
* Allocate some containers and finish them; check total resource utilization
* Finish the application attempt and check total resource utilization
* Start a new application attempt; check that the resource utilization of the 
previous attempt is added to the total resource utilization
* Check that resource utilization can be persisted/read across RM restart
{code}
Do you have any comments on this?

Thanks,
Wangda

 Capture memory utilization at the app-level for chargeback
 --

 Key: YARN-415
 URL: https://issues.apache.org/jira/browse/YARN-415
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp
Assignee: Andrey Klochkov
 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
 YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
 YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
 YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
 YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
 YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
 YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
 YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.patch


 For the purpose of chargeback, I'd like to be able to compute the cost of an
 application in terms of cluster resource usage.  To start out, I'd like to 
 get the memory utilization of an application.  The unit should be MB-seconds 
 or something similar and, from a chargeback perspective, the memory amount 
 should be the memory reserved for the application, as even if the app didn't 
 use all that memory, no one else was able to use it.
 (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
 container 2 * lifetime of container 2) + ... + (reserved ram for container n 
 * lifetime of container n)
 It'd be nice to have this at the app level instead of the job level because:
 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
 appear on the job history server).
 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
 This new metric should be available both through the RM UI and RM Web 
 Services REST API.
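
(A rough illustration of the MB-seconds formula above, using hypothetical 
container records; it is not the implementation in any attached patch.)
{code:title=Hedged MB-seconds illustration|borderStyle=solid}
public class MemorySecondsExample {
  // Hypothetical record of one container's reservation and lifetime.
  static class ContainerUsage {
    final long reservedMB;
    final long startMillis;
    final long finishMillis;
    ContainerUsage(long reservedMB, long startMillis, long finishMillis) {
      this.reservedMB = reservedMB;
      this.startMillis = startMillis;
      this.finishMillis = finishMillis;
    }
  }

  // Sum of (reserved MB * lifetime in seconds) over all containers of the app.
  static long memorySeconds(Iterable<ContainerUsage> containers) {
    long total = 0;
    for (ContainerUsage c : containers) {
      total += c.reservedMB * ((c.finishMillis - c.startMillis) / 1000);
    }
    return total;
  }
}
{code}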



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2367) Make ResourceCalculator configurable for FairScheduler and FifoScheduler like CapacityScheduler

2014-07-28 Thread Swapnil Daingade (JIRA)
Swapnil Daingade created YARN-2367:
--

 Summary: Make ResourceCalculator configurable for FairScheduler 
and FifoScheduler like CapacityScheduler
 Key: YARN-2367
 URL: https://issues.apache.org/jira/browse/YARN-2367
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.1, 2.3.0, 2.2.0
Reporter: Swapnil Daingade
Priority: Minor


The ResourceCalculator used by the CapacityScheduler is read from the 
yarn.scheduler.capacity.resource-calculator entry in capacity-scheduler.xml. 
This allows custom implementations of the ResourceCalculator interface to be 
plugged in. It would be nice to have the same functionality in the 
FairScheduler and FifoScheduler.
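
For illustration, the CapacityScheduler-style lookup could presumably be 
mirrored; a hedged sketch follows, where the fair-scheduler key name is made up 
for the example (only yarn.scheduler.capacity.resource-calculator exists today):
{code:title=Hedged sketch with a hypothetical config key|borderStyle=solid}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;

public class CalculatorLoader {
  // Read a ResourceCalculator class name from the configuration and instantiate
  // it reflectively, mirroring the CapacityScheduler approach.
  static ResourceCalculator load(Configuration conf) {
    Class<? extends ResourceCalculator> clazz = conf.getClass(
        "yarn.scheduler.fair.resource-calculator",   // hypothetical key
        DefaultResourceCalculator.class, ResourceCalculator.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}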



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1826) TestDirectoryCollection intermittent failures

2014-07-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077282#comment-14077282
 ] 

Tsuyoshi OZAWA commented on YARN-1826:
--

Thank you for commenting, Wangda. Vinod is fixing this problem in YARN-1979. 
Closing this as a duplicate.

 TestDirectoryCollection intermittent failures
 -

 Key: YARN-1826
 URL: https://issues.apache.org/jira/browse/YARN-1826
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Tsuyoshi OZAWA

 testCreateDirectories fails intermittently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077285#comment-14077285
 ] 

Xuan Gong commented on YARN-1994:
-

+1 LGTM

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, 
 YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1826) TestDirectoryCollection intermittent failures

2014-07-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA resolved YARN-1826.
--

Resolution: Duplicate

 TestDirectoryCollection intermittent failures
 -

 Key: YARN-1826
 URL: https://issues.apache.org/jira/browse/YARN-1826
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Tsuyoshi OZAWA

 testCreateDirectories fails intermittently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077296#comment-14077296
 ] 

Wangda Tan commented on YARN-1707:
--

Hi [~curino],
Thanks for your reply,
Regarding how the patch matches the JIRA:
Since I don't have other solid use cases in mind where anything besides 
{{ReservationSystem}} could leverage these features, I don't have a strong 
opinion on merging such dynamic behaviors into {{ParentQueue}} and 
{{LeafQueue}}. Let's wait for more feedback.
I agree that we can consider queue capacity as a weight; it will be easier 
for users to configure, and it's also a backward-compatible change (except that 
it will no longer throw an exception when the sum over the children of a 
{{ParentQueue}} doesn't equal 100).

bq. As I was mentioning in my previous comment, this is likely fine for the 
limited usage we will make of this from ReservationSystem
I think moving an application across queues is not a ReservationSystem-specific 
change. I would suggest checking that it will not violate the restrictions of 
the target queue before moving it.

Thanks,
Wangda

 Making the CapacityScheduler more dynamic
 -

 Key: YARN-1707
 URL: https://issues.apache.org/jira/browse/YARN-1707
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler
 Attachments: YARN-1707.patch


 The CapacityScheduler is rather static at the moment, and refreshqueue 
 provides a rather heavy-handed way to reconfigure it. Moving towards 
 long-running services (tracked in YARN-896) and to enable more advanced 
 admission control and resource parcelling, we need to make the 
 CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA 
 YARN-1051.
 Concretely this requires the following changes:
 * create queues dynamically
 * destroy queues dynamically
 * dynamically change queue parameters (e.g., capacity) 
 * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
 instead of == 100%
 We limit this to LeafQueues. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1979) TestDirectoryCollection fails when the umask is unusual

2014-07-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1979:
-

Attachment: YARN-1979.2.patch

This JIRA seems to have been forgotten, so let me update the patch. I just 
removed the lines [~djp] mentioned.

 TestDirectoryCollection fails when the umask is unusual
 ---

 Key: YARN-1979
 URL: https://issues.apache.org/jira/browse/YARN-1979
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Attachments: YARN-1979.2.patch, YARN-1979.txt


 I've seen this fail in Windows where the default permissions are matching up 
 to 700.
 {code}
 ---
 Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
 ---
 Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec  
 FAILURE! - in 
 org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
 testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection)
   Time elapsed: 0.422 sec   FAILURE!
 java.lang.AssertionError: local dir parent 
 Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA
   not created with proper permissions expected:<rwxr-xr-x> but was:<rwx------>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 at 
 org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106)
 {code}
 The clash is between testDiskSpaceUtilizationLimit() and 
 testCreateDirectories().



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2215) Add preemption info to REST/CLI

2014-07-28 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2215:
-

Assignee: Kenji Kikushima

 Add preemption info to REST/CLI
 ---

 Key: YARN-2215
 URL: https://issues.apache.org/jira/browse/YARN-2215
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Reporter: Wangda Tan
Assignee: Kenji Kikushima
 Attachments: YARN-2215.patch


 As discussed in YARN-2181, we'd better to add preemption info to RM RESTful 
 API/CLI to make administrator/user get more understanding about preemption 
 happened on app/queue, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2215) Add preemption info to REST/CLI

2014-07-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077303#comment-14077303
 ] 

Wangda Tan commented on YARN-2215:
--

Hi [~kj-ki],
Thanks for working on this; I've assigned this JIRA to you. 
I think the fields you added should be fine. Within the scope of this JIRA, I 
think it's better to add CLI support as well. Please submit the patch to kick 
off Jenkins when you have completed it.

Wangda


 Add preemption info to REST/CLI
 ---

 Key: YARN-2215
 URL: https://issues.apache.org/jira/browse/YARN-2215
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client, resourcemanager
Reporter: Wangda Tan
Assignee: Kenji Kikushima
 Attachments: YARN-2215.patch


 As discussed in YARN-2181, we'd better to add preemption info to RM RESTful 
 API/CLI to make administrator/user get more understanding about preemption 
 happened on app/queue, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1979) TestDirectoryCollection fails when the umask is unusual

2014-07-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077319#comment-14077319
 ] 

Hadoop QA commented on YARN-1979:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658331/YARN-1979.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4465//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4465//console

This message is automatically generated.

 TestDirectoryCollection fails when the umask is unusual
 ---

 Key: YARN-1979
 URL: https://issues.apache.org/jira/browse/YARN-1979
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Attachments: YARN-1979.2.patch, YARN-1979.txt


 I've seen this fail in Windows where the default permissions are matching up 
 to 700.
 {code}
 ---
 Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
 ---
 Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec  
 FAILURE! - in 
 org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
 testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection)
   Time elapsed: 0.422 sec   FAILURE!
 java.lang.AssertionError: local dir parent 
 Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA
   not created with proper permissions expected:<rwxr-xr-x> but was:<rwx------>
 at org.junit.Assert.fail(Assert.java:93)
 at org.junit.Assert.failNotEquals(Assert.java:647)
 at org.junit.Assert.assertEquals(Assert.java:128)
 at 
 org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106)
 {code}
 The clash is between testDiskSpaceUtilizationLimit() and 
 testCreateDirectories().



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077401#comment-14077401
 ] 

Arpit Agarwal commented on YARN-1994:
-

+1 from me modulo one question. Why is the following logic needed only for 
ContainerManagerImpl.java? I probably knew this but can't recall now.

{code}

InetSocketAddress connectAddress;
String connectHost = conf.getTrimmed(YarnConfiguration.NM_ADDRESS);
if (connectHost == null || connectHost.isEmpty()) {
  // Get hostname and port from the listening endpoint.
  connectAddress = NetUtils.getConnectAddress(server);
} else {
  // Combine the configured hostname with the port from the listening
  // endpoint. This gets the correct port number if the configuration
  // specifies an ephemeral port (port number 0).
  connectAddress = NetUtils.getConnectAddress(
  new InetSocketAddress(connectHost.split(":")[0],
server.getListenerAddress().getPort()));
}
{code}

Thanks.

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, 
 YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM

2014-07-28 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-611:
---

Attachment: YARN-611.4.rebase.patch

rebased on the latest trunk

 Add an AM retry count reset window to YARN RM
 -

 Key: YARN-611
 URL: https://issues.apache.org/jira/browse/YARN-611
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Chris Riccomini
Assignee: Xuan Gong
 Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
 YARN-611.4.patch, YARN-611.4.rebase.patch


 YARN currently has the following config:
 yarn.resourcemanager.am.max-retries
 This config defaults to 2, and defines how many times to retry a failed AM 
 before failing the whole YARN job. YARN counts an AM as failed if the node 
 that it was running on dies (the NM will timeout, which counts as a failure 
 for the AM), or if the AM dies.
 This configuration is insufficient for long running (or infinitely running) 
 YARN jobs, since the machine (or NM) that the AM is running on will 
 eventually need to be restarted (or the machine/NM will fail). In such an 
 event, the AM has not done anything wrong, but this is counted as a failure 
 by the RM. Since the retry count for the AM is never reset, eventually, at 
 some point, the number of machine/NM failures will result in the AM failure 
 count going above the configured value for 
 yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
 job as failed, and shut it down. This behavior is not ideal.
 I propose that we add a second configuration:
 yarn.resourcemanager.am.retry-count-window-ms
 This configuration would define a window of time that would define when an AM 
 is well behaved, and it's safe to reset its failure count back to zero. 
 Every time an AM fails, the RmAppImpl would check the last time that the AM 
 failed. If the last failure was less than retry-count-window-ms ago, and the 
 new failure count is > max-retries, then the job should fail. If the AM has 
 never failed, the retry count is < max-retries, or if the last failure was 
 OUTSIDE the retry-count-window-ms, then the job should be restarted. 
 Additionally, if the last failure was outside the retry-count-window-ms, then 
 the failure count should be set back to 0.
 This would give developers a way to have well-behaved AMs run forever, while 
 still failing mis-behaving AMs after a short period of time.
 I think the work to be done here is to change the RmAppImpl to actually look 
 at app.attempts, and see if there have been more than max-retries failures in 
 the last retry-count-window-ms milliseconds. If there have, then the job 
 should fail, if not, then the job should go forward. Additionally, we might 
 also need to add an endTime in either RMAppAttemptImpl or 
 RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
 failure.
 Thoughts?
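
(A minimal sketch of the proposed window check, with hypothetical method and 
parameter names rather than the actual RMAppImpl change.)
{code:title=Hedged sketch of the retry-window check|borderStyle=solid}
public class RetryWindow {
  // Decide whether a failed AM should be retried. windowMs corresponds to the
  // proposed yarn.resourcemanager.am.retry-count-window-ms.
  static boolean shouldRetry(long nowMillis, long lastFailureMillis,
      int failureCount, int maxRetries, long windowMs) {
    boolean outsideWindow = (nowMillis - lastFailureMillis) > windowMs;
    if (outsideWindow) {
      // The AM was well behaved for a whole window: the caller would also reset
      // the failure count back to 0 here.
      return true;
    }
    // The failure happened within the window: fail the app only once the count
    // exceeds max-retries.
    return failureCount <= maxRetries;
  }
}
{code}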



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces

2014-07-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077416#comment-14077416
 ] 

Xuan Gong commented on YARN-1994:
-

I think it is because connectAddress is needed for generating the nodeId. With 
this patch, we will bind the NM server to the NM_BIND address, but we still need 
the real NM address to generate the nodeId.
[~cwelch] Could you confirm whether that is the reason?

 Expose YARN/MR endpoints on multiple interfaces
 ---

 Key: YARN-1994
 URL: https://issues.apache.org/jira/browse/YARN-1994
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Arpit Agarwal
Assignee: Craig Welch
 Attachments: YARN-1994.0.patch, YARN-1994.1.patch, 
 YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, 
 YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch


 YARN and MapReduce daemons currently do not support specifying a wildcard 
 address for the server endpoints. This prevents the endpoints from being 
 accessible from all interfaces on a multihomed machine.
 Note that if we do specify INADDR_ANY for any of the options, it will break 
 clients as they will attempt to connect to 0.0.0.0. We need a solution that 
 allows specifying a hostname or IP-address for clients while requesting 
 wildcard bind for the servers.
 (List of endpoints is in a comment below)



--
This message was sent by Atlassian JIRA
(v6.2#6252)