[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2

2020-01-28 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025156#comment-17025156
 ] 

Thomas Graves commented on YARN-8200:
-

Hey [~jhung] ,

I am trying out the GPU scheduling in Hadoop 2.10 and the first thing I noticed 
is that it doesn't error properly if you ask for too many GPUs. It seems to 
happily say it gave them to me, although I think it's really giving me the 
maximum configured.  Is this a known issue already, or did the configuration 
change?

I have the GPU maximum configured at 4 and I try to allocate 8. On Hadoop 3 I get:

 

Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException):
Invalid resource request, requested resource type=[yarn.io/gpu] < 0 or greater 
than maximum allowed allocation. Requested resource=, maximum allowed 
allocation=, please note that maximum allowed allocation is calculated by 
scheduler based on maximum resource of registered NodeManagers, which might be 
less than configured maximum allocation=

 

On Hadoop 2.10 I get a container allocated, but the logs and UI say it only has 
4 GPUs. 
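
For reference, a minimal sketch of the kind of container ask involved here, 
assuming the resource-types API backported in this JIRA and yarn.io/gpu 
registered in resource-types.xml; this is illustrative, not the exact test code:

{noformat}
// Hypothetical sketch of the ask: 8 yarn.io/gpu on a cluster whose configured
// GPU maximum is 4. On Hadoop 3.x the RM rejects this with
// InvalidResourceRequestException; the question above is why 2.10 appears to
// accept it and silently cap it instead.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class GpuAskSketch {
  public static void main(String[] args) {
    // Assumes yarn.io/gpu is registered via resource-types.xml on the client.
    Resource capability = Resource.newInstance(4096, 4); // 4 GB, 4 vcores
    capability.setResourceValue("yarn.io/gpu", 8);       // more than the max of 4
    ContainerRequest ask =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    System.out.println("asking for: " + ask.getCapability());
  }
}
{noformat}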

> Backport resource types/GPU features to branch-3.0/branch-2
> ---
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: YARN-8200-branch-2.001.patch, 
> YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch, 
> YARN-8200-branch-3.0.001.patch, 
> counter.scheduler.operation.allocate.csv.defaultResources, 
> counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support 
> deep learning workloads. However, our main production clusters are running 
> older versions of branch-2 (2.7 in our case). To prevent supporting too many 
> very different hadoop versions across multiple clusters, we would like to 
> backport the resource types/resource profiles feature to branch-2, as well as 
> the GPU specific support.
>  
> We have done a trial backport of YARN-3926 and some miscellaneous patches in 
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth. 
> We also did a trial backport of most of YARN-6223 (sans docker support).
>  
> Regarding the backports, perhaps we can do the development in a feature 
> branch and then merge to branch-2 when ready.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2

2020-01-28 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025285#comment-17025285
 ] 

Thomas Graves commented on YARN-8200:
-

After messing with this a bit more, I removed the maximum allocation 
configurations after seeing that the documentation didn't include them for the 
2.10 release, so I removed this setting:

<property>
  <name>yarn.resource-types.yarn.io/gpu.maximum-allocation</name>
  <value>4</value>
</property>

Now YARN doesn't allocate me a container unless it has fulfilled all of the 
GPUs I requested.  So in this case my NodeManager has 4 GPUs, and if I request 
5 it just hangs waiting to fulfill the request. This behavior is much better 
than giving me a container with fewer GPUs than I requested.
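
For completeness, a sketch of the two configuration states being compared, 
written against the generic Configuration API rather than the actual 
resource-types.xml/yarn-site.xml files; the property names are the ones 
discussed above:

{noformat}
// Sketch only: contrasts the two setups tried in this thread. With the
// per-type maximum set, an over-sized ask appeared to be silently capped on
// 2.10; with it removed, the maximum falls back to what registered
// NodeManagers report and an over-sized ask just waits unfulfilled.
import org.apache.hadoop.conf.Configuration;

public class GpuMaxAllocationSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("yarn.resource-types", "yarn.io/gpu");      // register the type
    // Setup 1: explicit per-type maximum (the setting removed above).
    conf.setLong("yarn.resource-types.yarn.io/gpu.maximum-allocation", 4);
    // Setup 2: omit the line above; the effective maximum is then derived
    // from the registered NodeManagers (4 GPUs per node in this test).
    System.out.println(
        conf.get("yarn.resource-types.yarn.io/gpu.maximum-allocation"));
  }
}
{noformat}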

 

> Backport resource types/GPU features to branch-3.0/branch-2
> ---
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: YARN-8200-branch-2.001.patch, 
> YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch, 
> YARN-8200-branch-3.0.001.patch, 
> counter.scheduler.operation.allocate.csv.defaultResources, 
> counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support 
> deep learning workloads. However, our main production clusters are running 
> older versions of branch-2 (2.7 in our case). To prevent supporting too many 
> very different hadoop versions across multiple clusters, we would like to 
> backport the resource types/resource profiles feature to branch-2, as well as 
> the GPU specific support.
>  
> We have done a trial backport of YARN-3926 and some miscellaneous patches in 
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth. 
> We also did a trial backport of most of YARN-6223 (sans docker support).
>  
> Regarding the backports, perhaps we can do the development in a feature 
> branch and then merge to branch-2 when ready.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (YARN-9055) Capacity Scheduler: allow larger queue level maximum-allocation-mb to override the cluster configuration

2018-11-27 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700511#comment-16700511
 ] 

Thomas Graves commented on YARN-9055:
-

It would definitely be a change in behavior which could surprise people with 
existing configurations.  I do think it's easier to have it this way so you 
don't have to configure all the queues.  I don't remember all the details on 
why I did it this way; I think it was mostly to not break the existing 
functionality of the cluster-level max.
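
For readers following along, a sketch of the two settings in question: the 
cluster-wide maximum in yarn-site.xml versus the per-queue override from 
YARN-1582 in capacity-scheduler.xml. The queue name here is made up:

{noformat}
// Sketch of the relationship discussed above, using a hypothetical
// "largecontainers" queue. Today the queue-level value must not exceed the
// cluster-level value, or the check quoted in the description below throws
// IllegalArgumentException; YARN-9055 proposes allowing the queue to override
// it upward.
import org.apache.hadoop.conf.Configuration;

public class QueueMaxAllocationSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Cluster-wide ceiling (yarn-site.xml).
    conf.setInt("yarn.scheduler.maximum-allocation-mb", 16384);
    // Per-queue override (capacity-scheduler.xml); larger than the cluster
    // ceiling, which the current check rejects.
    conf.setInt(
        "yarn.scheduler.capacity.root.largecontainers.maximum-allocation-mb",
        122880);
    System.out.println(conf.getInt("yarn.scheduler.maximum-allocation-mb", -1));
  }
}
{noformat}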

> Capacity Scheduler: allow larger queue level maximum-allocation-mb to 
> override the cluster configuration
> 
>
> Key: YARN-9055
> URL: https://issues.apache.org/jira/browse/YARN-9055
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9055.1.patch
>
>
> YARN-1582 adds the support of maximum-allocation-mb configuration per queue. 
> That feature gives the flexibility to give different memory requirements for 
> different queues. Such patch adds the limitation that the queue level 
> configuration can't exceed the cluster level default configuration, but I 
> feel it may make more sense to remove such limitation to allow any overrides 
> since 
> # Such configuration is controlled by the admin so it shouldn't get abused; 
> # It's common that typical queues require standard size containers while some 
> job (queues) have requirements for larger containers. With current 
> limitation, we have to set larger configuration on the cluster setting which 
> will cause resource abuse unless we override them on all the queues.
> We can remove such limitation in CapacitySchedulerConfiguration.java so the 
> cluster setting provides the default value and queue setting can override it. 
> {noformat}
>if (maxAllocationMbPerQueue > clusterMax.getMemorySize()
> || maxAllocationVcoresPerQueue > clusterMax.getVirtualCores()) {
>   throw new IllegalArgumentException(
>   "Queue maximum allocation cannot be larger than the cluster setting"
>   + " for queue " + queue
>   + " max allocation per queue: " + result
>   + " cluster setting: " + clusterMax);
> }
> {noformat}
> Let me know if it makes sense.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues

2019-01-08 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737392#comment-16737392
 ] 

Thomas Graves commented on YARN-9116:
-

Yes, so you want to keep the behavior that the cluster-level maximum is the 
absolute maximum and no child queue can be larger than that; otherwise it 
breaks backwards compatibility.  

> Capacity Scheduler: add the default maximum-allocation-mb and 
> maximum-allocation-vcores for the queues
> --
>
> Key: YARN-9116
> URL: https://issues.apache.org/jira/browse/YARN-9116
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9116.1.patch
>
>
> YARN-1582 adds the support of maximum-allocation-mb configuration per queue 
> which is targeting to support larger container features on dedicated queues 
> (larger maximum-allocation-mb/maximum-allocation-vcores for such queue) . 
> While to achieve larger container configuration, we need to increase the 
> global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and 
> then override those configurations with desired values on the queues since 
> queue configuration can't be larger than cluster configuration. There are 
> many queues in the system and if we forget to configure such values when 
> adding a new queue, then such queue gets default 120G/256 which typically is 
> not what we want.  
> We can come up with a queue-default configuration (set to normal queue 
> configuration like 16G/8), so the leaf queues gets such values by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108697#comment-15108697
 ] 

Thomas Graves commented on YARN-4610:
-

+1.  Thanks for fixing this. 

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109330#comment-15109330
 ] 

Thomas Graves commented on YARN-4610:
-

Ok thanks for investigating.  +1 from me feel free to commit.

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109516#comment-15109516
 ] 

Thomas Graves commented on YARN-4610:
-

Sorry, after looking some more I think there might be an issue with this for 
parent queue max capacities. I'm looking into it further.

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610.001.patch, YARN-4610.branch-2.7.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110778#comment-15110778
 ] 

Thomas Graves commented on YARN-4610:
-

+1 for branch-2.7.  After investigating this some more, the original patch of 
setting it to none() works. The reason is that the parent's limit is passed and 
it would be taken into account in the leaf calculation.  I think the latter 
patch is safer, but either is fine with me.

I'm not sure how the master patch is taking the max capacity into account, so 
I'll have to look at that more, but the unit tests are passing and that would 
be a separate issue from this fix.  +1 on that patch as well.

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610-branch-2.7.002.patch, YARN-4610.001.patch, 
> YARN-4610.branch-2.7.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4641) CapacityScheduler Active Users Info table should be sortable

2016-01-26 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-4641:
---

 Summary: CapacityScheduler Active Users Info table should be 
sortable
 Key: YARN-4641
 URL: https://issues.apache.org/jira/browse/YARN-4641
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Affects Versions: 2.7.1
Reporter: Thomas Graves


The scheduler page when using the CapacityScheduler lets you see the Active 
Users Info table.  If you have lots of users this is a big table, and if you 
want to see who is using the most resources it would be nice to have it 
sortable, or to show the % used like it used to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-5010) maxActiveApplications and maxActiveApplicationsPerUser are missing from REST API

2016-04-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262914#comment-15262914
 ] 

Thomas Graves commented on YARN-5010:
-

We shouldn't just remove them, as it's an API compatibility issue.  I would say 
they should be added back and the definition updated, or we should rev the REST 
API version.

> maxActiveApplications and maxActiveApplicationsPerUser are missing from REST 
> API
> 
>
> Key: YARN-5010
> URL: https://issues.apache.org/jira/browse/YARN-5010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: Jason Lowe
>
> The RM used to report maxActiveApplications and maxActiveApplicationsPerUser 
> in the REST API for a queue, but these are missing in 2.7.0.  It appears 
> YARN-2637 replaced them with aMResourceLimit and userAMResourceLimit, 
> respectively, which broke some internal tools that were expecting the max app 
> fields to still be there.  We should at least update the REST docs to reflect 
> that change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-05-07 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434-branch2.7.patch

Attaching a patch for branch-2.7.

[~leftnoteasy] could you take a look when you have a chance?

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 2.8.0
>
> Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3600) AM container link is broken (on a killed application, at least)

2015-05-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3600:

Labels:   (was: BB2015-05-RFC)

> AM container link is broken (on a killed application, at least)
> ---
>
> Key: YARN-3600
> URL: https://issues.apache.org/jira/browse/YARN-3600
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Sergey Shelukhin
>Assignee: Naganarasimha G R
> Attachments: YARN-3600.20150508-1.patch
>
>
> Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. 
> I have an application that ran fine for a while and then I yarn kill-ed it. 
> Now when I go to the only app attempt URL (like so: http://(snip RM host 
> name):8088/cluster/appattempt/appattempt_1429683757595_0795_01)
> I see:
> AM Container: container_1429683757595_0795_01_01
> Node: N/A 
> and the container link goes to {noformat}http://(snip RM host 
> name):8088/cluster/N/A
> {noformat}
> which obviously doesn't work



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534621#comment-14534621
 ] 

Thomas Graves commented on YARN-3600:
-

Reviewing and kicking Jenkins.

> AM container link is broken (on a killed application, at least)
> ---
>
> Key: YARN-3600
> URL: https://issues.apache.org/jira/browse/YARN-3600
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Sergey Shelukhin
>Assignee: Naganarasimha G R
> Attachments: YARN-3600.20150508-1.patch
>
>
> Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. 
> I have an application that ran fine for a while and then I yarn kill-ed it. 
> Now when I go to the only app attempt URL (like so: http://(snip RM host 
> name):8088/cluster/appattempt/appattempt_1429683757595_0795_01)
> I see:
> AM Container: container_1429683757595_0795_01_01
> Node: N/A 
> and the container link goes to {noformat}http://(snip RM host 
> name):8088/cluster/N/A
> {noformat}
> which obviously doesn't work



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534825#comment-14534825
 ] 

Thomas Graves commented on YARN-3600:
-

So the change does fix the broken link issue, but it seems to me other things 
are broken with this page.  Obviously if the application ran for a while it got 
an AM and therefore should have a valid container.  But I guess that link only 
works if it's actually running?

The container table below that also confused me a bit.  I thought at first it 
was a list of AM containers, but after playing with it, it's really a list of 
running containers.  I think we should add a heading for that.  I filed a 
separate JIRA for those things.

Anyway, +1.  Thanks!



> AM container link is broken (on a killed application, at least)
> ---
>
> Key: YARN-3600
> URL: https://issues.apache.org/jira/browse/YARN-3600
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Sergey Shelukhin
>Assignee: Naganarasimha G R
> Attachments: YARN-3600.20150508-1.patch
>
>
> Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. 
> I have an application that ran fine for a while and then I yarn kill-ed it. 
> Now when I go to the only app attempt URL (like so: http://(snip RM host 
> name):8088/cluster/appattempt/appattempt_1429683757595_0795_01)
> I see:
> AM Container: container_1429683757595_0795_01_01
> Node: N/A 
> and the container link goes to {noformat}http://(snip RM host 
> name):8088/cluster/N/A
> {noformat}
> which obviously doesn't work



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3603) Application Attempts page confusing

2015-05-08 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-3603:
---

 Summary: Application Attempts page confusing
 Key: YARN-3603
 URL: https://issues.apache.org/jira/browse/YARN-3603
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.8.0
Reporter: Thomas Graves


The application attempts page 
(http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)
is a bit confusing about what is going on.  I think the table of containers 
there is only for running containers, and when the app is completed or killed 
it's empty.  The table should have a label on it stating so.  

Also, the "AM Container" field is a link when the app is running but not when 
it's killed.  That might be confusing.

There is no link to the logs on this page, but there is in the app attempt 
table when looking at 
http://rm:8088/cluster/app/application_1431101480046_0003



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-20) More information for "yarn.resourcemanager.webapp.address" in yarn-default.xml

2015-05-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-20:
--
Labels: newbie  (was: BB2015-05-RFC newbie)

> More information for "yarn.resourcemanager.webapp.address" in yarn-default.xml
> --
>
> Key: YARN-20
> URL: https://issues.apache.org/jira/browse/YARN-20
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation, resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Nemon Lou
>Assignee: Bartosz Ługowski
>Priority: Trivial
>  Labels: newbie
> Attachments: YARN-20.1.patch, YARN-20.2.patch, YARN-20.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
>   The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is 
> in "host:port" format, which is noted in the cluster setup guide 
> (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html).
>   When I read through the code, I found that the "host" format is also 
> supported. In the "host" format, the port will be random.
>   So we may want to add more documentation in yarn-default.xml to make this 
> easier to understand.
>   I will submit a patch if it's helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-20) More information for "yarn.resourcemanager.webapp.address" in yarn-default.xml

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534967#comment-14534967
 ] 

Thomas Graves commented on YARN-20:
---

+1.  Thanks!

> More information for "yarn.resourcemanager.webapp.address" in yarn-default.xml
> --
>
> Key: YARN-20
> URL: https://issues.apache.org/jira/browse/YARN-20
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation, resourcemanager
>Affects Versions: 2.0.0-alpha
>Reporter: Nemon Lou
>Assignee: Bartosz Ługowski
>Priority: Trivial
>  Labels: newbie
> Attachments: YARN-20.1.patch, YARN-20.2.patch, YARN-20.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
>   The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is 
> in "host:port" format, which is noted in the cluster setup guide 
> (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html).
>   When I read through the code, I found that the "host" format is also 
> supported. In the "host" format, the port will be random.
>   So we may want to add more documentation in yarn-default.xml to make this 
> easier to understand.
>   I will submit a patch if it's helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3603) Application Attempts page confusing

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535013#comment-14535013
 ] 

Thomas Graves commented on YARN-3603:
-

go for it.  Thanks!

> Application Attempts page confusing
> ---
>
> Key: YARN-3603
> URL: https://issues.apache.org/jira/browse/YARN-3603
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.8.0
>Reporter: Thomas Graves
>Assignee: Sunil G
>
> The application attempts page 
> (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)
> is a bit confusing about what is going on.  I think the table of containers 
> there is only for running containers, and when the app is completed or killed 
> it's empty.  The table should have a label on it stating so.  
> Also, the "AM Container" field is a link when the app is running but not when 
> it's killed.  That might be confusing.
> There is no link to the logs on this page, but there is in the app attempt 
> table when looking at 
> http://rm:8088/cluster/app/application_1431101480046_0003



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535397#comment-14535397
 ] 

Thomas Graves commented on YARN-3434:
-

I'm not sure Jenkins will work on this since this is for branch-2.7, unless 
we've hooked it up to run for specific branches other than trunk.  The patch 
won't apply on trunk.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 2.8.0
>
> Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-172) AM logs link in RM ui redirects back to RM if AM not started

2015-05-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved YARN-172.

Resolution: Invalid

confirmed. 

> AM logs link in RM ui redirects back to RM if AM not started
> 
>
> Key: YARN-172
> URL: https://issues.apache.org/jira/browse/YARN-172
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 0.23.3
>Reporter: Thomas Graves
>  Labels: usability
>
> I went to the RM UI app page for an application that failed to start with the 
> error:  org.apache.hadoop.security.AccessControlException: User user cannot 
> submit applications to queue root.foo 
> I tried to click on the AM logs link and it just redirected me back to the RM 
> page.  if the AM didn't start we shouldn't show an attempt there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535500#comment-14535500
 ] 

Thomas Graves commented on YARN-3600:
-

Hey [~sseth],

 Can you check to make sure YARN-3603 covers what you also think it should do?  
I added the bit about the logs in there, but I may have forgotten about 
something else.

> AM container link is broken (on a killed application, at least)
> ---
>
> Key: YARN-3600
> URL: https://issues.apache.org/jira/browse/YARN-3600
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Sergey Shelukhin
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-3600.20150508-1.patch
>
>
> Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. 
> I have an application that ran fine for a while and then I yarn kill-ed it. 
> Now when I go to the only app attempt URL (like so: http://(snip RM host 
> name):8088/cluster/appattempt/appattempt_1429683757595_0795_01)
> I see:
> AM Container: container_1429683757595_0795_01_01
> Node: N/A 
> and the container link goes to {noformat}http://(snip RM host 
> name):8088/cluster/N/A
> {noformat}
> which obviously doesn't work



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-05-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538252#comment-14538252
 ] 

Thomas Graves commented on YARN-3434:
-

What's your question exactly?  For branch patches, Jenkins has never been 
hooked up. We generally download the patch, build it, possibly run the tests 
that apply, and commit.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 2.8.0
>
> Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4045) Negative avaialbleMB is being reported for root queue.

2015-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682115#comment-14682115
 ] 

Thomas Graves commented on YARN-4045:
-

I remember seeing that this was fixed in branch-2 by some of the capacity 
scheduler work for labels.

I thought this might be fixed by 
https://issues.apache.org/jira/browse/YARN-3243, but that is already included.  

This might be fixed as part of https://issues.apache.org/jira/browse/YARN-3361, 
which is probably too big to backport entirely.

[~leftnoteasy]  Do you remember this issue?

Note that it also shows up in the capacity scheduler UI as the root queue going 
over 100%.  I remember when I was testing YARN-3434 it wasn't occurring for me 
on branch-2 (2.8), and I thought it was one of the above JIRAs that fixed it.

> Negative avaialbleMB is being reported for root queue.
> --
>
> Key: YARN-4045
> URL: https://issues.apache.org/jira/browse/YARN-4045
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Rushabh S Shah
>
> We recently deployed 2.7 in one of our cluster.
> We are seeing negative availableMB being reported for queue=root.
> This is from the jmx output:
> {noformat}
> 
> ...
> -163328
> ...
> 
> {noformat}
> The following is the RM log:
> {noformat}
> 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,486 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
> absoluteUsedCapacity=1.0029854 used= 
> cluster=
> 2015-08-10 14:42:44,487 [ResourceManager Event Processor] INFO 
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
> absoluteUsedCapacity=1.0032743 used= 
> cluster=
> 2015-08-10 14:42:44,886 [

[jira] [Commented] (YARN-656) In scheduler UI, including reserved memory in "Memory Total" can make it exceed cluster capacity.

2015-04-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391472#comment-14391472
 ] 

Thomas Graves commented on YARN-656:


Note this broke the UI, at least for the capacity scheduler.

It now displays a total that is lacking the reserved memory.  Perhaps this is a 
difference in how the fair scheduler and the capacity scheduler keep track of 
allocated versus reserved resources.

> In scheduler UI, including reserved memory in "Memory Total" can make it 
> exceed cluster capacity.
> -
>
> Key: YARN-656
> URL: https://issues.apache.org/jira/browse/YARN-656
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Fix For: 2.1.0-beta
>
> Attachments: YARN-656-1.patch, YARN-656.patch
>
>
> "Memory Total" is currently a sum of availableMB, allocatedMB, and 
> reservedMB.  Including reservedMB in this sum can make the total exceed the 
> capacity of the cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-04-01 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-3432:
---

 Summary: Cluster metrics have wrong Total Memory when there is 
reserved memory on CS
 Key: YARN-3432
 URL: https://issues.apache.org/jira/browse/YARN-3432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.0
Reporter: Thomas Graves


I noticed that when reservations happen while using the CapacityScheduler, the 
UI and web services report the wrong total memory.

For example: I have 300GB of total memory in my cluster.  I allocate 50GB and 
reserve 10GB.  The cluster metrics for total memory get reported as 290GB.

This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps 
there is a difference between the fair scheduler and the capacity scheduler.
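
A quick arithmetic model of that example (field names are illustrative, not 
the actual ClusterMetrics members):

{noformat}
// After YARN-656 the reported total is available + allocated only, so on the
// CapacityScheduler (where, per the comment on YARN-656 above, reserved space
// is apparently not counted as available) the 10GB reservation drops out of
// the total.
public class TotalMemorySketch {
  public static void main(String[] args) {
    long clusterGb = 300, allocatedGb = 50, reservedGb = 10;
    long availableGb = clusterGb - allocatedGb - reservedGb;       // 240
    long reportedTotalGb = availableGb + allocatedGb;              // 290 (observed)
    long expectedTotalGb = availableGb + allocatedGb + reservedGb; // 300
    System.out.println("reported=" + reportedTotalGb
        + "GB expected=" + expectedTotalGb + "GB");
  }
}
{noformat}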



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-04-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392687#comment-14392687
 ] 

Thomas Graves commented on YARN-3432:
-

That will fix it for the capacity scheduler; we need to see if it breaks the 
FairScheduler, though.



> Cluster metrics have wrong Total Memory when there is reserved memory on CS
> ---
>
> Key: YARN-3432
> URL: https://issues.apache.org/jira/browse/YARN-3432
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Brahma Reddy Battula
>
> I noticed that when reservations happen when using the Capacity Scheduler, 
> the UI and web services report the wrong total memory.
> For example.  I have a 300GB of total memory in my cluster.  I allocate 50 
> and I reserve 10.  The cluster metrics for total memory get reported as 290GB.
> This was broken by https://issues.apache.org/jira/browse/YARN-656 so perhaps 
> there is a difference between fair scheduler and capacity scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-02 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-3434:
---

 Summary: Interaction between reservations and userlimit can result 
in significant ULF violation
 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves


ULF was set to 1.0
User was able to consume 1.4X queue capacity.
It looks like when this application launched, it reserved about 1000 
containers, each 8G each, within about 5 seconds. I think this allowed the 
logic in assignToUser() to allow the userlimit to be surpassed.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392751#comment-14392751
 ] 

Thomas Graves commented on YARN-3434:
-

The issue here is that if we allow the user to continue past the user limit 
checks in assignContainers because they have reservations, then when it gets 
down into the assignContainer routine and it's allowed to get a container and 
the node has space, we don't double-check the user limit in this case.  We 
recheck in all other cases, but this one is missed.  
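
A deliberately simplified, self-contained model of that flow (not the actual 
CapacityScheduler code or this patch; the numbers are made up) to show where 
the missing second check sits:

{noformat}
// Simplified model: the first user-limit check subtracts reservations, so it
// passes; the allocation step then proceeds without re-checking the limit,
// which is the gap described above. The "recheck" flag marks where the fix
// would force an unreserve (findNodeToUnreserve) instead of allocating.
public class UserLimitRecheckSketch {
  static long userLimitGb = 90;     // illustrative user limit
  static long userUsedGb = 95;      // includes reserved resources
  static long userReservedGb = 20;  // currently reserved by this user

  static boolean canAssignToUser(long askGb) {
    // continue-looking path: reservations subtracted before the check
    return (userUsedGb - userReservedGb) + askGb <= userLimitGb;
  }

  static boolean assignContainer(long askGb, boolean recheckUserLimit) {
    if (recheckUserLimit && userUsedGb + askGb > userLimitGb) {
      return false; // would need to unreserve first
    }
    userUsedGb += askGb; // allocate
    return true;
  }

  public static void main(String[] args) {
    long askGb = 8;
    if (canAssignToUser(askGb)) {
      boolean allocated = assignContainer(askGb, false); // no recheck: the bug
      System.out.println("allocated=" + allocated + " used=" + userUsedGb
          + " limit=" + userLimitGb); // usage ends up over the limit
    }
  }
}
{noformat}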

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485798#comment-14485798
 ] 

Thomas Graves commented on YARN-3434:
-

[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it 
doesn't solve the user limit side of it.  The user limit check is never done 
again.  I'll have a patch up for this shortly; I would appreciate it if you 
could take a look and give me feedback.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485798#comment-14485798
 ] 

Thomas Graves edited comment on YARN-3434 at 4/8/15 6:59 PM:
-

[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it 
doesn't solve the user limit side of it.   The user limit check is never done 
again in assignContainer() if it skipped the checks in assignContainers() based 
on reservations but then is allowed to shouldAllocOrReserveNewContainer.  I'll 
have a patch up for this shortly I would appreciate it if you could take a look 
and give me feedback.


was (Author: tgraves):
[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it 
doesn't solve the user limit side of it.   The user limit check is never done 
again.  I'll have a patch up for this shortly I would appreciate it if you 
could take a look and give me feedback.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485834#comment-14485834
 ] 

Thomas Graves commented on YARN-3434:
-

Note I had a reproducible test case for this.  Set user-limit% to 100% and the 
user limit factor to 1.  15 nodes, 20GB each.  One queue configured for 
capacity 70, the second queue configured for capacity 30.
I started a sleep job needing 10 12GB containers in the first queue.  I then 
started a second job in the second queue that needed 25 12GB containers; the 
second job got some containers but then had to reserve others while waiting for 
the first job to release some.   

Without this change, when the first job started releasing containers the second 
job would grab them and go over the user limit.  With this fix it stayed within 
the user limit.  
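
Rough arithmetic for that setup, assuming (as noted later in this thread) that 
the user limit works out to the queue capacity when user-limit is 100% and the 
user limit factor is 1.0:

{noformat}
// Illustrative numbers only, taken from the repro description above.
public class UserLimitReproMath {
  public static void main(String[] args) {
    long clusterGb = 15 * 20;                    // 15 nodes x 20GB = 300GB
    double queue2Gb = clusterGb * 0.30;          // 2nd queue capacity = 90GB
    long containerGb = 12;
    long limitContainers = (long) (queue2Gb / containerGb); // ~7 containers
    long askGb = 25 * containerGb;               // 2nd job asks for 300GB
    System.out.println("user limit ~" + queue2Gb + "GB (~" + limitContainers
        + " x " + containerGb + "GB); job asks for " + askGb + "GB");
    // Without the fix, the 2nd job kept grabbing released containers well past
    // the ~90GB limit; with it, it stayed within the limit.
  }
}
{noformat}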

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488011#comment-14488011
 ] 

Thomas Graves commented on YARN-3434:
-

The code you mention is in the else part of that check, where it would do a 
reservation.  The situation I'm talking about actually allocates a container, 
not reserves one.  I'll try to explain better:

The application asks for lots of containers. It acquires some containers, then 
it reserves some. At this point it hits its normal user limit, which in my 
example = capacity.  It hasn't hit the maximum amount it can allocate or 
reserve (shouldAllocOrReserveNewContainer()).  The next node that heartbeats in 
isn't yet reserved and has enough space to place a container on.  It is first 
checked in assignContainers -> canAssignToThisQueue.  That passes since we 
haven't hit max capacity. Then it checks assignContainers -> canAssignToUser. 
That passes, but only because used - reserved < the user limit.  This allows it 
to continue down into assignContainer.  In assignContainer the node has 
available space and we haven't hit shouldAllocOrReserveNewContainer(). 
reservationsContinueLooking is on and labels are empty, so it does the check:

{noformat}
if (!shouldAllocOrReserveNewContainer
    || Resources.greaterThan(resourceCalculator, clusterResource,
        minimumUnreservedResource, Resources.none()))
{noformat}

As I said before, it's allowed to allocate or reserve, so it passes that test.  
Then it hasn't met its maximum capacity yet (capacity = 30% and max capacity = 
100%), so minimumUnreservedResource is none and that check doesn't kick in, so 
it doesn't go into the block that calls findNodeToUnreserve().  Then it goes 
ahead and allocates when it should have had to unreserve.  Basically we needed 
to also do the user limit check again and force it to do the 
findNodeToUnreserve(). 
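
Evaluating the quoted guard with the values from this walkthrough (stand-in 
booleans and numbers, not the real scheduler objects):

{noformat}
// Both disjuncts are false in the scenario above: the app may still allocate
// or reserve, and minimumUnreservedResource is none because the queue is well
// under its max capacity. The condition is therefore false, the
// findNodeToUnreserve() block is skipped, and the container is allocated;
// hence the need for the extra user-limit check described above.
public class ContinueLookingGuardSketch {
  public static void main(String[] args) {
    boolean shouldAllocOrReserveNewContainer = true; // app not at its cap
    long minimumUnreservedResourceMb = 0;            // capacity 30%, max 100%

    boolean mustUnreserveFirst = !shouldAllocOrReserveNewContainer
        || minimumUnreservedResourceMb > 0;          // the quoted check
    System.out.println("mustUnreserveFirst=" + mustUnreserveFirst); // false
  }
}
{noformat}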




> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487416#comment-14487416
 ] 

Thomas Graves commented on YARN-3434:
-

[~wangda]  I'm not sure I follow what you are saying.  The reservations are 
already counted in the user's usage, and we do consider reserved resources when 
doing the user limit calculations.  Look at LeafQueue.assignContainers: the 
call to allocateResource is where it ends up adding to the user's usage.  The 
canAssignToUser is where it does the user limit check and subtracts the 
reservations off to see if it can continue.  

Note I do think we should just get rid of the config for 
reservationsContinueLooking, but that is a separate issue.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488061#comment-14488061
 ] 

Thomas Graves commented on YARN-3434:
-

{quote}
And I've a question about continous reservation checking behavior, may or may 
not related to this issue: Now it will try to unreserve all containers under a 
user, but actually it will only unreserve at most one container to allocate a 
new container. Do you think is it fine to change the logic to be:
When (continousReservation-enabled) && (user.usage + required - 
min(max-allocation, user.total-reserved) <=user.limit), assignContainers will 
continue. This will prevent doing impossible allocation when user reserved lots 
of containers. (As same as queue reservation checking).
{quote}

I do think the reservation checking and unreserving can be improved.  I 
basically started with a very simple approach and figured we could improve it.  
I'm not sure how much that check would help in practice.  I guess it might help 
the case where you have one user in the queue and a second one shows up and 
your user limit gets decreased by a lot.  In that case it may prevent it from 
continuing when it can short-circuit here.  So it would seem to be OK for that. 


> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496239#comment-14496239
 ] 

Thomas Graves commented on YARN-3434:
-

So I had considered putting it in ResourceLimits, but ResourceLimits seems to be 
more of a queue-level thing to me (not a user-level one). For instance, 
ParentQueue passes this into LeafQueue, and ParentQueue cares nothing about user 
limits.  If you stored it there you would either need to track which user it was 
for or track it for all users. ResourceLimits gets updated when nodes are added 
and removed, and we don't need to compute a particular user limit when that 
happens.  So it would either be out of date, or we would have to change it to be 
updated when that happens, which to me is a fairly large change and not really 
needed.

The user limit calculations are lower down and are recomputed regularly per 
user, per application, and per current request, so putting this into the global 
object, given how it is calculated and used, didn't make sense to me. All you 
would be using it for is passing it down to assignContainer, and then it would 
be out of date.  If someone else started looking at that value assuming it was 
up to date then it would be wrong (unless of course we started updating it as 
stated above).  And it would only be for a single user, not all users, unless 
again we changed it to calculate for every user whenever something changed. That 
seems a bit excessive.

You are correct that needToUnreserve could go away.  I started out on 2.6, which 
didn't have our changes, and I could have removed it when I added in 
amountNeededUnreserve.  If we were to store it in the global ResourceLimits then 
yes, the entire LimitsInfo class could go away, including shouldContinue, as you 
would fall back to using the boolean return from each function.   But again, 
based on my above comments, I'm not sure ResourceLimits is the correct place to 
put this.

I just noticed that we are already keeping the userLimit in the User class; that 
would be another option.  But again I think we need to make it clear what it is. 
This particular check is done per application, per user, based on the currently 
requested Resource, so the stored value wouldn't necessarily apply to all of the 
user's applications since the resource request size could be different.  

Thoughts, or is there something I'm missing about ResourceLimits?
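
For context, this is roughly the shape of the local approach I'm describing (the 
field names come from the patch; everything else is illustrative):

{code:java}
// Rough shape of the local, transient LimitsInfo approach described above.
// Field names match the patch; the surrounding structure is illustrative only
// (Resource is org.apache.hadoop.yarn.api.records.Resource).
private static class LimitsInfo {
  boolean shouldContinue;           // can assignContainers proceed for this user/request?
  Resource amountNeededUnreserve;   // how much must be unreserved to stay within the limit
}

// Computed in canAssignToUser(...) per user / per application / per current request,
// then passed straight down to assignContainer(...) instead of being stored globally
// in ResourceLimits, where it would quickly go stale.
{code}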

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496735#comment-14496735
 ] 

Thomas Graves commented on YARN-3434:
-

I am not saying the child needs to know how the parent calculates the resource 
limit.  I am saying the user limit, and whether it needs to unreserve to make 
another reservation, has nothing to do with the parent queue (i.e. it doesn't 
apply to the parent queue).  Remember, I don't need to store the user limit; I 
need to store whether it needs to unreserve and, if it does, how much it needs 
to unreserve.

When a node heartbeats, it goes through the regular assignments and updates the 
leafQueue clusterResources based on what the parent passes in. When a node is 
removed or added, it updates the resource limits (none of which applies to the 
calculation of whether it needs to unreserve or not). 

Basically it comes down to: is this information useful outside of the small 
window between when it is calculated and when it's needed in assignContainer()?  
My thought is no, and you said it yourself in the last bullet above.  Although 
we have been referring to the userLimit, and perhaps that is the problem: I 
don't need to store the userLimit, I need to store whether it needs to unreserve 
and if so how much.  Therefore it fits better as a local transient variable 
rather than a globally stored one.  If you store just the userLimit then you 
need to recalculate things, which I'm trying to avoid.

I understand why we are storing the current information in ResourceLimits, 
because it has to do with headroom and parent limits and is recalculated at 
various points, but the current implementation in canAssignToUser doesn't use 
headroom at all, and whether we need to unreserve or not on the last call to 
assignContainers doesn't affect the headroom calculation.

Again, basically all we would be doing is placing an extra global variable or 
two in the ResourceLimits class just to pass it down a couple of functions. That 
to me is a parameter.   Now if we had multiple things needing this or updating 
it, then to me it would fit better in ResourceLimits.  



> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497055#comment-14497055
 ] 

Thomas Graves commented on YARN-3434:
-

I agree with the "Both" section.  I'm not sure I completely follow the "Only" 
section. Are you suggesting we change the patch to modify ResourceLimits and 
pass it down rather than using the LimitsInfo class?  If so that won't work, at 
least not without adding the shouldContinue flag to it.  Unless you mean keep 
the LimitsInfo class for use locally in assignContainers and then pass 
ResourceLimits down to assignContainer with the value of amountNeededUnreserve 
as the limit.  That wouldn't really change much except the object we pass down 
through the functions. 

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497076#comment-14497076
 ] 

Thomas Graves commented on YARN-3434:
-

So you are saying add amountNeededUnreserve to ResourceLimits and then set the 
global currentResourceLimits.amountNeededUnreserve inside of canAssignToUser?  
This is what I was not in favor of above, and there would be no need to pass it 
down as a parameter.

Or were you saying create a ResourceLimits instance, pass it as a parameter to 
canAssignToUser and canAssignToThisQueue, and modify that instance?  That 
instance would then be passed down through to assignContainer()?

I don't see how else you would set the ResourceLimits.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499803#comment-14499803
 ] 

Thomas Graves commented on YARN-3434:
-

Ok, I'll make the changes and post an updated patch

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Updated patch with review comments.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Upmerged to latest

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period

2015-04-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503760#comment-14503760
 ] 

Thomas Graves commented on YARN-3294:
-

[~xgong] [~vvasudev]  I saw this show up in the UI on branch-2.  I don't see 
any permission checks on this, or am I perhaps missing them?  We don't want 
arbitrary users to be able to change the log level on the RM.  They could slow 
it down and cause disks to fill up.

I also don't see an option to disable this, is there one?  If not, I think we 
want one.   

Honestly, I don't really see a need for this button at all since you can change 
the level in the logLevel app.  But since it's in, we at least need to protect 
it and, in my opinion, disable it for normal users.

> Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time 
> period
> -
>
> Key: YARN-3294
> URL: https://issues.apache.org/jira/browse/YARN-3294
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Fix For: 2.8.0
>
> Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, 
> apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, 
> apache-yarn-3294.3.patch, apache-yarn-3294.4.patch
>
>
> It would be nice to have a button on the web UI that would allow dumping of 
> debug logs for just the capacity scheduler for a fixed period of time(1 min, 
> 5 min or so) in a separate log file. It would be useful when debugging 
> scheduler behavior without affecting the rest of the resourcemanager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504954#comment-14504954
 ] 

Thomas Graves commented on YARN-3517:
-

Thanks for following up on this.  Could you also change it to not show the 
button if you aren't an admin?  I don't want to confuse users by having a 
button there that doesn't do anything.

One other thing: could you add some CSS or something to make it look more like 
a button?  Right now it just looks like text and I didn't know it was clickable 
at first.   The placement of it seems a bit weird to me also, but as long as 
it's only showing up for admins that is less of an issue.

I haven't looked at the patch in detail, but I see we are creating a new 
AdminACLsManager each time. It would be nice if we didn't have to do that.

> RM web ui for dumping scheduler logs should be for admins only
> --
>
> Key: YARN-3517
> URL: https://issues.apache.org/jira/browse/YARN-3517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Affects Versions: 2.7.0
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>  Labels: security
> Attachments: YARN-3517.001.patch
>
>
> YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
> for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-21 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

updated based on review comments

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
> YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Upmerged patch to latest 

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Fixed the line length and whitespace style issues.  Other than that I moved 
things around and it's just complaining about the same things more.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Attaching the exact same patch to kick Jenkins again.

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
> YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509152#comment-14509152
 ] 

Thomas Graves commented on YARN-3517:
-


+  // non-secure mode with no acls enabled
+  if (!isAdmin && !UserGroupInformation.isSecurityEnabled()
+  && !adminACLsManager.areACLsEnabled()) {
+isAdmin = true;
+  }
+

We don't need the isSecurityEnabled check, just keep the one for 
areACLsEnabled. This could be combined with the previous if by making this the 
else-if part, but that isn't a big deal.

In QueuesBlock we are creating the AdminACLsManager on every web page load.   
Perhaps a better way would be to use this.rm.getApplicationACLsManager() and 
extend the ApplicationACLsManager to expose isAdmin functionality.
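
Something along these lines is what I mean by folding it into the previous 
check (a sketch only, using the names from the patch; the first branch is 
assumed from the surrounding code and may not match it exactly):

{code:java}
// Sketch of combining the two checks into one if / else-if.
if (callerUGI != null && adminACLsManager.isAdmin(callerUGI)) {
  isAdmin = true;
} else if (!adminACLsManager.areACLsEnabled()) {
  // non-secure mode with no ACLs enabled: everyone is effectively an admin
  isAdmin = true;
}
{code}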

> RM web ui for dumping scheduler logs should be for admins only
> --
>
> Key: YARN-3517
> URL: https://issues.apache.org/jira/browse/YARN-3517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Affects Versions: 2.7.0
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
>  Labels: security
> Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
> YARN-3517.003.patch
>
>
> YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
> for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned YARN-3517:
---

Assignee: Thomas Graves  (was: Varun Vasudev)

> RM web ui for dumping scheduler logs should be for admins only
> --
>
> Key: YARN-3517
> URL: https://issues.apache.org/jira/browse/YARN-3517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: Varun Vasudev
>Assignee: Thomas Graves
>Priority: Blocker
>  Labels: security
> Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
> YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch
>
>
> YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
> for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518054#comment-14518054
 ] 

Thomas Graves commented on YARN-3517:
-

In RMWebServices.java we don't need the isSecurityEnabled check.  Just remove 
the entire check.  My reasoning is that the logLevel app does not do those 
checks; it simply makes sure you are an admin.

+if (UserGroupInformation.isSecurityEnabled() && callerUGI == null) {
+  String msg = "Unable to obtain user name, user not authenticated";
+  throw new AuthorizationException(msg);
+}

In the test TestRMWebServices.java we aren't actually asserting anything.  We 
should assert that the expected files exist.  Personally I would also like to 
see an assert that the expected exception occurred.
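
Roughly the kind of assertion I have in mind (a sketch only; the dump location 
and file-name pattern are assumptions for illustration, not the actual test 
code):

{code:java}
// Sketch: check that a scheduler debug log was actually dumped somewhere.
// Assumes java.io.File and org.junit.Assert; names are illustrative only.
File logDir = new File(System.getProperty("java.io.tmpdir"));
File[] dumped = logDir.listFiles(
    (dir, name) -> name.contains("yarn-capacityscheduler-debug"));
Assert.assertNotNull(dumped);
Assert.assertTrue("expected a dumped scheduler debug log", dumped.length > 0);
{code}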

> RM web ui for dumping scheduler logs should be for admins only
> --
>
> Key: YARN-3517
> URL: https://issues.apache.org/jira/browse/YARN-3517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: Varun Vasudev
>Assignee: Thomas Graves
>Priority: Blocker
>  Labels: security
> Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
> YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch
>
>
> YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
> for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520252#comment-14520252
 ] 

Thomas Graves commented on YARN-3517:
-

changes look good, +1.   thanks [~vvasudev]  

> RM web ui for dumping scheduler logs should be for admins only
> --
>
> Key: YARN-3517
> URL: https://issues.apache.org/jira/browse/YARN-3517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: Varun Vasudev
>Assignee: Thomas Graves
>Priority: Blocker
>  Labels: security
> Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
> YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, 
> YARN-3517.006.patch
>
>
> YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
> for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520355#comment-14520355
 ] 

Thomas Graves commented on YARN-3517:
-

thanks [~vinodkv] I missed that.

> RM web ui for dumping scheduler logs should be for admins only
> --
>
> Key: YARN-3517
> URL: https://issues.apache.org/jira/browse/YARN-3517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, security
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
>  Labels: security
> Fix For: 2.8.0
>
> Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
> YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, 
> YARN-3517.006.patch
>
>
> YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
> for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521580#comment-14521580
 ] 

Thomas Graves commented on YARN-3243:
-

[~leftnoteasy] Can we pull this back into branch-2.7?  

> CapacityScheduler should pass headroom from parent to children to make sure 
> ParentQueue obey its capacity limits.
> -
>
> Key: YARN-3243
> URL: https://issues.apache.org/jira/browse/YARN-3243
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0
>
> Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
> YARN-3243.4.patch, YARN-3243.5.patch
>
>
> Now CapacityScheduler has some issues to make sure ParentQueue always obeys 
> its capacity limits, for example:
> 1) When allocating container of a parent queue, it will only check 
> parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size 
> > (parentQueue.max - parentQueue.usage), parent queue can excess its max 
> resource limit, as following example:
> {code}
> A  (usage=54, max=55)
>/ \
>   A1 A2 (usage=1, max=55)
> (usage=53, max=53)
> {code}
> Queue-A2 is able to allocate container since its usage < max, but if we do 
> that, A's usage can excess A.max.
> 2) When doing continous reservation check, parent queue will only tell 
> children "you need unreserve *some* resource, so that I will less than my 
> maximum resource", but it will not tell how many resource need to be 
> unreserved. This may lead to parent queue excesses configured maximum 
> capacity as well.
> With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, 
> *here is my proposal*:
> - ParentQueue will set its children's ResourceUsage.headroom, which means, 
> *maximum resource its children can allocate*.
> - ParentQueue will set its children's headroom to be (saying parent's name is 
> "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
> ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
> parent).
> - {{needToUnReserve}} is not necessary, instead, children can get how much 
> resource need to be unreserved to keep its parent's resource limit.
> - More over, with this, YARN-3026 will make a clear boundary between 
> LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522021#comment-14522021
 ] 

Thomas Graves commented on YARN-3243:
-

I was wanting to pull YARN-3434 back into 2.7.  It kind of depends on this one; 
at least I think it would merge cleanly if this one were there. 
This is also fixing a bug which I would like to see fixed in the 2.7 line if we 
are going to use it.  It's not a blocker since it exists in our 2.6, but it 
would be nice to have.  If we decide it's too big then I'll just port YARN-3434 
back without it.

> CapacityScheduler should pass headroom from parent to children to make sure 
> ParentQueue obey its capacity limits.
> -
>
> Key: YARN-3243
> URL: https://issues.apache.org/jira/browse/YARN-3243
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0
>
> Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
> YARN-3243.4.patch, YARN-3243.5.patch
>
>
> Now CapacityScheduler has some issues to make sure ParentQueue always obeys 
> its capacity limits, for example:
> 1) When allocating container of a parent queue, it will only check 
> parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size 
> > (parentQueue.max - parentQueue.usage), parent queue can excess its max 
> resource limit, as following example:
> {code}
> A  (usage=54, max=55)
>/ \
>   A1 A2 (usage=1, max=55)
> (usage=53, max=53)
> {code}
> Queue-A2 is able to allocate container since its usage < max, but if we do 
> that, A's usage can excess A.max.
> 2) When doing continous reservation check, parent queue will only tell 
> children "you need unreserve *some* resource, so that I will less than my 
> maximum resource", but it will not tell how many resource need to be 
> unreserved. This may lead to parent queue excesses configured maximum 
> capacity as well.
> With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, 
> *here is my proposal*:
> - ParentQueue will set its children's ResourceUsage.headroom, which means, 
> *maximum resource its children can allocate*.
> - ParentQueue will set its children's headroom to be (saying parent's name is 
> "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
> ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
> parent).
> - {{needToUnReserve}} is not necessary, instead, children can get how much 
> resource need to be unreserved to keep its parent's resource limit.
> - More over, with this, YARN-3026 will make a clear boundary between 
> LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522066#comment-14522066
 ] 

Thomas Graves commented on YARN-3243:
-

It might not merge completely cleanly, but it isn't required for 
functionality.   It would be nice to have this in 2.7 either way though.

> CapacityScheduler should pass headroom from parent to children to make sure 
> ParentQueue obey its capacity limits.
> -
>
> Key: YARN-3243
> URL: https://issues.apache.org/jira/browse/YARN-3243
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0
>
> Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
> YARN-3243.4.patch, YARN-3243.5.patch
>
>
> Now CapacityScheduler has some issues to make sure ParentQueue always obeys 
> its capacity limits, for example:
> 1) When allocating container of a parent queue, it will only check 
> parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size 
> > (parentQueue.max - parentQueue.usage), parent queue can excess its max 
> resource limit, as following example:
> {code}
> A  (usage=54, max=55)
>/ \
>   A1 A2 (usage=1, max=55)
> (usage=53, max=53)
> {code}
> Queue-A2 is able to allocate container since its usage < max, but if we do 
> that, A's usage can excess A.max.
> 2) When doing continous reservation check, parent queue will only tell 
> children "you need unreserve *some* resource, so that I will less than my 
> maximum resource", but it will not tell how many resource need to be 
> unreserved. This may lead to parent queue excesses configured maximum 
> capacity as well.
> With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, 
> *here is my proposal*:
> - ParentQueue will set its children's ResourceUsage.headroom, which means, 
> *maximum resource its children can allocate*.
> - ParentQueue will set its children's headroom to be (saying parent's name is 
> "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
> ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
> parent).
> - {{needToUnReserve}} is not necessary, instead, children can get how much 
> resource need to be unreserved to keep its parent's resource limit.
> - More over, with this, YARN-3026 will make a clear boundary between 
> LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522066#comment-14522066
 ] 

Thomas Graves edited comment on YARN-3243 at 4/30/15 7:02 PM:
--

It might not merge completely cleanly, but it isn't required for 
functionality.   It would be nice to have this in 2.7 either way though.

I'll try it out later and see.


was (Author: tgraves):
It might not merge completely cleanly, but it isn't required for 
functionality.   It would be nice to have this in 2.7 either way though.

> CapacityScheduler should pass headroom from parent to children to make sure 
> ParentQueue obey its capacity limits.
> -
>
> Key: YARN-3243
> URL: https://issues.apache.org/jira/browse/YARN-3243
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0
>
> Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
> YARN-3243.4.patch, YARN-3243.5.patch
>
>
> Now CapacityScheduler has some issues to make sure ParentQueue always obeys 
> its capacity limits, for example:
> 1) When allocating container of a parent queue, it will only check 
> parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size 
> > (parentQueue.max - parentQueue.usage), parent queue can excess its max 
> resource limit, as following example:
> {code}
> A  (usage=54, max=55)
>/ \
>   A1 A2 (usage=1, max=55)
> (usage=53, max=53)
> {code}
> Queue-A2 is able to allocate container since its usage < max, but if we do 
> that, A's usage can excess A.max.
> 2) When doing continous reservation check, parent queue will only tell 
> children "you need unreserve *some* resource, so that I will less than my 
> maximum resource", but it will not tell how many resource need to be 
> unreserved. This may lead to parent queue excesses configured maximum 
> capacity as well.
> With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, 
> *here is my proposal*:
> - ParentQueue will set its children's ResourceUsage.headroom, which means, 
> *maximum resource its children can allocate*.
> - ParentQueue will set its children's headroom to be (saying parent's name is 
> "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
> ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
> parent).
> - {{needToUnReserve}} is not necessary, instead, children can get how much 
> resource need to be unreserved to keep its parent's resource limit.
> - More over, with this, YARN-3026 will make a clear boundary between 
> LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-05-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523187#comment-14523187
 ] 

Thomas Graves commented on YARN-3243:
-

Thanks [~leftnoteasy], I'll attempt to merge YARN-3434. If it's not clean I'll 
put up a patch for it.

> CapacityScheduler should pass headroom from parent to children to make sure 
> ParentQueue obey its capacity limits.
> -
>
> Key: YARN-3243
> URL: https://issues.apache.org/jira/browse/YARN-3243
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0
>
> Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
> YARN-3243.4.patch, YARN-3243.5.patch
>
>
> Now CapacityScheduler has some issues to make sure ParentQueue always obeys 
> its capacity limits, for example:
> 1) When allocating container of a parent queue, it will only check 
> parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size 
> > (parentQueue.max - parentQueue.usage), parent queue can excess its max 
> resource limit, as following example:
> {code}
> A  (usage=54, max=55)
>/ \
>   A1 A2 (usage=1, max=55)
> (usage=53, max=53)
> {code}
> Queue-A2 is able to allocate container since its usage < max, but if we do 
> that, A's usage can excess A.max.
> 2) When doing continous reservation check, parent queue will only tell 
> children "you need unreserve *some* resource, so that I will less than my 
> maximum resource", but it will not tell how many resource need to be 
> unreserved. This may lead to parent queue excesses configured maximum 
> capacity as well.
> With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, 
> *here is my proposal*:
> - ParentQueue will set its children's ResourceUsage.headroom, which means, 
> *maximum resource its children can allocate*.
> - ParentQueue will set its children's headroom to be (saying parent's name is 
> "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
> ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
> parent).
> - {{needToUnReserve}} is not necessary, instead, children can get how much 
> resource need to be unreserved to keep its parent's resource limit.
> - More over, with this, YARN-3026 will make a clear boundary between 
> LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-05-01 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3243:

Fix Version/s: 2.7.1

> CapacityScheduler should pass headroom from parent to children to make sure 
> ParentQueue obey its capacity limits.
> -
>
> Key: YARN-3243
> URL: https://issues.apache.org/jira/browse/YARN-3243
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0, 2.7.1
>
> Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
> YARN-3243.4.patch, YARN-3243.5.patch
>
>
> Now CapacityScheduler has some issues to make sure ParentQueue always obeys 
> its capacity limits, for example:
> 1) When allocating container of a parent queue, it will only check 
> parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size 
> > (parentQueue.max - parentQueue.usage), parent queue can excess its max 
> resource limit, as following example:
> {code}
> A  (usage=54, max=55)
>/ \
>   A1 A2 (usage=1, max=55)
> (usage=53, max=53)
> {code}
> Queue-A2 is able to allocate container since its usage < max, but if we do 
> that, A's usage can excess A.max.
> 2) When doing continous reservation check, parent queue will only tell 
> children "you need unreserve *some* resource, so that I will less than my 
> maximum resource", but it will not tell how many resource need to be 
> unreserved. This may lead to parent queue excesses configured maximum 
> capacity as well.
> With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, 
> *here is my proposal*:
> - ParentQueue will set its children's ResourceUsage.headroom, which means, 
> *maximum resource its children can allocate*.
> - ParentQueue will set its children's headroom to be (saying parent's name is 
> "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
> ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
> parent).
> - {{needToUnReserve}} is not necessary, instead, children can get how much 
> resource need to be unreserved to keep its parent's resource limit.
> - More over, with this, YARN-3026 will make a clear boundary between 
> LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1631) Container allocation issue in Leafqueue assignContainers()

2015-05-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524042#comment-14524042
 ] 

Thomas Graves commented on YARN-1631:
-

We need to be careful with this.  You could end up starving out the first 
application.  It definitely changes the current semantics.

What version of hadoop are you seeing this issue on? With my patch for 
reservations-continue-looking it should actually look at Node_2, take that one, 
and unreserve Node_1.  There is the needsContainer logic that might be 
affecting this, which I would have to look at more.

> Container allocation issue in Leafqueue assignContainers()
> --
>
> Key: YARN-1631
> URL: https://issues.apache.org/jira/browse/YARN-1631
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: SuSe 11 Linux 
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: Yarn-1631.1.patch, Yarn-1631.2.patch
>
>
> Application1 has a demand of 8GB[Map Task Size as 8GB] which is more than 
> Node_1 can handle.
> Node_1 has a size of 8GB and 2GB is used by Application1's AM.
> Hence reservation happened for remaining 6GB in Node_1 by Application1.
> A new job is submitted with 2GB AM size and 2GB task size with only 2 Maps to 
> run.
> Node_2 also has 8GB capability.
> But Application2's AM cannot be launched in Node_2. And Application2 waits 
> longer as only 2 Nodes are available in cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-12 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436295#comment-16436295
 ] 

Thomas Graves commented on YARN-8149:
-

Are you going to do anything with starvation then, or with allocating a certain 
% more than what is required? I am hesitant to remove this without doing some 
major testing.  I haven't had a chance to look at the latest code to 
investigate.

It might be fine now that we continue looking at other nodes after a 
reservation, whereas originally that didn't happen. Is in-queue preemption on 
by default?

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = re-reservation / (#reserved-container * 
>  (1 - min(requested-resource / max-alloc, 
>   max-alloc - min-alloc / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 
> 0{code}
> I think we should be able to remove the starvation computation, just to check 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-12 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436366#comment-16436366
 ] 

Thomas Graves commented on YARN-8149:
-

Thinking about this a little more: even with the current preemption on, I don't 
think preemption is smart enough to keep starvation from happening.  If 
preemption were smart enough to kill enough containers on a reserved node so 
that the big container actually gets scheduled there, that might be ok.  But 
last time I checked it doesn't do that.

Without that, or another way to prevent starvation, I wouldn't want to remove 
this.  I think adding a config would be alright, but if anyone finds it useful 
we can't remove it and it would just be an extra config.  

If we have other ideas to simplify this or make it better, great, we should 
look at them. Or if there is a way for us to get stats on whether this is 
useful, we could add those, run with them, and determine if we should remove it.
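
To make the config idea concrete, something like the sketch below is what I'm 
picturing (the property name and parameters are assumptions for illustration; 
the formula mirrors the one quoted in the description below):

{code:java}
// Sketch of gating the re-reservation/starvation heuristic behind a config flag.
boolean shouldAllocOrReserve(org.apache.hadoop.conf.Configuration conf,
    float reReservations, int reservedContainers, int requiredContainers,
    long requested, long maxAlloc, long minAlloc) {
  boolean useStarvationHeuristic = conf.getBoolean(
      "yarn.scheduler.capacity.reservations.starvation-check.enabled", true);
  if (!useStarvationHeuristic) {
    // simpler check proposed in this JIRA
    return requiredContainers > reservedContainers;
  }
  float starvation = reReservations
      / (reservedContainers
          * (1f - Math.min((float) requested / maxAlloc,
                           (float) (maxAlloc - minAlloc) / maxAlloc)));
  return starvation + requiredContainers - reservedContainers > 0;
}
{code}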

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = re-reservation / (#reserved-container * 
>  (1 - min(requested-resource / max-alloc, 
>   max-alloc - min-alloc / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 
> 0{code}
> I think we should be able to remove the starvation computation, just to check 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container

2018-02-22 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373039#comment-16373039
 ] 

Thomas Graves commented on YARN-7935:
-

[~mridulm80] what is the Spark JIRA for this?  If this goes in, Spark will 
still have to grab this from the env to pass in to the ExecutorRunnable.
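
Something like this is all I mean on the application side (a sketch only; the 
CONTAINER_HOSTNAME env var comes from this JIRA's proposal and the fallback is 
an assumption):

{code:java}
// Sketch of reading the proposed env var, falling back to the NodeManager host.
String bindHost = System.getenv("CONTAINER_HOSTNAME");
if (bindHost == null || bindHost.isEmpty()) {
  bindHost = System.getenv("NM_HOST");  // current behavior outside Docker bridge mode
}
{code}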

> Expose container's hostname to applications running within the docker 
> container
> ---
>
> Key: YARN-7935
> URL: https://issues.apache.org/jira/browse/YARN-7935
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-7935.1.patch, YARN-7935.2.patch
>
>
> Some applications have a need to bind to the container's hostname (like 
> Spark) which is different from the NodeManager's hostname(NM_HOST which is 
> available as an env during container launch) when launched through Docker 
> runtime. The container's hostname can be exposed to applications via an env 
> CONTAINER_HOSTNAME. Another potential candidate is the container's IP but 
> this can be addressed in a separate jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container

2018-02-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374598#comment-16374598
 ] 

Thomas Graves commented on YARN-7935:
-

Thanks for the explanation Mridul. I'm fine with waiting on the Spark JIRA 
until you know the scope better. I'm currently not doing anything with bridge 
mode, so I won't be able to help there at this point.

> Expose container's hostname to applications running within the docker 
> container
> ---
>
> Key: YARN-7935
> URL: https://issues.apache.org/jira/browse/YARN-7935
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-7935.1.patch, YARN-7935.2.patch
>
>
> Some applications have a need to bind to the container's hostname (like 
> Spark) which is different from the NodeManager's hostname(NM_HOST which is 
> available as an env during container launch) when launched through Docker 
> runtime. The container's hostname can be exposed to applications via an env 
> CONTAINER_HOSTNAME. Another potential candidate is the container's IP but 
> this can be addressed in a separate jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache

2018-11-09 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681525#comment-16681525
 ] 

Thomas Graves commented on YARN-8991:
-

[~teonadi] can you clarify here?  Are you saying it's not getting cleaned up 
while the Spark application is still running, or that it's not getting cleaned 
up after the Spark application finishes?

> nodemanager not cleaning blockmgr directories inside appcache 
> --
>
> Key: YARN-8991
> URL: https://issues.apache.org/jira/browse/YARN-8991
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Hidayat Teonadi
>Priority: Major
> Attachments: yarn-nm-log.txt
>
>
> Hi, I'm running spark on yarn and have enabled the Spark Shuffle Service. I'm 
> noticing that during the lifetime of my spark streaming application, the nm 
> appcache folder is building up with blockmgr directories (filled with 
> shuffle_*.data).
> Looking into the nm logs, it seems like the blockmgr directories is not part 
> of the cleanup process of the application. Eventually disk will fill up and 
> app will crash. I have both 
> {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and 
> {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think its 
> a configuration issue.
> What is stumping me is the executor ID listed by spark during the external 
> shuffle block registration doesn't match the executor ID listed in yarn's nm 
> log. Maybe this executorID disconnect explains why the cleanup is not done ? 
> I'm assuming that blockmgr directories are supposed to be cleaned up ?
>  
> {noformat}
> 2018-11-05 15:01:21,349 INFO 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered 
> executor AppExecId{appId=application_1541045942679_0193, execId=1299} with 
> ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42],
>  subDirsPerLocalDir=64, 
> shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
>  {noformat}
>  
> seems similar to https://issues.apache.org/jira/browse/YARN-7070, although 
> I'm not sure if the behavior I'm seeing is spark use related.
> [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files]
>  has a stop gap solution of cleaning up via cron.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache

2018-11-12 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683946#comment-16683946
 ] 

Thomas Graves commented on YARN-8991:
-

If it's while it's running, then you should file this with Spark. It's very 
similar to https://issues.apache.org/jira/browse/SPARK-17233.

The Spark external shuffle service doesn't support that at this point.   The 
problem is that you may have a Spark executor running on one host that 
generates some map output data to shuffle and then exits because it's not 
needed anymore.  When a reduce starts, it just talks to the YARN nodemanager 
and the external shuffle service to get the map output.   Now there is no 
executor left on the node to clean up the shuffle output.   Support would have 
to be added for something like the driver telling the Spark external shuffle 
service to clean up.

If you don't use dynamic allocation and the external shuffle service, it should 
clean up properly.
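
In other words, with settings like the following (shown only for illustration; 
these are the standard Spark properties involved), the executors own their 
shuffle files and they get cleaned up when the application ends:

{noformat}
# Sketch: run without dynamic allocation / the external shuffle service,
# which is the setup where cleanup happens properly.
spark.dynamicAllocation.enabled  false
spark.shuffle.service.enabled    false
{noformat}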

> nodemanager not cleaning blockmgr directories inside appcache 
> --
>
> Key: YARN-8991
> URL: https://issues.apache.org/jira/browse/YARN-8991
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Hidayat Teonadi
>Priority: Major
> Attachments: yarn-nm-log.txt
>
>
> Hi, I'm running Spark on YARN and have enabled the Spark shuffle service. I'm 
> noticing that during the lifetime of my Spark streaming application, the NM 
> appcache folder is filling up with blockmgr directories (filled with 
> shuffle_*.data).
> Looking into the NM logs, it seems like the blockmgr directories are not part 
> of the application's cleanup process. Eventually the disk will fill up and the 
> app will crash. I have both 
> {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and 
> {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think it's 
> a configuration issue.
> What is stumping me is that the executor ID listed by Spark during the external 
> shuffle block registration doesn't match the executor ID listed in YARN's NM 
> log. Maybe this executor ID mismatch explains why the cleanup is not done?
> I'm assuming that blockmgr directories are supposed to be cleaned up?
>  
> {noformat}
> 2018-11-05 15:01:21,349 INFO 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered 
> executor AppExecId{appId=application_1541045942679_0193, execId=1299} with 
> ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42],
>  subDirsPerLocalDir=64, 
> shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
>  {noformat}
>  
> Seems similar to https://issues.apache.org/jira/browse/YARN-7070, although 
> I'm not sure if the behavior I'm seeing is related to how Spark is used.
> [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files]
>  has a stop-gap solution of cleaning up via cron.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7204) Localizer errors on archive without any files

2017-09-15 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-7204:
---

 Summary: Localizer errors on archive without any files
 Key: YARN-7204
 URL: https://issues.apache.org/jira/browse/YARN-7204
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.1
Reporter: Thomas Graves


If a user sends an archive without any files in it (only directories), YARN 
fails to localize it with the error below.  I ran into this specifically while 
running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: 
https://axonitered-jt1.red.ygrid.yahoo.com:50508/applicationhistory/app/application_1505252418630_25423
 Then click on links to logs of each attempt.
. Failing the application. 
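
A hedged illustration of the kind of defensive check that would avoid the ENOENT above; this is not the actual YARN fix, just a sketch of guarding the chmod on a path that may never have been materialized (the path used here is illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Hypothetical guard, not the actual localizer patch: only chmod paths that
// really exist, so an archive that unpacked to directories only (or to
// nothing) does not trip ENOENT during permission fix-up.
public class SafeChmodExample {
  public static void main(String[] args) throws Exception {
    FileContext files = FileContext.getLocalFSFileContext(new Configuration());
    Path dst = new Path("/tmp/localized-archive");   // illustrative path
    FsPermission perm = new FsPermission((short) 0755);
    if (files.util().exists(dst)) {
      files.setPermission(dst, perm);
    }
  }
}
{code}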



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7204) Localizer errors on archive without any files

2017-09-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-7204:

Description: 
If a user sends an archive without any files in it (only directories), YARN 
fails to localize it with the error below.  I ran into this specifically while 
running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: 
https://rm.com:50508/applicationhistory/app/application_1505252418630_25423 
Then click on links to logs of each attempt.
. Failing the application. 

  was:
If a user sends an archive without any files in it (only directories), YARN 
fails to localize it with the error below.  I ran into this specifically while 
running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at

[jira] [Updated] (YARN-7204) Localizer errors on archive without any files

2017-09-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-7204:

Description: 
If a user sends an archive without any files in it (only directories), YARN 
fails to localize it with the error below.  I ran into this specifically while 
running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: 
https://rm.com:50708/applicationhistory/app/application_1505252418630_25423 
Then click on links to logs of each attempt.
. Failing the application. 

  was:
If a user sends an archive without any files in it (only directories), YARN 
fails to localize it with the error below.  I ran into this specifically while 
running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at

[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

Attaching the same patch to kick Jenkins.


> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above two cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 
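
A minimal sketch of the swap idea in the last paragraph, with made-up names; it is only meant to show the decision, not the actual CapacityScheduler code:

{code:java}
// Illustrative only: when a node heartbeat shows enough free capacity for a
// request that already has a reservation on some other node, allocate here
// and release the old reservation instead of skipping the node.
public class ReservationSwapSketch {
  static String onNodeUpdate(long nodeFreeMb, long requestMb,
                             boolean hasReservationElsewhere) {
    if (nodeFreeMb >= requestMb) {
      return hasReservationElsewhere
          ? "unreserve the old node and allocate here"
          : "allocate here";
    }
    return "reserve (or keep waiting) as before";
  }

  public static void main(String[] args) {
    // A 12 GB request with 16 GB free on this node and a stale reservation
    // elsewhere gets swapped to a real allocation.
    System.out.println(onNodeUpdate(16384, 12288, true));
  }
}
{code}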



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

Update tests to handle SystemMetricsPublisher

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above two cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

fix patch

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above two cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above two cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148098#comment-14148098
 ] 

Thomas Graves commented on YARN-1769:
-

We've been running this on a cluster for quite a while now and it's showing great 
improvements in the time to get larger containers.  I would like to put this in.

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above two cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149565#comment-14149565
 ] 

Thomas Graves commented on YARN-1769:
-

Thanks for the review Jason. I'll update the patch and remove some of the 
logging or make it truly debug.

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above two cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

Patch with log statements changed to debug.

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers).  Anytime it hits the limit of number reserved it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above two cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-443) allow OS scheduling priority of NM to be different than the containers it launches

2014-10-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185578#comment-14185578
 ] 

Thomas Graves commented on YARN-443:


Can you be more specific about what is different and why it is a problem? 
The trunk patch shows that there was an existing getRunCommand() routine 
before this change, whereas the other branch didn't have one before (it looks 
like it was added for Windows support).

> allow OS scheduling priority of NM to be different than the containers it 
> launches
> --
>
> Key: YARN-443
> URL: https://issues.apache.org/jira/browse/YARN-443
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.0.3-alpha, 0.23.6
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 0.23.7, 2.0.4-alpha
>
> Attachments: YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, 
> YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, 
> YARN-443-branch-2.patch, YARN-443-branch-2.patch, YARN-443-branch-2.patch, 
> YARN-443.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch, 
> YARN-443.patch, YARN-443.patch, YARN-443.patch
>
>
> It would be nice if we could have the nodemanager run at a different OS 
> scheduling priority than the containers so that you can still communicate 
> with the nodemanager if the containers get out of control.  
> On Linux we could launch the nodemanager at a higher priority, but then all 
> the containers it launches would also be at that higher priority, so we need 
> a way for the container executor to launch them at a lower priority.
> I'm not sure how this applies to Windows, if at all.
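
A minimal sketch of the Linux side of this, assuming the launch command is built as a string and the priority adjustment comes from configuration; the method and key names are illustrative rather than the real ContainerExecutor API:

{code:java}
// Sketch only: run containers at a lower OS priority than the NodeManager by
// prefixing the launch command with "nice". An adjustment of 0 means "same
// priority as the NM"; larger values lower the container's priority.
public class NicePrefixSketch {
  static String getRunCommand(String command, int priorityAdjustment) {
    if (priorityAdjustment > 0) {
      return "nice -n " + priorityAdjustment + " " + command;
    }
    return command;
  }

  public static void main(String[] args) {
    // e.g. a (hypothetical) nodemanager config key mapped to an adjustment of 5
    System.out.println(getRunCommand("bash ./launch_container.sh", 5));
  }
}
{code}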



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2828) Enable auto refresh of web pages (using http parameter)

2014-11-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202033#comment-14202033
 ] 

Thomas Graves commented on YARN-2828:
-

Auto refresh was removed because some pages load a lot of data and you may not 
actually want them to update. It can make debugging harder if you are looking at 
a lot of data and the screen keeps refreshing on you.

I think the only way to bring it back is to make it optional.

> Enable auto refresh of web pages (using http parameter)
> ---
>
> Key: YARN-2828
> URL: https://issues.apache.org/jira/browse/YARN-2828
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Tim Robertson
>Priority: Minor
>
> The MR1 Job Tracker had a useful HTTP parameter, e.g. "&refresh=3", that 
> could be appended to URLs to enable a periodic page reload.  This was very useful 
> when developing MapReduce jobs, especially to watch counters changing.  This 
> is lost in the YARN interface.
> It could be implemented as a page element (e.g. a drop-down or similar), but I'd 
> recommend that the page not be made more cluttered, and simply bring back the 
> optional "refresh" HTTP param.  It worked really nicely.
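
For illustration, a hedged sketch of honoring an optional refresh parameter; this is plain servlet code rather than the Hamlet-based webapp framework the RM pages actually use, so treat it as the idea only:

{code:java}
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch only: reload the page every N seconds, but only when the caller
// explicitly asked for it via "?refresh=N", so data-heavy pages are not
// refreshed out from under someone who is debugging.
public class RefreshParamServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    resp.setContentType("text/html");
    PrintWriter out = resp.getWriter();
    String refresh = req.getParameter("refresh");
    out.println("<html><head>");
    if (refresh != null && refresh.matches("\\d+")) {
      out.println("<meta http-equiv=\"refresh\" content=\"" + refresh + "\">");
    }
    out.println("</head><body>... page content ...</body></html>");
  }
}
{code}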



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-563) Add application type to ApplicationReport

2013-05-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662015#comment-13662015
 ] 

Thomas Graves commented on YARN-563:


I agree with Hitesh, I think this should be in the web UI and web services as 
well as the CLI. This could be very useful to anyone debugging their 
application through the web UI, SEs looking for patterns or issues with a 
particular type of application, tools using the web services to aggregate info 
and create their own useful experiences, etc. 

Mayank, I'm not sure what you consider attributes. Are you referring just to 
the filtering part? The web UI and web services print almost everything that is 
part of the application report.  

I'm OK with the web UI/web services parts being added under a separate jira, but I 
would have rather seen them done here with the CLI part. 

> Add application type to ApplicationReport 
> --
>
> Key: YARN-563
> URL: https://issues.apache.org/jira/browse/YARN-563
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Thomas Weise
>Assignee: Mayank Bansal
> Attachments: YARN-563-trunk-1.patch, YARN-563-trunk-2.patch, 
> YARN-563-trunk-3.patch, YARN-563-trunk-4.patch
>
>
> This field is needed to distinguish different types of applications (app 
> master implementations). For example, we may run applications of type XYZ in 
> a cluster alongside MR and would like to filter applications by type.
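
For context, a hedged sketch of the kind of client-side filtering this field enables; the API shape shown is from later Hadoop releases and is illustrative with respect to this jira:

{code:java}
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Sketch only: list running applications of one type (the type string is an
// example) using the application-type field carried in the report.
public class ListAppsByType {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();
    Set<String> types = Collections.singleton("MAPREDUCE");
    for (ApplicationReport report :
        client.getApplications(types, EnumSet.of(YarnApplicationState.RUNNING))) {
      System.out.println(report.getApplicationId() + " " + report.getApplicationType());
    }
    client.stop();
  }
}
{code}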

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-563) Add application type to ApplicationReport

2013-05-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662237#comment-13662237
 ] 

Thomas Graves commented on YARN-563:


Sorry if I wasn't clear. I think we might be mixing terms here. In my mind 
there is showing the field at all in the web UI/web services, and then there is the 
additional question of supporting filtering on it. I agree with you that the 
filtering part is separate. Showing it in the web UI and web services is, to me, 
the same thing as showing it in the output of the yarn application CLI 
command.  

> Add application type to ApplicationReport 
> --
>
> Key: YARN-563
> URL: https://issues.apache.org/jira/browse/YARN-563
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Thomas Weise
>Assignee: Mayank Bansal
> Attachments: YARN-563-trunk-1.patch, YARN-563-trunk-2.patch, 
> YARN-563-trunk-3.patch, YARN-563-trunk-4.patch
>
>
> This field is needed to distinguish different types of applications (app 
> master implementations). For example, we may run applications of type XYZ in 
> a cluster alongside MR and would like to filter applications by type.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-563) Add application type to ApplicationReport

2013-05-22 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664068#comment-13664068
 ] 

Thomas Graves commented on YARN-563:


Thanks Mayank, can you please update the web services documentation also?  
Similar to 
http://hadoop.apache.org/docs/r2.0.4-alpha/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html

It's in 
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm


> Add application type to ApplicationReport 
> --
>
> Key: YARN-563
> URL: https://issues.apache.org/jira/browse/YARN-563
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Thomas Weise
>Assignee: Mayank Bansal
> Attachments: YARN-563-trunk-10-jenkins.patch, 
> YARN-563-trunk-10-review.patch, YARN-563-trunk-1.patch, 
> YARN-563-trunk-2.patch, YARN-563-trunk-3.patch, YARN-563-trunk-4.patch, 
> YARN-563-trunk-5.patch, YARN-563-trunk-6.patch, YARN-563-trunk-7.patch, 
> YARN-563-trunk-8.patch, YARN-563-trunk-9-jenkins.patch, 
> YARN-563-trunk-9-review.patch
>
>
> This field is needed to distinguish different types of applications (app 
> master implementations). For example, we may run applications of type XYZ in 
> a cluster alongside MR and would like to filter applications by type.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT

2013-05-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-126:
---

Target Version/s: 3.0.0, 2.0.5-beta, 0.23.9  (was: 3.0.0, 2.0.5-beta, 
0.23.8)

> yarn rmadmin help message contains reference to hadoop cli and JT
> -
>
> Key: YARN-126
> URL: https://issues.apache.org/jira/browse/YARN-126
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.0.3-alpha
>Reporter: Thomas Graves
>Assignee: Rémy SAISSY
>  Labels: usability
> Attachments: YARN-126.patch
>
>
> It has an option to specify a job tracker, and the last line for the general command 
> line syntax had "bin/hadoop command [genericOptions] [commandOptions]".
> ran "yarn rmadmin" to get usage:
> RMAdmin
> Usage: java RMAdmin
>[-refreshQueues]
>[-refreshNodes]
>[-refreshUserToGroupsMappings]
>[-refreshSuperUserGroupsConfiguration]
>[-refreshAdminAcls]
>[-refreshServiceAcl]
>[-help [cmd]]
> Generic options supported are
> -conf  specify an application configuration file
> -D use value for given property
> -fs   specify a namenode
> -jt specify a job tracker
> -files specify comma separated files to be 
> copied to the map reduce cluster
> -libjars specify comma separated jar files 
> to include in the classpath.
> -archives specify comma separated 
> archives to be unarchived on the compute machines.
> The general command line syntax is
> bin/hadoop command [genericOptions] [commandOptions]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-459) DefaultContainerExecutor doesn't log stderr from container launch

2013-05-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-459:
---

Target Version/s: 2.0.5-beta, 0.23.9  (was: 2.0.4-alpha, 0.23.8)

> DefaultContainerExecutor doesn't log stderr from container launch
> -
>
> Key: YARN-459
> URL: https://issues.apache.org/jira/browse/YARN-459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.3-alpha, 0.23.7
>Reporter: Thomas Graves
>Assignee: Sandy Ryza
>
> The DefaultContainerExecutor does not log stderr or add it to the diagnostics 
> message if something fails during the container launch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2013-06-03 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673458#comment-13673458
 ] 

Thomas Graves commented on YARN-276:


Nemon, Sorry it appears this got lost in the shuffle and it no longer applies, 
could you update the patch for the current trunk/branch-2?

> Capacity Scheduler can hang when submit many jobs concurrently
> --
>
> Key: YARN-276
> URL: https://issues.apache.org/jira/browse/YARN-276
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.0.0, 2.0.1-alpha
>Reporter: nemon lou
>Assignee: nemon lou
>  Labels: incompatible
> Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity 
> Scheduler can hang with most resources taken up by AMs and not enough 
> resources left for tasks. All applications then hang there.
> The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not 
> checked directly. Instead, this property is only used to compute 
> maxActiveApplications, and maxActiveApplications is computed from the 
> minimumAllocation (not from what the AMs actually use).
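
A hedged sketch of the direct check being asked for here, i.e. comparing AM resources actually in use against maximum-am-resource-percent of the queue instead of relying only on a maxActiveApplications count derived from the minimum allocation; all names and numbers are illustrative:

{code:java}
// Illustrative only, not the actual CapacityScheduler code: activate a new
// application master only if the AM memory already in use plus the new AM
// request stays under maximum-am-resource-percent of the queue's capacity.
public class AmPercentCheckSketch {
  static boolean canActivate(long amUsedMb, long amRequestMb,
                             long queueCapacityMb, float maxAmResourcePercent) {
    long amLimitMb = (long) (queueCapacityMb * maxAmResourcePercent);
    return amUsedMb + amRequestMb <= amLimitMb;
  }

  public static void main(String[] args) {
    // 10% of a 100 GB queue leaves ~10 GB for AMs; a further 4 GB AM on top of
    // 8 GB already used would exceed it, so the application waits.
    System.out.println(canActivate(8192, 4096, 102400, 0.1f)); // false
  }
}
{code}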

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-750) Allow for black-listing resources in CS

2013-06-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674461#comment-13674461
 ] 

Thomas Graves commented on YARN-750:


The javadoc warnings are complaining about the "+ * {@link ResourceRequest#ANY}" line in 
InvalidBlacklistRequestException.java.

> Allow for black-listing resources in CS
> ---
>
> Key: YARN-750
> URL: https://issues.apache.org/jira/browse/YARN-750
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
> Attachments: YARN-750.patch, YARN-750.patch, YARN-750.patch, 
> YARN-750.patch, YARN-750.patch, YARN-750.patch
>
>
> YARN-392 and YARN-398 enhance scheduler api to allow for white-lists of 
> resources.
> This jira is a companion to allow for black-listing (in CS).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (YARN-750) Allow for black-listing resources in CS

2013-06-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674461#comment-13674461
 ] 

Thomas Graves edited comment on YARN-750 at 6/4/13 2:59 PM:


The javadoc warnings are complaining about the "+ * {@link ResourceRequest#ANY}" line in 
InvalidBlacklistRequestException.java.

Update: ignore this; it looks like Arun updated the patch as I was commenting.

  was (Author: tgraves):
javadocs warnings are complaining about + * {@link ResourceRequest#ANY} in 
InvalidBlacklistRequestException.java
.
  
> Allow for black-listing resources in CS
> ---
>
> Key: YARN-750
> URL: https://issues.apache.org/jira/browse/YARN-750
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
> Attachments: YARN-750.patch, YARN-750.patch, YARN-750.patch, 
> YARN-750.patch, YARN-750.patch, YARN-750.patch
>
>
> YARN-392 and YARN-398 enhance scheduler api to allow for white-lists of 
> resources.
> This jira is a companion to allow for black-listing (in CS).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-750) Allow for black-listing resources in CS

2013-06-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674527#comment-13674527
 ] 

Thomas Graves commented on YARN-750:


Can we make BlacklistRequestPBImpl immutable since we are changing that in 
other places (YARN-735)?

> Allow for black-listing resources in CS
> ---
>
> Key: YARN-750
> URL: https://issues.apache.org/jira/browse/YARN-750
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun C Murthy
>Assignee: Arun C Murthy
> Attachments: YARN-750.patch, YARN-750.patch, YARN-750.patch, 
> YARN-750.patch, YARN-750.patch, YARN-750.patch
>
>
> YARN-392 and YARN-398 enhance scheduler api to allow for white-lists of 
> resources.
> This jira is a companion to allow for black-listing (in CS).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2013-06-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675315#comment-13675315
 ] 

Thomas Graves commented on YARN-276:


Thanks Nemon, I'm still reviewing it; here are a couple of things so far.  I 
hope to finish reviewing later tonight. 

- LeafQueue - please wrap at 80 characters
- LeafQueue - please use the @VisibleForTesting annotation on 
setMaxAMResourcePerQueuePerUserPercent
- FicaSchedulerApp - "for" is misspelled as "foe"
- FicaSchedulerApp - please use the @VisibleForTesting annotation on 
setAMResource

I ran a few tests and looked at the scheduler web UI for the queue I was running 
in, and the used resources and AM used resources showed up blank even though 
there were jobs running. Can you please take a look to see why?  The REST web 
services calls were returning values for those fields.

> Capacity Scheduler can hang when submit many jobs concurrently
> --
>
> Key: YARN-276
> URL: https://issues.apache.org/jira/browse/YARN-276
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.0.0, 2.0.1-alpha
>Reporter: nemon lou
>Assignee: nemon lou
>  Labels: incompatible
> Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity 
> Scheduler can hang with most resources taken up by AMs and not enough 
> resources left for tasks. All applications then hang there.
> The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not 
> checked directly. Instead, this property is only used to compute 
> maxActiveApplications, and maxActiveApplications is computed from the 
> minimumAllocation (not from what the AMs actually use).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2013-06-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675574#comment-13675574
 ] 

Thomas Graves commented on YARN-276:


I need to spend more time looking through the new logic; here are a few more 
comments for now.

Remove the comment from overAMUsedPercent about max active applications since 
it's no longer present.

FicaSchedulerApp getAMResource: change amRequedResource -> amRequiredResource.

The max active applications per user used to use the absolute queue capacity 
instead of the absolute max queue capacity. It was changed to use the absolute 
capacity because it uses the userLimitFactor in the calculation, which should 
be applied to the capacity and not the max capacity (see MAPREDUCE-3897 for more 
details).  We should change the overAMUsedPercentPerUser similarly to use 
absolute capacity, not absolute max capacity.

This can be filed as a separate jira since it was pre-existing, but a bad app 
that requests 0 for the memory could cause a divide-by-zero exception.
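
On the divide-by-zero point, a hedged sketch of the kind of guard being suggested; the names and the fallback value are illustrative, not the real scheduler code:

{code:java}
// Illustrative guard: if a badly formed request asks for 0 MB for its AM,
// clamp it to the scheduler minimum before using it as a divisor.
public class AmRequestGuardSketch {
  static int maxConcurrentAms(long queueAmLimitMb, long amRequestMb,
                              long schedulerMinimumMb) {
    long divisor = amRequestMb > 0 ? amRequestMb : schedulerMinimumMb;
    return (int) (queueAmLimitMb / divisor);
  }

  public static void main(String[] args) {
    // A 0 MB AM request would otherwise throw ArithmeticException here.
    System.out.println(maxConcurrentAms(8192, 0, 1024)); // 8
  }
}
{code}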


> Capacity Scheduler can hang when submit many jobs concurrently
> --
>
> Key: YARN-276
> URL: https://issues.apache.org/jira/browse/YARN-276
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.0.0, 2.0.1-alpha
>Reporter: nemon lou
>Assignee: nemon lou
>  Labels: incompatible
> Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity 
> Scheduler can hang with most resources taken up by AMs and not enough 
> resources left for tasks. All applications then hang there.
> The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not 
> checked directly. Instead, this property is only used to compute 
> maxActiveApplications, and maxActiveApplications is computed from the 
> minimumAllocation (not from what the AMs actually use).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-764) blank Used Resources on Capacity Scheduler page

2013-06-05 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675917#comment-13675917
 ] 

Thomas Graves commented on YARN-764:


Nemon, thanks for looking at this.  I guess it's because it uses the raw "<" 
character instead of the escaped "&lt;". 

Another option, which I would prefer, is just to escape the string using 
StringEscapeUtils.escapeHtml() in CapacitySchedulerPage.java.  That way, if 
someone adds something else to the string in the future or it accidentally gets 
changed back, it will still work.
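
A minimal sketch of that suggestion, using the commons-lang StringEscapeUtils; the sample string mirrors what Resource#toString() produces and is only an example:

{code:java}
import org.apache.commons.lang.StringEscapeUtils;

// Sketch only: escape the rendered string so a value like
// "<memory:4096, vCores:2>" is not swallowed by the browser as an HTML tag.
public class EscapeUsedResources {
  public static void main(String[] args) {
    String usedResources = "<memory:4096, vCores:2>";  // example value
    String safe = StringEscapeUtils.escapeHtml(usedResources);
    System.out.println(safe);  // &lt;memory:4096, vCores:2&gt;
  }
}
{code}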

> blank Used Resources on Capacity Scheduler page 
> 
>
> Key: YARN-764
> URL: https://issues.apache.org/jira/browse/YARN-764
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.4-alpha
>Reporter: nemon lou
>Assignee: nemon lou
> Attachments: YARN-764.patch
>
>
> Even when there are jobs running, used resources is empty on the Capacity 
> Scheduler page for the leaf queue. (I use Google Chrome on Windows 7.)
> After changing Resource.java's toString method to replace "<>" with 
> "{}", this bug gets fixed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-768) RM crashes due to DNS issue

2013-06-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677026#comment-13677026
 ] 

Thomas Graves commented on YARN-768:


This is a dup of YARN-713. It's had more work done on it, so let's use that one.

> RM crashes due to DNS issue
> ---
>
> Key: YARN-768
> URL: https://issues.apache.org/jira/browse/YARN-768
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: PengZhang
> Attachments: YARN-768_v1.patch
>
>
> I encountered the problem described in MAPREDUCE-4295, and I think that patch 
> has been removed since YARN-39.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable

2013-06-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677027#comment-13677027
 ] 

Thomas Graves commented on YARN-713:


Note that this had been fixed at one time by MAPREDUCE-4295, but was lost.

> ResourceManager can exit unexpectedly if DNS is unavailable
> ---
>
> Key: YARN-713
> URL: https://issues.apache.org/jira/browse/YARN-713
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Priority: Critical
> Attachments: YARN-713.patch, YARN-713.patch
>
>
> As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could 
> lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and 
> that ultimately would cause the RM to exit.  The RM should not exit during 
> DNS hiccups.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable

2013-06-06 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-713:
---

Assignee: Maysam Yabandeh

> ResourceManager can exit unexpectedly if DNS is unavailable
> ---
>
> Key: YARN-713
> URL: https://issues.apache.org/jira/browse/YARN-713
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.1.0-beta
>Reporter: Jason Lowe
>Assignee: Maysam Yabandeh
>Priority: Critical
> Attachments: YARN-713.patch, YARN-713.patch
>
>
> As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could 
> lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and 
> that ultimately would cause the RM to exit.  The RM should not exit during 
> DNS hiccups.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-764) blank Used Resources on Capacity Scheduler page

2013-06-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677340#comment-13677340
 ] 

Thomas Graves commented on YARN-764:


+1, Thanks Nemon, I'll commit this shortly.

> blank Used Resources on Capacity Scheduler page 
> 
>
> Key: YARN-764
> URL: https://issues.apache.org/jira/browse/YARN-764
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.4-alpha
>Reporter: nemon lou
>Assignee: nemon lou
> Attachments: YARN-764.patch, YARN-764.patch
>
>
> Even when there are jobs running, used resources is empty on the Capacity 
> Scheduler page for the leaf queue. (I use Google Chrome on Windows 7.)
> After changing Resource.java's toString method to replace "<>" with 
> "{}", this bug gets fixed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently

2013-06-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677544#comment-13677544
 ] 

Thomas Graves commented on YARN-276:


Thanks for the updates, some comments:

- we need to escapeHtml the AM used resources, similar to YARN-764

- I think you should put back maxAMResourcePerQueuePerUserPercent. The main 
reason is that it's useful to show to users so that they know what limit they 
might be hitting; otherwise their job could be waiting to activate while the UI 
doesn't show them any limit. The overAMUsedPercentPerUser should use the 
Capacity, not maxCapacity.

The per user checks need to take into account the minimum user percent as 
well as the user limit factor (like they did in the previous version of the 
patch). Ideally this would be figured out dynamically instead of being hardcoded 
like before, since you could have a user limit percent of, say, 20%, but if 
there are only two users each user really gets 50%. That could be complicated 
based on the timing of things. The downside to the dynamic approach is that it 
makes it much harder for users to understand why their job might not be 
launched. It might make more sense to keep the formula similar to before, where 
it uses both the user limit factor and the user limit percent (a sketch of that 
static formula follows this list), and file a separate jira to investigate 
making it more dynamic. That jira could also look into addressing the AM 
resource percent applying to the absolute max capacity.

- can you update the web services documentation 
(./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm)

- we can remove the "Per Queue" from the web ui: Max AM Resource Per Queue 
Percent. I think we can remove the "PerQueue" bit from the REST web services 
too: maxAMResourcePerQueuePercent -> maxAMResourcePercent

- we are keeping the AM used resource percent at the user level. It might be 
nice to output this at least through the REST webservices. It would be nice to 
have in the UI too, but I'm a bit afraid it's going to get too cluttered there.

- the REST webservices printout of the amUsedResources should be of type 
ResourceInfo so that you get it in separate fields like:

<memory>4096</memory>
<vCores>2</vCores>

The old format that we kept for backwards compatibility was 
<memory:4096, vCores:2>. We don't need that 
format since this is new.


- TestApplicationLimits - remove the old comment -  // set max active to 2
- TestApplicationLimits - why are you multiplying by the userLimitFactor?
+Resource queueResource = Resources.multiply(clusterResources,
+queue.getAbsoluteCapacity() * queue.getUserLimitFactor());

- what are the changes in TestClientTokens.java?

- In the MiniYarnCluster why are we setting the AM resource percent to 100%?
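
To make the per-user limit point above concrete, here is a rough Java sketch of
the static formula being suggested (keep both user limit percent and user limit
factor). All class, method, and parameter names are illustrative only; the real
CapacityScheduler code is organized differently.

{noformat}
// Rough sketch of the per-user AM limit formula discussed above; all names
// are illustrative and do not match the real CapacityScheduler classes.
public class PerUserAmLimitSketch {

  /**
   * Static approximation of a per-user cap on memory held by running AMs.
   *
   * @param queueMemoryMb        memory available to the queue, in MB
   * @param maxAmResourcePercent maximum-am-resource-percent for the queue (0.0 to 1.0)
   * @param userLimitPercent     minimum-user-limit-percent for the queue (0 to 100)
   * @param userLimitFactor      user-limit-factor for the queue
   */
  static long perUserAmMemoryLimitMb(long queueMemoryMb, float maxAmResourcePercent,
                                     int userLimitPercent, float userLimitFactor) {
    long queueAmLimitMb = (long) (queueMemoryMb * maxAmResourcePercent);
    // Scale the queue-wide AM limit by the configured user share rather than
    // by the number of currently active users (the "dynamic" variant above).
    return (long) (queueAmLimitMb * (userLimitPercent / 100.0f) * userLimitFactor);
  }

  static boolean canActivate(long amMemoryUsedByUserMb, long requestedAmMemoryMb,
                             long perUserLimitMb) {
    // Only activate another application if this user's AMs stay under the cap.
    return amMemoryUsedByUserMb + requestedAmMemoryMb <= perUserLimitMb;
  }

  public static void main(String[] args) {
    long limitMb = perUserAmMemoryLimitMb(100 * 1024, 0.1f, 20, 1.0f);
    System.out.println("per-user AM limit (MB): " + limitMb);
    System.out.println("activate a 2048 MB AM: " + canActivate(0, 2048, limitMb));
  }
}
{noformat}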

> Capacity Scheduler can hang when submit many jobs concurrently
> --
>
> Key: YARN-276
> URL: https://issues.apache.org/jira/browse/YARN-276
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.0.0, 2.0.1-alpha
>Reporter: nemon lou
>Assignee: nemon lou
>  Labels: incompatible
> Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, 
> YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In hadoop 2.0.1, when I submit many jobs concurrently at the same time, the 
> Capacity Scheduler can hang with most resources taken up by AMs, leaving not 
> enough resources for tasks. All applications then hang there.
> The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not 
> checked directly. Instead, this property is only used for 
> maxActiveApplications, and maxActiveApplications is computed from 
> minimumAllocation (not from what the AMs actually use).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out

2013-07-01 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-862:
---

Target Version/s: 0.23.10  (was: 0.23.9)

> ResourceManager and NodeManager versions should match on node registration or 
> error out
> ---
>
> Key: YARN-862
> URL: https://issues.apache.org/jira/browse/YARN-862
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, resourcemanager
>Affects Versions: 0.23.8
>Reporter: Robert Parker
>Assignee: Robert Parker
> Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch
>
>
> For branch-0.23 the versions of the node manager and the resource manager 
> should match to complete a successful registration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-902) "Used Resources" field in Resourcemanager scheduler UI not displaying any values

2013-07-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701988#comment-13701988
 ] 

Thomas Graves commented on YARN-902:


Are you using the latest branch-2 or the released 2.0.5-alpha?  This might be a 
duplicate of YARN-764.

> "Used Resources" field in Resourcemanager scheduler UI not displaying any 
> values
> 
>
> Key: YARN-902
> URL: https://issues.apache.org/jira/browse/YARN-902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Nishan Shetty
>Priority: Minor
>
> "Used Resources" field in Resourcemanager scheduler UI not displaying any 
> values

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-964) Give a parameter that can set AM retry interval

2013-07-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718338#comment-13718338
 ] 

Thomas Graves commented on YARN-964:


How many NMs did you have? How would an AM retry interval have helped this?

> Give a parameter that can set  AM retry interval
> 
>
> Key: YARN-964
> URL: https://issues.apache.org/jira/browse/YARN-964
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: qus-jiawei
>
> Our AM retry number is 4.
> Because one NodeManager's disk is full, the AM container could not be 
> allocated on that NodeManager, but the RM retried the AM on the same NM every 
> 3 seconds.
> I think there should be a parameter to set the AM retry interval.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-589) Expose a REST API for monitoring the fair scheduler

2013-08-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734756#comment-13734756
 ] 

Thomas Graves commented on YARN-589:


Sorry for jumping in late on this: do we have another jira for adding 
documentation?

> Expose a REST API for monitoring the fair scheduler
> ---
>
> Key: YARN-589
> URL: https://issues.apache.org/jira/browse/YARN-589
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.3-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Fix For: 2.1.1-beta
>
> Attachments: fairscheduler.xml, YARN-589-1.patch, YARN-589-2.patch, 
> YARN-589.patch
>
>
> The fair scheduler should have an HTTP interface that exposes information 
> such as applications per queue, fair shares, demands, current allocations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-337) RM handles killed application tracking URL poorly

2013-08-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738968#comment-13738968
 ] 

Thomas Graves commented on YARN-337:


+1 looks good. Thanks Jason!  Feel free to commit it.

> RM handles killed application tracking URL poorly
> -
>
> Key: YARN-337
> URL: https://issues.apache.org/jira/browse/YARN-337
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.0.2-alpha, 0.23.5
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>  Labels: usability
> Attachments: YARN-337.patch
>
>
> When the ResourceManager kills an application, it leaves the proxy URL 
> redirecting to the original tracking URL for the application even though the 
> ApplicationMaster is no longer there to service it.  It should redirect it 
> somewhere more useful, like the RM's web page for the application, where the 
> user can find that the application was killed and links to the AM logs.
> In addition, sometimes the AM during teardown from the kill can attempt to 
> unregister and provide an updated tracking URL, but unfortunately the RM has 
> "forgotten" the AM due to the kill and refuses to process the unregistration. 
>  Instead it logs:
> {noformat}
> 2013-01-09 17:37:49,671 [IPC Server handler 2 on 8030] ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> AppAttemptId doesnt exist in cache appattempt_1357575694478_28614_01
> {noformat}
> It should go ahead and process the unregistration to update the tracking URL 
> since the application offered it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


  1   2   3   4   5   >