[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2

2020-01-28 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025285#comment-17025285
 ] 

Thomas Graves commented on YARN-8200:
-

After messing with this a bit more, I removed the maximum-allocation configuration after seeing that the documentation doesn't include it for the 2.10 release, so I removed this setting:


{code:xml}
<property>
  <name>yarn.resource-types.yarn.io/gpu.maximum-allocation</name>
  <value>4</value>
</property>
{code}

And it appears that YARN now doesn't allocate me a container unless it has fulfilled the full GPU request. In this case my NodeManager has 4 GPUs, so if I request 5 it just hangs waiting to fulfill the request. This behavior is much better than giving me a container with fewer GPUs than I requested.
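For context, the rest of the GPU setup follows the standard resource-types configuration; a minimal sketch, assuming the 2.10 backport uses the same property names as the 3.x GPU documentation:

{code:xml}
<!-- resource-types.xml: declare the GPU resource type -->
<property>
  <name>yarn.resource-types</name>
  <value>yarn.io/gpu</value>
</property>

<!-- yarn-site.xml on the NodeManagers: enable the GPU resource plugin -->
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>

<!-- capacity-scheduler.xml: DominantResourceCalculator is needed so the
     scheduler considers resources other than memory -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
{code}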

 

> Backport resource types/GPU features to branch-3.0/branch-2
> ---
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: YARN-8200-branch-2.001.patch, 
> YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch, 
> YARN-8200-branch-3.0.001.patch, 
> counter.scheduler.operation.allocate.csv.defaultResources, 
> counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support 
> deep learning workloads. However, our main production clusters are running 
> older versions of branch-2 (2.7 in our case). To prevent supporting too many 
> very different hadoop versions across multiple clusters, we would like to 
> backport the resource types/resource profiles feature to branch-2, as well as 
> the GPU specific support.
>  
> We have done a trial backport of YARN-3926 and some miscellaneous patches in 
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth. 
> We also did a trial backport of most of YARN-6223 (sans docker support).
>  
> Regarding the backports, perhaps we can do the development in a feature 
> branch and then merge to branch-2 when ready.






[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2

2020-01-28 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025156#comment-17025156
 ] 

Thomas Graves commented on YARN-8200:
-

Hey [~jhung],

I am trying out the GPU scheduling in Hadoop 2.10, and the first thing I noticed is that it doesn't error properly if you ask for too many GPUs. It seems to happily say it gave them to me, although I think it's really giving me the configured maximum. Is this a known issue already, or did the configuration change?

I have the GPU maximum configured at 4 and I try to allocate 8. On Hadoop 3 I get:

 

Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException):
 Invalid resource request, requested resource type=[yarn.io/gpu] < 0 or greater 
than maximum allowed allocation. Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by 
scheduler based on maximum resource of registered NodeManagers, which might be 
less than configured maximum allocation=

 

On Hadoop 2.10 I get a container allocated, but the logs and the UI say it only has 4 GPUs.
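
For reproduction, the request looks roughly like this (a sketch only; I'm assuming the 2.10 backport exposes the same resource-types client API as 3.x):

{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Ask for a container with 4 GB, 1 vcore and 8 GPUs -- more GPUs than any
// NodeManager has. Hadoop 3 rejects this with InvalidResourceRequestException;
// on 2.10 the container comes back with only the configured maximum of 4.
Resource capability = Resource.newInstance(4096, 1);
capability.setResourceValue("yarn.io/gpu", 8);

AMRMClient.ContainerRequest request = new AMRMClient.ContainerRequest(
    capability, null, null, Priority.newInstance(0));
// amRMClient.addContainerRequest(request);  // issued from a running AM
{code}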

> Backport resource types/GPU features to branch-3.0/branch-2
> ---
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: YARN-8200-branch-2.001.patch, 
> YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch, 
> YARN-8200-branch-3.0.001.patch, 
> counter.scheduler.operation.allocate.csv.defaultResources, 
> counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support 
> deep learning workloads. However, our main production clusters are running 
> older versions of branch-2 (2.7 in our case). To prevent supporting too many 
> very different hadoop versions across multiple clusters, we would like to 
> backport the resource types/resource profiles feature to branch-2, as well as 
> the GPU specific support.
>  
> We have done a trial backport of YARN-3926 and some miscellaneous patches in 
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth. 
> We also did a trial backport of most of YARN-6223 (sans docker support).
>  
> Regarding the backports, perhaps we can do the development in a feature 
> branch and then merge to branch-2 when ready.






[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues

2019-01-08 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737392#comment-16737392
 ] 

Thomas Graves commented on YARN-9116:
-

Yes, so you want to keep the behavior that the cluster-level maximum is the absolute maximum and no child queue can be larger than that; otherwise it breaks backwards compatibility.
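
In other words, the cluster-wide value stays the ceiling and a queue can only be configured up to it; a minimal sketch with a hypothetical queue named "large":

{code:xml}
<!-- yarn-site.xml: cluster-level maximum, the absolute ceiling -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>122880</value>
</property>

<!-- capacity-scheduler.xml: per-queue override from YARN-1582;
     must not exceed the cluster-level value above -->
<property>
  <name>yarn.scheduler.capacity.root.large.maximum-allocation-mb</name>
  <value>122880</value>
</property>
{code}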

> Capacity Scheduler: add the default maximum-allocation-mb and 
> maximum-allocation-vcores for the queues
> --
>
> Key: YARN-9116
> URL: https://issues.apache.org/jira/browse/YARN-9116
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9116.1.patch
>
>
> YARN-1582 adds the support of maximum-allocation-mb configuration per queue 
> which is targeting to support larger container features on dedicated queues 
> (larger maximum-allocation-mb/maximum-allocation-vcores for such queue) . 
> While to achieve larger container configuration, we need to increase the 
> global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and 
> then override those configurations with desired values on the queues since 
> queue configuration can't be larger than cluster configuration. There are 
> many queues in the system and if we forget to configure such values when 
> adding a new queue, then such queue gets default 120G/256 which typically is 
> not what we want.  
> We can come up with a queue-default configuration (set to normal queue 
> configuration like 16G/8), so the leaf queues gets such values by default.






[jira] [Commented] (YARN-9055) Capacity Scheduler: allow larger queue level maximum-allocation-mb to override the cluster configuration

2018-11-27 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700511#comment-16700511
 ] 

Thomas Graves commented on YARN-9055:
-

It would definitely be a change in behavior which could surprise people with existing configurations. I do think it's easier this way, since you don't have to configure all the queues. I don't remember all the details of why I did it this way; I think it was mostly to not break the existing functionality of the cluster-level max.

> Capacity Scheduler: allow larger queue level maximum-allocation-mb to 
> override the cluster configuration
> 
>
> Key: YARN-9055
> URL: https://issues.apache.org/jira/browse/YARN-9055
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9055.1.patch
>
>
> YARN-1582 adds the support of maximum-allocation-mb configuration per queue. 
> That feature gives the flexibility to give different memory requirements for 
> different queues. Such patch adds the limitation that the queue level 
> configuration can't exceed the cluster level default configuration, but I 
> feel it may make more sense to remove such limitation to allow any overrides 
> since 
> # Such configuration is controlled by the admin so it shouldn't get abused; 
> # It's common that typical queues require standard size containers while some 
> job (queues) have requirements for larger containers. With current 
> limitation, we have to set larger configuration on the cluster setting which 
> will cause resource abuse unless we override them on all the queues.
> We can remove such limitation in CapacitySchedulerConfiguration.java so the 
> cluster setting provides the default value and queue setting can override it. 
> {noformat}
>if (maxAllocationMbPerQueue > clusterMax.getMemorySize()
> || maxAllocationVcoresPerQueue > clusterMax.getVirtualCores()) {
>   throw new IllegalArgumentException(
>   "Queue maximum allocation cannot be larger than the cluster setting"
>   + " for queue " + queue
>   + " max allocation per queue: " + result
>   + " cluster setting: " + clusterMax);
> }
> {noformat}
> Let me know if it makes sense.






[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache

2018-11-12 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683946#comment-16683946
 ] 

Thomas Graves commented on YARN-8991:
-

If it's while it's running then you should file this with Spark. It's very similar to https://issues.apache.org/jira/browse/SPARK-17233.

The Spark external shuffle service doesn't support that at this point. The problem is that you may have a Spark executor running on one host that generates some map output data to shuffle and then exits because it's no longer needed. When a reduce starts, it just talks to the YARN NodeManager and the external shuffle service to get the map output. At that point there is no executor left on the node to clean up the shuffle output. Support would have to be added for, e.g., the driver to tell the Spark external shuffle service to clean up.

If you don't use dynamic allocation and the external shuffle service, it should clean up properly.
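
Concretely, the non-dynamic setup this refers to is just (a sketch of spark-defaults.conf):

{noformat}
spark.dynamicAllocation.enabled  false
spark.shuffle.service.enabled    false
{noformat}

With those settings the executors serve their own shuffle data, and the appcache directory is removed when the application finishes.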

> nodemanager not cleaning blockmgr directories inside appcache 
> --
>
> Key: YARN-8991
> URL: https://issues.apache.org/jira/browse/YARN-8991
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Hidayat Teonadi
>Priority: Major
> Attachments: yarn-nm-log.txt
>
>
> Hi, I'm running spark on yarn and have enabled the Spark Shuffle Service. I'm 
> noticing that during the lifetime of my spark streaming application, the nm 
> appcache folder is building up with blockmgr directories (filled with 
> shuffle_*.data).
> Looking into the nm logs, it seems like the blockmgr directories is not part 
> of the cleanup process of the application. Eventually disk will fill up and 
> app will crash. I have both 
> {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and 
> {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think its 
> a configuration issue.
> What is stumping me is the executor ID listed by spark during the external 
> shuffle block registration doesn't match the executor ID listed in yarn's nm 
> log. Maybe this executorID disconnect explains why the cleanup is not done ? 
> I'm assuming that blockmgr directories are supposed to be cleaned up ?
>  
> {noformat}
> 2018-11-05 15:01:21,349 INFO 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered 
> executor AppExecId{appId=application_1541045942679_0193, execId=1299} with 
> ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42],
>  subDirsPerLocalDir=64, 
> shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
>  {noformat}
>  
> seems similar to https://issues.apache.org/jira/browse/YARN-7070, although 
> I'm not sure if the behavior I'm seeing is spark use related.
> [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files]
>  has a stop gap solution of cleaning up via cron.
>  






[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache

2018-11-09 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681525#comment-16681525
 ] 

Thomas Graves commented on YARN-8991:
-

[~teonadi] can you clarify here: are you saying it's not getting cleaned up while the Spark application is still running, or that it's not getting cleaned up after the Spark application finishes?

> nodemanager not cleaning blockmgr directories inside appcache 
> --
>
> Key: YARN-8991
> URL: https://issues.apache.org/jira/browse/YARN-8991
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: Hidayat Teonadi
>Priority: Major
> Attachments: yarn-nm-log.txt
>
>
> Hi, I'm running spark on yarn and have enabled the Spark Shuffle Service. I'm 
> noticing that during the lifetime of my spark streaming application, the nm 
> appcache folder is building up with blockmgr directories (filled with 
> shuffle_*.data).
> Looking into the nm logs, it seems like the blockmgr directories is not part 
> of the cleanup process of the application. Eventually disk will fill up and 
> app will crash. I have both 
> {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and 
> {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think its 
> a configuration issue.
> What is stumping me is the executor ID listed by spark during the external 
> shuffle block registration doesn't match the executor ID listed in yarn's nm 
> log. Maybe this executorID disconnect explains why the cleanup is not done ? 
> I'm assuming that blockmgr directories are supposed to be cleaned up ?
>  
> {noformat}
> 2018-11-05 15:01:21,349 INFO 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered 
> executor AppExecId{appId=application_1541045942679_0193, execId=1299} with 
> ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42],
>  subDirsPerLocalDir=64, 
> shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
>  {noformat}
>  
> seems similar to https://issues.apache.org/jira/browse/YARN-7070, although 
> I'm not sure if the behavior I'm seeing is spark use related.
> [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files]
>  has a stop gap solution of cleaning up via cron.
>  






[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-12 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436366#comment-16436366
 ] 

Thomas Graves commented on YARN-8149:
-

Thinking about this a little more, even with the current preemption on, I don't think preemption is smart enough to keep starvation from happening. If preemption were smart enough to kill enough containers on a reserved node so that the big container actually gets scheduled there, that might be OK, but last time I checked it doesn't do that.

Without that, or some other way to prevent starvation, I wouldn't want to remove this. I think adding a config would be alright, but if anyone finds the behavior useful we can never remove it and it would just be an extra config.

If we have other ideas to simplify this or make it better, great, we should look at them. Or, if there is a way for us to gather stats on whether this is useful, we could add those, run with them, and determine whether we should remove it.

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = re-reservation / (#reserved-container * 
>  (1 - min(requested-resource / max-alloc, 
>   max-alloc - min-alloc / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 
> 0{code}
> I think we should be able to remove the starvation computation, just to check 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  






[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-12 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436295#comment-16436295
 ] 

Thomas Graves commented on YARN-8149:
-

Are you going to do anything with starvation then, or allocate a certain % more than what is required? I am hesitant to remove this without doing some major testing. I haven't had a chance to look at the latest code to investigate.

It might be fine now that we continue looking at other nodes after making a reservation, whereas originally that didn't happen. Is in-queue preemption on by default?

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = re-reservation / (#reserved-container * 
>  (1 - min(requested-resource / max-alloc, 
>   max-alloc - min-alloc / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 
> 0{code}
> I think we should be able to remove the starvation computation, just to check 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  






[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container

2018-02-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374598#comment-16374598
 ] 

Thomas Graves commented on YARN-7935:
-

Thanks for the explanation, Mridul. I'm fine with waiting on the Spark JIRA until you know the scope better. I'm currently not doing anything with bridge mode, so I won't be able to help there at this point.

> Expose container's hostname to applications running within the docker 
> container
> ---
>
> Key: YARN-7935
> URL: https://issues.apache.org/jira/browse/YARN-7935
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-7935.1.patch, YARN-7935.2.patch
>
>
> Some applications have a need to bind to the container's hostname (like 
> Spark) which is different from the NodeManager's hostname(NM_HOST which is 
> available as an env during container launch) when launched through Docker 
> runtime. The container's hostname can be exposed to applications via an env 
> CONTAINER_HOSTNAME. Another potential candidate is the container's IP but 
> this can be addressed in a separate jira.






[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container

2018-02-22 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373039#comment-16373039
 ] 

Thomas Graves commented on YARN-7935:
-

[~mridulm80] what is the Spark JIRA for this? If this goes in, Spark will still have to grab this from the env to pass it in to the ExecutorRunnable.
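
Roughly, the lookup on the Spark side would be something like this (a sketch; CONTAINER_HOSTNAME is the env variable proposed here, NM_HOST is the existing one mentioned in the description):

{code:java}
// Prefer the container's own hostname when running under the Docker runtime,
// fall back to the NodeManager's hostname otherwise.
String containerHost = System.getenv("CONTAINER_HOSTNAME");
String bindHost = (containerHost != null && !containerHost.isEmpty())
    ? containerHost
    : System.getenv("NM_HOST");
{code}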

> Expose container's hostname to applications running within the docker 
> container
> ---
>
> Key: YARN-7935
> URL: https://issues.apache.org/jira/browse/YARN-7935
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-7935.1.patch, YARN-7935.2.patch
>
>
> Some applications have a need to bind to the container's hostname (like 
> Spark) which is different from the NodeManager's hostname(NM_HOST which is 
> available as an env during container launch) when launched through Docker 
> runtime. The container's hostname can be exposed to applications via an env 
> CONTAINER_HOSTNAME. Another potential candidate is the container's IP but 
> this can be addressed in a separate jira.






[jira] [Updated] (YARN-7204) Localizer errors on archive without any files

2017-09-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-7204:

Description: 
If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically while running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: 
https://rm.com:50708/applicationhistory/app/application_1505252418630_25423 
Then click on links to logs of each attempt.
. Failing the application. 


[jira] [Updated] (YARN-7204) Localizer errors on archive without any files

2017-09-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-7204:

Description: 
If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically while running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: 
https://rm.com:50508/applicationhistory/app/application_1505252418630_25423 
Then click on links to logs of each attempt.
. Failing the application. 


[jira] [Created] (YARN-7204) Localizer errors on archive without any files

2017-09-15 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-7204:
---

 Summary: Localizer errors on archive without any files
 Key: YARN-7204
 URL: https://issues.apache.org/jira/browse/YARN-7204
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.1
Reporter: Thomas Graves


If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically while running a Spark job, but it looks generic to the localizer.


 Application application_1505252418630_25423 failed 3 times due to AM Container 
for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt.Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at 
org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: 
https://axonitered-jt1.red.ygrid.yahoo.com:50508/applicationhistory/app/application_1505252418630_25423
 Then click on links to logs of each attempt.
. Failing the application. 






[jira] [Commented] (YARN-5010) maxActiveApplications and maxActiveApplicationsPerUser are missing from REST API

2016-04-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262914#comment-15262914
 ] 

Thomas Graves commented on YARN-5010:
-

We shouldn't just remove them, as it's an API compatibility issue. I would say they should either be added back and the definition updated, or we should rev the REST API version.

> maxActiveApplications and maxActiveApplicationsPerUser are missing from REST 
> API
> 
>
> Key: YARN-5010
> URL: https://issues.apache.org/jira/browse/YARN-5010
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: Jason Lowe
>
> The RM used to report maxActiveApplications and maxActiveApplicationsPerUser 
> in the REST API for a queue, but these are missing in 2.7.0.  It appears 
> YARN-2637 replaced them with aMResourceLimit and userAMResourceLimit, 
> respectively, which broke some internal tools that were expecting the max app 
> fields to still be there.  We should at least update the REST docs to reflect 
> that change.





[jira] [Created] (YARN-4641) CapacityScheduler Active Users Info table should be sortable

2016-01-26 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-4641:
---

 Summary: CapacityScheduler Active Users Info table should be 
sortable
 Key: YARN-4641
 URL: https://issues.apache.org/jira/browse/YARN-4641
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Affects Versions: 2.7.1
Reporter: Thomas Graves


The Scheduler page, when using the Capacity Scheduler, allows you to see all the Active Users Info. If you have lots of users this is a big table, and if you want to see who is using the most it would be nice to have it sortable, or to show the % used like it used to.





[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110778#comment-15110778
 ] 

Thomas Graves commented on YARN-4610:
-

+1 for branch-2.7. After investigating this some more, the original patch of setting it to none() works. The reason is that the parent's limit is passed down and taken into account in the leaf calculation. I think the latter patch is safer, but either is fine with me.

For the master patch, I'm not sure how it's taking the max capacity into account, so I'll have to look at that more, but the unit tests are passing and that would be a separate issue from this fix. +1 on that patch as well.

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610-branch-2.7.002.patch, YARN-4610.001.patch, 
> YARN-4610.branch-2.7.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.





[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108697#comment-15108697
 ] 

Thomas Graves commented on YARN-4610:
-

+1.  Thanks for fixing this. 

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.





[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109330#comment-15109330
 ] 

Thomas Graves commented on YARN-4610:
-

Ok, thanks for investigating. +1 from me, feel free to commit.

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.





[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve

2016-01-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109516#comment-15109516
 ] 

Thomas Graves commented on YARN-4610:
-

Sorry, after looking some more I think there might be an issue with this for parent queue max capacities; still looking.

> Reservations continue looking for one app causes other apps to starve
> -
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-4610.001.patch, YARN-4610.branch-2.7.001.patch
>
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that 
> allows an application to unreserve elsewhere to fulfil a container request on 
> a node that has available space.  However in 2.7 that logic seems to break 
> allocations for subsequent apps in the queue.  Once a user hits its user 
> limit, subsequent apps in the queue for other users receive containers at a 
> significantly reduced rate.





[jira] [Commented] (YARN-4045) Negative avaialbleMB is being reported for root queue.

2015-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14682115#comment-14682115
 ] 

Thomas Graves commented on YARN-4045:
-

I remember seeing that this was fixed in branch-2 by some of the Capacity Scheduler work for labels.

I thought this might be fixed by https://issues.apache.org/jira/browse/YARN-3243, but that is already included.

This might be fixed as part of https://issues.apache.org/jira/browse/YARN-3361, which is probably too big to backport in its entirety.

[~leftnoteasy] Do you remember this issue?

Note that it also shows up in the Capacity Scheduler UI as the root queue going over 100%. I remember when I was testing YARN-3434 it wasn't occurring for me on branch-2 (2.8), and I thought it was one of the above JIRAs that fixed it.

 Negative avaialbleMB is being reported for root queue.
 --

 Key: YARN-4045
 URL: https://issues.apache.org/jira/browse/YARN-4045
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Rushabh S Shah

 We recently deployed 2.7 in one of our cluster.
 We are seeing negative availableMB being reported for queue=root.
 This is from the jmx output:
 {noformat}
 <clusterMetrics>
 ...
 <availableMB>-163328</availableMB>
 ...
 </clusterMetrics>
 {noformat}
 The following is the RM log:
 {noformat}
 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
 absoluteUsedCapacity=1.0029854 used=memory:5332480, vCores:6202 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
 absoluteUsedCapacity=1.0032743 used=memory:5334016, vCores:6212 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
 absoluteUsedCapacity=1.0029854 used=memory:5332480, vCores:6202 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
 absoluteUsedCapacity=1.0032743 used=memory:5334016, vCores:6212 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
 absoluteUsedCapacity=1.0029854 used=memory:5332480, vCores:6202 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
 absoluteUsedCapacity=1.0032743 used=memory:5334016, vCores:6212 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
 absoluteUsedCapacity=1.0029854 used=memory:5332480, vCores:6202 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
 absoluteUsedCapacity=1.0032743 used=memory:5334016, vCores:6212 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
 absoluteUsedCapacity=1.0029854 used=memory:5332480, vCores:6202 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
 absoluteUsedCapacity=1.0032743 used=memory:5334016, vCores:6212 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
 absoluteUsedCapacity=1.0029854 used=memory:5332480, vCores:6202 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
 absoluteUsedCapacity=1.0032743 used=memory:5334016, vCores:6212 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 
 absoluteUsedCapacity=1.0029854 used=memory:5332480, vCores:6202 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO 
 capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 
 absoluteUsedCapacity=1.0032743 used=memory:5334016, vCores:6212 
 cluster=memory:5316608, vCores:28320
 2015-08-10 14:42:43,056 [ResourceManager Event Processor] 

[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-05-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538252#comment-14538252
 ] 

Thomas Graves commented on YARN-3434:
-

What's your question exactly? For branch patches, Jenkins has never been hooked up. We generally download the patch, build it, possibly run the tests that apply, and commit.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 2.8.0

 Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, 
 YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
 YARN-3434.patch, YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.





[jira] [Updated] (YARN-3600) AM container link is broken (on a killed application, at least)

2015-05-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3600:

Labels:   (was: BB2015-05-RFC)

 AM container link is broken (on a killed application, at least)
 ---

 Key: YARN-3600
 URL: https://issues.apache.org/jira/browse/YARN-3600
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Sergey Shelukhin
Assignee: Naganarasimha G R
 Attachments: YARN-3600.20150508-1.patch


 Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. 
 I have an application that ran fine for a while and then I yarn kill-ed it. 
 Now when I go to the only app attempt URL (like so: http://(snip RM host 
 name):8088/cluster/appattempt/appattempt_1429683757595_0795_01)
 I see:
 AM Container: container_1429683757595_0795_01_01
 Node: N/A 
 and the container link goes to {noformat}http://(snip RM host 
 name):8088/cluster/N/A
 {noformat}
 which obviously doesn't work





[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534621#comment-14534621
 ] 

Thomas Graves commented on YARN-3600:
-

reviewing and kicking jenkins.

 AM container link is broken (on a killed application, at least)
 ---

 Key: YARN-3600
 URL: https://issues.apache.org/jira/browse/YARN-3600
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Sergey Shelukhin
Assignee: Naganarasimha G R
 Attachments: YARN-3600.20150508-1.patch


 Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. 
 I have an application that ran fine for a while and then I yarn kill-ed it. 
 Now when I go to the only app attempt URL (like so: http://(snip RM host 
 name):8088/cluster/appattempt/appattempt_1429683757595_0795_01)
 I see:
 AM Container: container_1429683757595_0795_01_01
 Node: N/A 
 and the container link goes to {noformat}http://(snip RM host 
 name):8088/cluster/N/A
 {noformat}
 which obviously doesn't work





[jira] [Commented] (YARN-3603) Application Attempts page confusing

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535013#comment-14535013
 ] 

Thomas Graves commented on YARN-3603:
-

go for it.  Thanks!

 Application Attempts page confusing
 ---

 Key: YARN-3603
 URL: https://issues.apache.org/jira/browse/YARN-3603
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.8.0
Reporter: Thomas Graves
Assignee: Sunil G

 The application attempts page 
 (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)
 is a bit confusing on what is going on.  I think the table of containers 
 there is for only Running containers and when the app is completed or killed 
 its empty.  The table should have a label on it stating so.  
 Also the AM Container field is a link when running but not when its killed. 
  That might be confusing.
 There is no link to the logs in this page but there is in the app attempt 
 table when looking at http://
 rm:8088/cluster/app/application_1431101480046_0003





[jira] [Created] (YARN-3603) Application Attempts page confusing

2015-05-08 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-3603:
---

 Summary: Application Attempts page confusing
 Key: YARN-3603
 URL: https://issues.apache.org/jira/browse/YARN-3603
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.8.0
Reporter: Thomas Graves


The application attempts page 
(http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01)

is a bit confusing as to what is going on. I think the table of containers there is only for running containers, and when the app is completed or killed it's empty. The table should have a label on it stating so.

Also, the AM Container field is a link when the app is running but not when it's killed. That might be confusing.

There is no link to the logs on this page, but there is one in the app attempt table when looking at http://rm:8088/cluster/app/application_1431101480046_0003





[jira] [Commented] (YARN-20) More information for yarn.resourcemanager.webapp.address in yarn-default.xml

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534967#comment-14534967
 ] 

Thomas Graves commented on YARN-20:
---

+1.  Thanks!

 More information for yarn.resourcemanager.webapp.address in yarn-default.xml
 --

 Key: YARN-20
 URL: https://issues.apache.org/jira/browse/YARN-20
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation, resourcemanager
Affects Versions: 2.0.0-alpha
Reporter: Nemon Lou
Assignee: Bartosz Ługowski
Priority: Trivial
  Labels: newbie
 Attachments: YARN-20.1.patch, YARN-20.2.patch, YARN-20.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

   The parameter  yarn.resourcemanager.webapp.address in yarn-default.xml  is 
 in host:port format,which is noted in the cluster set up guide 
 (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html).
   When i read though the code,i find host format is also supported. In 
 host format,the port will be random.
   So we may add more documentation in  yarn-default.xml for easy understood.
   I will submit a patch if it's helpful.
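
For illustration, the two accepted forms would look like this in yarn-site.xml (rm.example.com is a placeholder host):

{code:xml}
<!-- host:port form, as shown in the cluster setup guide -->
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>rm.example.com:8088</value>
</property>

<!-- host-only form is also accepted; the port is then chosen at random -->
<!--
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>rm.example.com</value>
</property>
-->
{code}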





[jira] [Updated] (YARN-20) More information for yarn.resourcemanager.webapp.address in yarn-default.xml

2015-05-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-20:
--
Labels: newbie  (was: BB2015-05-RFC newbie)

 More information for yarn.resourcemanager.webapp.address in yarn-default.xml
 --

 Key: YARN-20
 URL: https://issues.apache.org/jira/browse/YARN-20
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation, resourcemanager
Affects Versions: 2.0.0-alpha
Reporter: Nemon Lou
Assignee: Bartosz Ługowski
Priority: Trivial
  Labels: newbie
 Attachments: YARN-20.1.patch, YARN-20.2.patch, YARN-20.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

   The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is 
 in host:port format, which is noted in the cluster setup guide 
 (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html).
   When I read through the code, I found that a host-only format is also 
 supported. In the host-only format, the port will be random.
   So we may add more documentation to yarn-default.xml to make this easier to 
 understand.
   I will submit a patch if it's helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)

2015-05-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534825#comment-14534825
 ] 

Thomas Graves commented on YARN-3600:
-

So the change does fix the broken link issue, but it seems to me other things 
are broken with this page.  Obviously if the app ran for a while it got an AM and 
therefore should have a valid container.  But I guess that link only works if 
it's actually running?

The container table below that also confused me a bit.  I thought at first it 
was a list of AM containers, but after playing with it, it's really a list of 
running containers.  I think we should add a heading for that.  I filed separate 
JIRAs for those things.

Anyway, +1.  Thanks!



 AM container link is broken (on a killed application, at least)
 ---

 Key: YARN-3600
 URL: https://issues.apache.org/jira/browse/YARN-3600
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Sergey Shelukhin
Assignee: Naganarasimha G R
 Attachments: YARN-3600.20150508-1.patch


 Running some fairly recent (from a couple of weeks ago) version of 
 2.8.0-SNAPSHOT. 
 I have an application that ran fine for a while and then I yarn kill-ed it. 
 Now when I go to the only app attempt URL (like so: http://(snip RM host 
 name):8088/cluster/appattempt/appattempt_1429683757595_0795_01)
 I see:
 AM Container: container_1429683757595_0795_01_01
 Node: N/A 
 and the container link goes to {noformat}http://(snip RM host 
 name):8088/cluster/N/A
 {noformat}
 which obviously doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-05-07 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434-branch2.7.patch

Attaching a patch for branch-2.7.

[~leftnoteasy] could you take a look when you have a chance?

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 2.8.0

 Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, 
 YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
 YARN-3434.patch, YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1631) Container allocation issue in Leafqueue assignContainers()

2015-05-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524042#comment-14524042
 ] 

Thomas Graves commented on YARN-1631:
-

We need to be careful with this.  You could end up starving out the first 
application.  It definitely changes the current semantics.

What version of Hadoop are you seeing this issue on? With my 
reservations-continue-looking patch it should actually look at Node_2, take that 
one, and unreserve Node_1.  There is the needsContainer logic that might 
be affecting this, which I would have to look at more.

 Container allocation issue in Leafqueue assignContainers()
 --

 Key: YARN-1631
 URL: https://issues.apache.org/jira/browse/YARN-1631
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: SuSe 11 Linux 
Reporter: Sunil G
Assignee: Sunil G
 Attachments: Yarn-1631.1.patch, Yarn-1631.2.patch


 Application1 has a demand of 8GB [map task size of 8GB], which is more than 
 Node_1 can handle.
 Node_1 has a size of 8GB, and 2GB is used by Application1's AM.
 Hence Application1 made a reservation for the remaining 6GB on Node_1.
 A new job is submitted with a 2GB AM size and a 2GB task size, with only 2 maps 
 to run.
 Node_2 also has 8GB capability.
 But Application2's AM cannot be launched on Node_2, and Application2 waits 
 longer as only 2 nodes are available in the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-05-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523187#comment-14523187
 ] 

Thomas Graves commented on YARN-3243:
-

Thanks [~leftnoteasy], I'll attempt to merge YARN-3434. If it's not clean I'll 
put up a patch for it.

 CapacityScheduler should pass headroom from parent to children to make sure 
 ParentQueue obey its capacity limits.
 -

 Key: YARN-3243
 URL: https://issues.apache.org/jira/browse/YARN-3243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Fix For: 2.8.0

 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
 YARN-3243.4.patch, YARN-3243.5.patch


 Now CapacityScheduler has some issues making sure ParentQueue always obeys 
 its capacity limits, for example:
 1) When allocating a container of a parent queue, it will only check 
 parentQueue.usage < parentQueue.max. If a leaf queue allocates a container.size 
 > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max 
 resource limit, as in the following example:
 {code}
 A  (usage=54, max=55)
/ \
   A1 A2 (usage=1, max=55)
 (usage=53, max=53)
 {code}
 Queue-A2 is able to allocate a container since its usage < max, but if we do 
 that, A's usage can exceed A.max.
 2) When doing the continuous reservation check, the parent queue will only tell 
 its children "you need to unreserve *some* resource, so that I will be less 
 than my maximum resource", but it will not tell how much resource needs to be 
 unreserved. This may lead to the parent queue exceeding its configured maximum 
 capacity as well.
 With YARN-3099/YARN-3124, now we have the {{ResourceUsage}} class in each class, 
 *here is my proposal*:
 - ParentQueue will set its children's ResourceUsage.headroom, which means the 
 *maximum resource its children can allocate*.
 - ParentQueue will set its children's headroom to (saying the parent's name is 
 qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
 ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
 parent).
 - {{needToUnReserve}} is not necessary; instead, children can get how much 
 resource needs to be unreserved to keep their parent's resource limit.
 - Moreover, with this, YARN-3026 will make a clear boundary between 
 LeafQueue and FiCaSchedulerApp, and headroom will consider user-limit, etc.
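
 A minimal, hedged sketch of the headroom-propagation idea above. Plain longs 
 stand in for Resource objects, and the class and method names are illustrative 
 assumptions, not the actual CapacityScheduler code:
 {code}
 // Hedged sketch, not the actual YARN code: plain longs keep it self-contained.
 public class HeadroomSketch {

   // child headroom = min(qA.headroom, qA.max - qA.used)
   static long childHeadroom(long parentHeadroom, long parentMax, long parentUsed) {
     return Math.min(parentHeadroom, parentMax - parentUsed);
   }

   public static void main(String[] args) {
     long rootHeadroom = 100;        // headroom the root passes to queue A
     long aMax = 55, aUsed = 54;     // queue A from the example above
     long headroomForChildren = childHeadroom(rootHeadroom, aMax, aUsed);
     // Only 1 unit is left under A, so A2 cannot allocate a large container
     // even though A2.used (1) is far below A2.max (55).
     System.out.println("headroom passed to A1/A2 = " + headroomForChildren);
   }
 }
 {code}
 With the example numbers above, the headroom handed to A1/A2 is 1, which is 
 what stops A2 from pushing A past A.max.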



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-05-01 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3243:

Fix Version/s: 2.7.1

 CapacityScheduler should pass headroom from parent to children to make sure 
 ParentQueue obey its capacity limits.
 -

 Key: YARN-3243
 URL: https://issues.apache.org/jira/browse/YARN-3243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Fix For: 2.8.0, 2.7.1

 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
 YARN-3243.4.patch, YARN-3243.5.patch


 Now CapacityScheduler has some issues making sure ParentQueue always obeys 
 its capacity limits, for example:
 1) When allocating a container of a parent queue, it will only check 
 parentQueue.usage < parentQueue.max. If a leaf queue allocates a container.size 
 > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max 
 resource limit, as in the following example:
 {code}
 A  (usage=54, max=55)
/ \
   A1 A2 (usage=1, max=55)
 (usage=53, max=53)
 {code}
 Queue-A2 is able to allocate a container since its usage < max, but if we do 
 that, A's usage can exceed A.max.
 2) When doing the continuous reservation check, the parent queue will only tell 
 its children "you need to unreserve *some* resource, so that I will be less 
 than my maximum resource", but it will not tell how much resource needs to be 
 unreserved. This may lead to the parent queue exceeding its configured maximum 
 capacity as well.
 With YARN-3099/YARN-3124, now we have the {{ResourceUsage}} class in each class, 
 *here is my proposal*:
 - ParentQueue will set its children's ResourceUsage.headroom, which means the 
 *maximum resource its children can allocate*.
 - ParentQueue will set its children's headroom to (saying the parent's name is 
 qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
 ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
 parent).
 - {{needToUnReserve}} is not necessary; instead, children can get how much 
 resource needs to be unreserved to keep their parent's resource limit.
 - Moreover, with this, YARN-3026 will make a clear boundary between 
 LeafQueue and FiCaSchedulerApp, and headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14521580#comment-14521580
 ] 

Thomas Graves commented on YARN-3243:
-

[~leftnoteasy] Can we pull this back into the branch-2.7?  

 CapacityScheduler should pass headroom from parent to children to make sure 
 ParentQueue obey its capacity limits.
 -

 Key: YARN-3243
 URL: https://issues.apache.org/jira/browse/YARN-3243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Fix For: 2.8.0

 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
 YARN-3243.4.patch, YARN-3243.5.patch


 Now CapacityScheduler has some issues making sure ParentQueue always obeys 
 its capacity limits, for example:
 1) When allocating a container of a parent queue, it will only check 
 parentQueue.usage < parentQueue.max. If a leaf queue allocates a container.size 
 > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max 
 resource limit, as in the following example:
 {code}
 A  (usage=54, max=55)
/ \
   A1 A2 (usage=1, max=55)
 (usage=53, max=53)
 {code}
 Queue-A2 is able to allocate a container since its usage < max, but if we do 
 that, A's usage can exceed A.max.
 2) When doing the continuous reservation check, the parent queue will only tell 
 its children "you need to unreserve *some* resource, so that I will be less 
 than my maximum resource", but it will not tell how much resource needs to be 
 unreserved. This may lead to the parent queue exceeding its configured maximum 
 capacity as well.
 With YARN-3099/YARN-3124, now we have the {{ResourceUsage}} class in each class, 
 *here is my proposal*:
 - ParentQueue will set its children's ResourceUsage.headroom, which means the 
 *maximum resource its children can allocate*.
 - ParentQueue will set its children's headroom to (saying the parent's name is 
 qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
 ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
 parent).
 - {{needToUnReserve}} is not necessary; instead, children can get how much 
 resource needs to be unreserved to keep their parent's resource limit.
 - Moreover, with this, YARN-3026 will make a clear boundary between 
 LeafQueue and FiCaSchedulerApp, and headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522021#comment-14522021
 ] 

Thomas Graves commented on YARN-3243:
-

I was wanting to pull YARN-3434 back into 2.7.  It kind of depends on this one. 
At least I think it would merge cleanly if this one was there. 
This is also fixing a bug which I would like to see fixed in the 2.7 line if we 
are going to use it.  It's not a blocker since it exists in our 2.6, but it would 
be nice to have.  If we decide it's too big, then I'll just port YARN-3434 back 
without it.

 CapacityScheduler should pass headroom from parent to children to make sure 
 ParentQueue obey its capacity limits.
 -

 Key: YARN-3243
 URL: https://issues.apache.org/jira/browse/YARN-3243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Fix For: 2.8.0

 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
 YARN-3243.4.patch, YARN-3243.5.patch


 Now CapacityScheduler has some issues making sure ParentQueue always obeys 
 its capacity limits, for example:
 1) When allocating a container of a parent queue, it will only check 
 parentQueue.usage < parentQueue.max. If a leaf queue allocates a container.size 
 > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max 
 resource limit, as in the following example:
 {code}
 A  (usage=54, max=55)
/ \
   A1 A2 (usage=1, max=55)
 (usage=53, max=53)
 {code}
 Queue-A2 is able to allocate a container since its usage < max, but if we do 
 that, A's usage can exceed A.max.
 2) When doing the continuous reservation check, the parent queue will only tell 
 its children "you need to unreserve *some* resource, so that I will be less 
 than my maximum resource", but it will not tell how much resource needs to be 
 unreserved. This may lead to the parent queue exceeding its configured maximum 
 capacity as well.
 With YARN-3099/YARN-3124, now we have the {{ResourceUsage}} class in each class, 
 *here is my proposal*:
 - ParentQueue will set its children's ResourceUsage.headroom, which means the 
 *maximum resource its children can allocate*.
 - ParentQueue will set its children's headroom to (saying the parent's name is 
 qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
 ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
 parent).
 - {{needToUnReserve}} is not necessary; instead, children can get how much 
 resource needs to be unreserved to keep their parent's resource limit.
 - Moreover, with this, YARN-3026 will make a clear boundary between 
 LeafQueue and FiCaSchedulerApp, and headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522066#comment-14522066
 ] 

Thomas Graves commented on YARN-3243:
-

It might not merge completely cleanly, but it wouldn't require it for 
functionality.   It would be nice to have this in 2.7 either way, though.

 CapacityScheduler should pass headroom from parent to children to make sure 
 ParentQueue obey its capacity limits.
 -

 Key: YARN-3243
 URL: https://issues.apache.org/jira/browse/YARN-3243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Fix For: 2.8.0

 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
 YARN-3243.4.patch, YARN-3243.5.patch


 Now CapacityScheduler has some issues making sure ParentQueue always obeys 
 its capacity limits, for example:
 1) When allocating a container of a parent queue, it will only check 
 parentQueue.usage < parentQueue.max. If a leaf queue allocates a container.size 
 > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max 
 resource limit, as in the following example:
 {code}
 A  (usage=54, max=55)
/ \
   A1 A2 (usage=1, max=55)
 (usage=53, max=53)
 {code}
 Queue-A2 is able to allocate a container since its usage < max, but if we do 
 that, A's usage can exceed A.max.
 2) When doing the continuous reservation check, the parent queue will only tell 
 its children "you need to unreserve *some* resource, so that I will be less 
 than my maximum resource", but it will not tell how much resource needs to be 
 unreserved. This may lead to the parent queue exceeding its configured maximum 
 capacity as well.
 With YARN-3099/YARN-3124, now we have the {{ResourceUsage}} class in each class, 
 *here is my proposal*:
 - ParentQueue will set its children's ResourceUsage.headroom, which means the 
 *maximum resource its children can allocate*.
 - ParentQueue will set its children's headroom to (saying the parent's name is 
 qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
 ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
 parent).
 - {{needToUnReserve}} is not necessary; instead, children can get how much 
 resource needs to be unreserved to keep their parent's resource limit.
 - Moreover, with this, YARN-3026 will make a clear boundary between 
 LeafQueue and FiCaSchedulerApp, and headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.

2015-04-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522066#comment-14522066
 ] 

Thomas Graves edited comment on YARN-3243 at 4/30/15 7:02 PM:
--

It might not merge completely cleanly, but it wouldn't require it for 
functionality.   It would be nice to have this in 2.7 either way, though.

I'll try it out later and see.


was (Author: tgraves):
It might not merge completely cleanly, but it wouldn't require it for 
functionality.   It would be nice to have this in 2.7 either way, though.

 CapacityScheduler should pass headroom from parent to children to make sure 
 ParentQueue obey its capacity limits.
 -

 Key: YARN-3243
 URL: https://issues.apache.org/jira/browse/YARN-3243
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Fix For: 2.8.0

 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, 
 YARN-3243.4.patch, YARN-3243.5.patch


 Now CapacityScheduler has some issues making sure ParentQueue always obeys 
 its capacity limits, for example:
 1) When allocating a container of a parent queue, it will only check 
 parentQueue.usage < parentQueue.max. If a leaf queue allocates a container.size 
 > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max 
 resource limit, as in the following example:
 {code}
 A  (usage=54, max=55)
/ \
   A1 A2 (usage=1, max=55)
 (usage=53, max=53)
 {code}
 Queue-A2 is able to allocate a container since its usage < max, but if we do 
 that, A's usage can exceed A.max.
 2) When doing the continuous reservation check, the parent queue will only tell 
 its children "you need to unreserve *some* resource, so that I will be less 
 than my maximum resource", but it will not tell how much resource needs to be 
 unreserved. This may lead to the parent queue exceeding its configured maximum 
 capacity as well.
 With YARN-3099/YARN-3124, now we have the {{ResourceUsage}} class in each class, 
 *here is my proposal*:
 - ParentQueue will set its children's ResourceUsage.headroom, which means the 
 *maximum resource its children can allocate*.
 - ParentQueue will set its children's headroom to (saying the parent's name is 
 qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's 
 ancestors' capacity will be enforced as well (qA.headroom is set by qA's 
 parent).
 - {{needToUnReserve}} is not necessary; instead, children can get how much 
 resource needs to be unreserved to keep their parent's resource limit.
 - Moreover, with this, YARN-3026 will make a clear boundary between 
 LeafQueue and FiCaSchedulerApp, and headroom will consider user-limit, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520252#comment-14520252
 ] 

Thomas Graves commented on YARN-3517:
-

Changes look good, +1.   Thanks [~vvasudev].

 RM web ui for dumping scheduler logs should be for admins only
 --

 Key: YARN-3517
 URL: https://issues.apache.org/jira/browse/YARN-3517
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, security
Reporter: Varun Vasudev
Assignee: Thomas Graves
Priority: Blocker
  Labels: security
 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
 YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, 
 YARN-3517.006.patch


 YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
 for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520355#comment-14520355
 ] 

Thomas Graves commented on YARN-3517:
-

thanks [~vinodkv] I missed that.

 RM web ui for dumping scheduler logs should be for admins only
 --

 Key: YARN-3517
 URL: https://issues.apache.org/jira/browse/YARN-3517
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, security
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Priority: Blocker
  Labels: security
 Fix For: 2.8.0

 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
 YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, 
 YARN-3517.006.patch


 YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
 for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518054#comment-14518054
 ] 

Thomas Graves commented on YARN-3517:
-

In RMWebServices.java we don't need the isSecurityEnabled check.  Just remove 
the entire check.  My reasoning is that the logLevel app does not do those 
checks; it simply makes sure you are an admin.

+if (UserGroupInformation.isSecurityEnabled() && callerUGI == null) {
+  String msg = "Unable to obtain user name, user not authenticated";
+  throw new AuthorizationException(msg);
+}

In the test TestRMWebServices.java we aren't actually asserting anything.  We 
should assert that the expected files exist.  Personally I would also like to 
see an assert that the expected exception occurred.
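
A hedged sketch of the kind of test assertions being asked for here (JUnit 4 
style; the dump file name and the exception handling are assumptions for 
illustration, not taken from the actual patch):
{code}
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;

import java.io.File;

public class DumpSchedulerLogsAssertSketch {

  // Assert that the expected dump file actually exists after the call.
  static void verifyDumpFileExists(File logDir) {
    File dump = new File(logDir, "yarn-scheduler-debug.log"); // assumed name
    assertTrue("scheduler debug log was not created", dump.exists());
  }

  // Assert that the expected failure occurred for a non-admin caller.
  static void verifyNonAdminRejected(Runnable dumpAsNonAdmin) {
    try {
      dumpAsNonAdmin.run();
      fail("expected an authorization failure for a non-admin caller");
    } catch (RuntimeException expected) {
      // expected path: the request was rejected
    }
  }
}
{code}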

 RM web ui for dumping scheduler logs should be for admins only
 --

 Key: YARN-3517
 URL: https://issues.apache.org/jira/browse/YARN-3517
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, security
Reporter: Varun Vasudev
Assignee: Thomas Graves
Priority: Blocker
  Labels: security
 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
 YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch


 YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
 for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned YARN-3517:
---

Assignee: Thomas Graves  (was: Varun Vasudev)

 RM web ui for dumping scheduler logs should be for admins only
 --

 Key: YARN-3517
 URL: https://issues.apache.org/jira/browse/YARN-3517
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, security
Reporter: Varun Vasudev
Assignee: Thomas Graves
Priority: Blocker
  Labels: security
 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
 YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch


 YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
 for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509152#comment-14509152
 ] 

Thomas Graves commented on YARN-3517:
-


+  // non-secure mode with no acls enabled
+  if (!isAdmin && !UserGroupInformation.isSecurityEnabled()
+      && !adminACLsManager.areACLsEnabled()) {
+    isAdmin = true;
+  }
+

We don't need the isSecurityEnabled check, just keep the one for 
areACLsEnabled. This could be combined with the previous if (make it the else-if 
part), but that isn't a big deal.

In QueuesBlock we are creating the AdminACLsManager on every web page load.   
Perhaps a better way would be to use this.rm.getApplicationACLsManager() 
and extend the ApplicationACLsManager to expose an isAdmin functionality.
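
A hedged sketch of that suggestion, i.e. a reusable isAdmin() check instead of 
constructing a new AdminACLsManager per page load. The class, constructor, and 
method below are illustrative assumptions, not the real ApplicationACLsManager 
API:
{code}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

public class AdminCheckSketch {

  private final boolean aclsEnabled;
  private final AccessControlList adminAcl;

  public AdminCheckSketch(boolean aclsEnabled, AccessControlList adminAcl) {
    this.aclsEnabled = aclsEnabled;
    this.adminAcl = adminAcl;
  }

  /** True if the caller should be treated as an admin for web UI actions. */
  public boolean isAdmin(UserGroupInformation callerUGI) {
    if (!aclsEnabled) {
      return true; // ACLs disabled: everyone is effectively an admin
    }
    return callerUGI != null && adminAcl.isUserAllowed(callerUGI);
  }
}
{code}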

 RM web ui for dumping scheduler logs should be for admins only
 --

 Key: YARN-3517
 URL: https://issues.apache.org/jira/browse/YARN-3517
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, security
Affects Versions: 2.7.0
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Priority: Blocker
  Labels: security
 Attachments: YARN-3517.001.patch, YARN-3517.002.patch, 
 YARN-3517.003.patch


 YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
 for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Upmerged patch to latest 

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
 YARN-3434.patch, YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Fixed the line length and the whitespace style issues.  Other than that, I 
moved things around and it's just complaining about the same things more.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
 YARN-3434.patch, YARN-3434.patch, YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Attaching the exact same patch to kick Jenkins again.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
 YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-21 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

updated based on review comments

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, 
 YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only

2015-04-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504954#comment-14504954
 ] 

Thomas Graves commented on YARN-3517:
-

Thanks for following up on this.  Could you also change it to not show the 
button if you aren't an admin?  I don't want to confuse users by having a 
button there that doesn't do anything.

One other thing: could you add some CSS or something to make it look more 
like a button?  Right now it just looks like text and I didn't know it was 
clickable at first.   The placement of it seems a bit weird to me also, but as 
long as it's only showing up for admins that is less of an issue.

I haven't looked at the patch in detail, but I see we are creating a new 
AdminACLsManager each time. It would be nice if we didn't have to do that.

 RM web ui for dumping scheduler logs should be for admins only
 --

 Key: YARN-3517
 URL: https://issues.apache.org/jira/browse/YARN-3517
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, security
Affects Versions: 2.7.0
Reporter: Varun Vasudev
Assignee: Varun Vasudev
  Labels: security
 Attachments: YARN-3517.001.patch


 YARN-3294 allows users to dump scheduler logs from the web UI. This should be 
 for admins only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period

2015-04-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503760#comment-14503760
 ] 

Thomas Graves commented on YARN-3294:
-

[~xgong] [~vvasudev]  I saw this show up in the UI on branch-2.  I don't see 
any permission checks on this; am I perhaps missing them?  We don't want 
arbitrary users to be able to change the log level on the RM.  They could slow it 
down and cause disks to fill up.

I also don't see an option to disable this; is there one?  If not, I think we 
want it.   

Honestly I don't really see a need for this button at all, as you can change the 
level in the logLevel app.  But since it's in, we at least need to protect it 
and, in my opinion, disable it for normal users.

 Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time 
 period
 -

 Key: YARN-3294
 URL: https://issues.apache.org/jira/browse/YARN-3294
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Fix For: 2.8.0

 Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, 
 apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, 
 apache-yarn-3294.3.patch, apache-yarn-3294.4.patch


 It would be nice to have a button on the web UI that would allow dumping of 
 debug logs for just the capacity scheduler for a fixed period of time(1 min, 
 5 min or so) in a separate log file. It would be useful when debugging 
 scheduler behavior without affecting the rest of the resourcemanager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Upmerged to latest

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

Updated patch with review comments.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch, YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499803#comment-14499803
 ] 

Thomas Graves commented on YARN-3434:
-

Ok, I'll make the changes and post an updated patch

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496239#comment-14496239
 ] 

Thomas Graves commented on YARN-3434:
-

So I had considered putting it in ResourceLimits, but ResourceLimits seems 
to be more of a queue-level thing to me (not a user-level one). For instance, 
parentQueue passes it into leafQueue, and ParentQueue cares nothing about user 
limits.  If you stored it there you would either need to track which user it was 
for or track it for all users. ResourceLimits gets updated when nodes are added 
and removed, and we don't need to compute a particular user limit when that 
happens.  So it would then be out of date, or we would have to change it to 
update when that happens, but that to me is a fairly large change and not really 
needed.

The user limit calculations are lower down and are recomputed per user, per 
application, and per current request regularly, so putting this into the global 
object, based on how it is calculated and used, didn't make sense to me. All you 
would be using it for is passing it down to assignContainer, and then it would be 
out of date.  If someone else started looking at that value assuming it was up to 
date, it would be wrong (unless of course we started updating it as stated 
above).  But it would only be for a single user, not all users, unless again we 
changed it to calculate for every user whenever something changed. That seems a 
bit excessive.

You are correct that needToUnreserve could go away.  I started out on 2.6, which 
didn't have our changes, and I could have removed it when I added 
amountNeededUnreserve.  If we were to store it in the global ResourceLimits then 
yes, the entire LimitsInfo can go away, including shouldContinue, as you would 
fall back to using the boolean return from each function.   But again, based on 
my comments above, I'm not sure ResourceLimits is the correct place to put this.

I just noticed that we are already keeping the userLimit in the User class; 
that would be another option.  But again, I think we need to make it clear 
what it is. This particular check is done per application, per user, based on 
the currently requested Resource.  The stored value wouldn't necessarily 
apply to all the user's applications since the resource request size could be 
different.  

Thoughts, or is there something I'm missing about ResourceLimits?
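
For illustration, a hedged sketch of the design being argued for here: the 
result of the user-limit check travels as a parameter from assignContainers to 
assignContainer instead of living on a shared ResourceLimits object. The names 
LimitsInfo and amountNeededUnreserve come from this discussion, but the bodies 
are simplified assumptions, not the actual patch:
{code}
public class LimitsInfoSketch {

  /** Transient result of the per-user limit check for the current request. */
  static final class LimitsInfo {
    final boolean shouldContinue;       // may we keep trying this request?
    final long amountNeededUnreserve;   // how much must be unreserved first
    LimitsInfo(boolean shouldContinue, long amountNeededUnreserve) {
      this.shouldContinue = shouldContinue;
      this.amountNeededUnreserve = amountNeededUnreserve;
    }
  }

  static void assignContainers(long requested) {
    // computed fresh for this user/application/request, then handed down
    LimitsInfo limits = canAssignToUser(requested);
    if (limits.shouldContinue) {
      assignContainer(requested, limits);
    }
  }

  static LimitsInfo canAssignToUser(long requested) {
    // placeholder for the real user-limit math discussed above
    return new LimitsInfo(true, 0);
  }

  static void assignContainer(long requested, LimitsInfo limits) {
    if (limits.amountNeededUnreserve > 0) {
      // must release that much reserved space before actually allocating
    }
  }
}
{code}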

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496735#comment-14496735
 ] 

Thomas Graves commented on YARN-3434:
-

I am not saying the child needs to know how the parent calculates its resource 
limit.  I am saying that the user limit, and whether it needs to unreserve to 
make another reservation, has nothing to do with the parent queue (i.e. it 
doesn't apply to the parent queue).  Remember I don't need to store the user 
limit; I need to store whether it needs to unreserve and, if it does, how much 
it needs to unreserve.

When a node heartbeats, it goes through the regular assignments and updates the 
leafQueue clusterResources based on what the parent passes in. When a node is 
removed or added, it updates the resource limits (none of these apply to the 
calculation of whether it needs to unreserve or not). 

Basically it comes down to: is this information useful outside of the small 
window between when it is calculated and when it's needed in assignContainer()? 
My thought is no.  And you said it yourself in the last bullet above.  Although 
we have been referring to the userLimit, and perhaps that is the problem: I 
don't need to store the userLimit, I need to store whether it needs to 
unreserve and, if so, how much.  Therefore it fits better as a local transient 
variable rather than a globally stored one.  If you store just the userLimit 
then you need to recalculate stuff, which I'm trying to avoid.

I understand why we are storing the current information in ResourceLimits, 
because it has to do with headroom and parent limits and is recalculated at 
various points, but the current implementation in canAssignToUser doesn't use 
headroom at all, and whether we need to unreserve or not on the last call to 
assignContainers doesn't affect the headroom calculation.

Again, basically all we would be doing is placing an extra global variable (or 
variables) in the ResourceLimits class just to pass it on down a couple of 
functions. That to me is a parameter.   Now if we had multiple things needing 
this or updating it, then to me it fits better in ResourceLimits.



 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497055#comment-14497055
 ] 

Thomas Graves commented on YARN-3434:
-

I agree with the "Both" section.  I'm not sure I completely follow the "Only" 
section.  Are you suggesting we change the patch to modify ResourceLimits and 
pass it down rather than using the LimitsInfo class?  If so, that won't work, at 
least not without adding the shouldContinue flag to it.  Unless you mean keep the 
LimitsInfo class for use locally in assignContainers and then pass ResourceLimits 
down to assignContainer with the value of amountNeededUnreserve as the limit.  
That wouldn't really change much except the object we pass down through the 
functions. 

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497076#comment-14497076
 ] 

Thomas Graves commented on YARN-3434:
-

So you are saying add amountNeededUnreserve to ResourceLimits and then set the 
global currentResourceLimits.amountNeededUnreserve inside of canAssignToUser?  
This is what I was not in favor of above, and there would be no need to pass it 
down as a parameter.

Or were you saying create a ResourceLimits instance and pass it as a parameter to 
canAssignToUser and canAssignToThisQueue and modify that instance?  That 
instance would then be passed down through to assignContainer()?

I don't see how else you would set the ResourceLimits.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488011#comment-14488011
 ] 

Thomas Graves commented on YARN-3434:
-

The code you mention is in the else part of that check, where it would do a 
reservation.  The situation I'm talking about actually allocates a container 
rather than reserving one.  I'll try to explain better:

The application asks for lots of containers. It acquires some containers, then it 
reserves some. At this point it hits its normal user limit, which in my example 
= capacity.  It hasn't hit the max amount it can allocate or reserve 
(shouldAllocOrReserveNewContainer()).  The next node that heartbeats in isn't 
yet reserved and has enough space for it to place a container on.  It is first 
checked in assignContainers -> canAssignToThisQueue.  That passes since we 
haven't hit max capacity. Then it checks assignContainers -> canAssignToUser. 
That passes, but only because used - reserved < the user limit.  This allows it 
to continue down into assignContainer.  In assignContainer the node has 
available space and we haven't hit shouldAllocOrReserveNewContainer(). 
reservationsContinueLooking is on and labels are empty, so it does the check:

{noformat}
if (!shouldAllocOrReserveNewContainer
    || Resources.greaterThan(resourceCalculator, clusterResource,
        minimumUnreservedResource, Resources.none()))
{noformat}

As I said before, it's allowed to allocate or reserve, so it passes that test.  
Then it hasn't met its maximum capacity yet (capacity = 30% and max capacity = 
100%), so minimumUnreservedResource is none and that check doesn't kick in, so 
it doesn't go into the block to findNodeToUnreserve().   Then it goes ahead and 
allocates when it should have needed to unreserve.  Basically we needed to also 
do the user limit check again and force it to do the findNodeToUnreserve.
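
A hedged, simplified sketch of the re-check described above; the method names 
echo the discussion (canAssignToUser-style math, findNodeToUnreserve), but the 
logic is an illustration under those assumptions, not the real LeafQueue code:
{code}
public class UserLimitRecheckSketch {

  static boolean assignContainer(long userUsed, long userReserved,
      long userLimit, long requested, boolean reservationsContinueLooking) {
    if (userUsed + requested <= userLimit) {
      return true;                 // fits outright, safe to allocate
    }
    if (reservationsContinueLooking
        && userUsed - userReserved + requested <= userLimit) {
      // We only got past canAssignToUser because reservations were subtracted,
      // so a reservation must be released before allocating the new container.
      return findNodeToUnreserve();
    }
    return false;                  // would exceed the user limit
  }

  static boolean findNodeToUnreserve() {
    // placeholder: the real code walks the user's reservations and frees one
    return true;
  }
}
{code}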




 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487416#comment-14487416
 ] 

Thomas Graves commented on YARN-3434:
-

[~wangda]  I'm not sure I follow what you are saying.  The reservations are 
already counted in the user's usage, and we do consider reserved resources when 
doing the user limit calculations.   Look at LeafQueue.assignContainers: its 
call to allocateResource is where it ends up adding to the user's usage.  
canAssignToUser is where it does the user limit check and subtracts the 
reservations off to see if it can continue.  

Note I do think we should just get rid of the config for 
reservationsContinueLooking, but that is a separate issue.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488061#comment-14488061
 ] 

Thomas Graves commented on YARN-3434:
-

{quote}
And I've a question about continous reservation checking behavior, may or may 
not related to this issue: Now it will try to unreserve all containers under a 
user, but actually it will only unreserve at most one container to allocate a 
new container. Do you think is it fine to change the logic to be:
When (continousReservation-enabled) && (user.usage + required - 
min(max-allocation, user.total-reserved) <= user.limit), assignContainers will 
continue. This will prevent doing impossible allocation when user reserved lots 
of containers. (As same as queue reservation checking).
{quote}

I do think the reservation checking and unreserving can be improved.  I 
basically started with a very simple thing and figured we could improve it.  I'm 
not sure how much that check would help in practice.  I guess it might help the 
cases where you have 1 user in the queue and a second one shows up and your 
user limit gets decreased by a lot.  In that case it may prevent it from 
continuing when it can short-circuit here.  So it would seem to be OK for that. 
 


 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798
 ] 

Thomas Graves commented on YARN-3434:
-

[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it 
doesn't solve the user limit side of it.   The user limit check is never done 
again.  I'll have a patch up for this shortly; I would appreciate it if you 
could take a look and give me feedback.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves

 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, each 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798
 ] 

Thomas Graves edited comment on YARN-3434 at 4/8/15 6:59 PM:
-

[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it 
doesn't solve the user limit side of it.  The user limit check is never done 
again in assignContainer() if it skipped the checks in assignContainers() based 
on reservations but is then allowed through shouldAllocOrReserveNewContainer().  I'll 
have a patch up for this shortly; I would appreciate it if you could take a look 
and give me feedback.


was (Author: tgraves):
[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it 
doesn't solve the user limit side of it.  The user limit check is never done 
again.  I'll have a patch up for this shortly; I would appreciate it if you 
could take a look and give me feedback.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves

 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-3434:

Attachment: YARN-3434.patch

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485834#comment-14485834
 ] 

Thomas Graves commented on YARN-3434:
-

Note I had a reproducible test case for this.  Set userlimit% to 100% and the user 
limit factor to 1.  15 nodes, 20GB each.  One queue configured for capacity 70, 
the 2nd queue configured for capacity 30.
In the first queue I started a sleep job needing 10 - 12GB containers.  I then 
started a second job in the 2nd queue that needed 25 12GB containers; the second 
job got some containers but then had to reserve others while waiting for the 
first job to release some.

Without this change, when the first job started releasing containers the second 
job would grab them and go over the user limit.  With this fix it stayed within 
the user limit.

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-3434.patch


 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-02 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-3434:
---

 Summary: Interaction between reservations and userlimit can result 
in significant ULF violation
 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves


ULF was set to 1.0
User was able to consume 1.4X queue capacity.
It looks like when this application launched, it reserved about 1000 
containers, 8G each, within about 5 seconds. I think this allowed the 
logic in assignToUser() to allow the userlimit to be surpassed.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation

2015-04-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392751#comment-14392751
 ] 

Thomas Graves commented on YARN-3434:
-

The issue here is that if we allow the user to continue past the user limit 
checks in assignContainers because they have reservations, then when it gets down 
into the assignContainer routine and it's allowed to get a container and the 
node has space, we don't double check the user limit.  We recheck 
it in all other cases, but this one is missed.
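
Roughly, the missing double check amounts to something like the following guard right before committing a real allocation (illustrative names only, not the actual CapacityScheduler code):

class UserLimitRecheckSketch {
  // Re-check the user limit immediately before handing out a real container,
  // even when the earlier checks in assignContainers were skipped because the
  // user holds reservations.  (Sketch only.)
  static boolean canAllocate(long userUsedMb, long containerMb, long userLimitMb) {
    return userUsedMb + containerMb <= userLimitMb;
  }
}

If this returns false, the request should be reserved or skipped rather than allocated.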

 Interaction between reservations and userlimit can result in significant ULF 
 violation
 --

 Key: YARN-3434
 URL: https://issues.apache.org/jira/browse/YARN-3434
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves

 ULF was set to 1.0
 User was able to consume 1.4X queue capacity.
 It looks like when this application launched, it reserved about 1000 
 containers, 8G each, within about 5 seconds. I think this allowed the 
 logic in assignToUser() to allow the userlimit to be surpassed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-04-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392687#comment-14392687
 ] 

Thomas Graves commented on YARN-3432:
-

That will fix it for the capacity scheduler; we need to see if that breaks the 
FairScheduler though.



 Cluster metrics have wrong Total Memory when there is reserved memory on CS
 ---

 Key: YARN-3432
 URL: https://issues.apache.org/jira/browse/YARN-3432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Brahma Reddy Battula

 I noticed that when reservations happen while using the Capacity Scheduler, 
 the UI and web services report the wrong total memory.
 For example, I have 300GB of total memory in my cluster.  I allocate 50GB 
 and I reserve 10GB.  The cluster metrics for total memory get reported as 290GB.
 This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps 
 there is a difference between the fair scheduler and capacity scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS

2015-04-01 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-3432:
---

 Summary: Cluster metrics have wrong Total Memory when there is 
reserved memory on CS
 Key: YARN-3432
 URL: https://issues.apache.org/jira/browse/YARN-3432
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.0
Reporter: Thomas Graves


I noticed that when reservations happen while using the Capacity Scheduler, the 
UI and web services report the wrong total memory.

For example, I have 300GB of total memory in my cluster.  I allocate 50GB and 
I reserve 10GB.  The cluster metrics for total memory get reported as 290GB.

This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps 
there is a difference between the fair scheduler and capacity scheduler.
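
A quick arithmetic check of the numbers above, as a sketch only; it assumes, as the 290GB report suggests, that the Capacity Scheduler already subtracts both allocated and reserved memory from availableMB:

class TotalMemorySketch {
  public static void main(String[] args) {
    // Assumed accounting (matches the 290GB report above): availableMB already
    // excludes allocated and reserved memory.
    long capacityGb = 300, allocatedGb = 50, reservedGb = 10;
    long availableGb = capacityGb - allocatedGb - reservedGb;    // 240
    System.out.println(availableGb + allocatedGb);               // 290: what the UI reports
    System.out.println(availableGb + allocatedGb + reservedGb);  // 300: the real cluster total
  }
}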



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-656) In scheduler UI, including reserved memory in Memory Total can make it exceed cluster capacity.

2015-04-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391472#comment-14391472
 ] 

Thomas Graves commented on YARN-656:


Note this broke the UI, at least for the capacity scheduler.

It now displays a total that is missing the reserved memory.  Perhaps this is a 
difference in how the fair scheduler and capacity scheduler keep track of allocated 
vs reserved resources.

 In scheduler UI, including reserved memory in Memory Total can make it 
 exceed cluster capacity.
 -

 Key: YARN-656
 URL: https://issues.apache.org/jira/browse/YARN-656
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 2.1.0-beta

 Attachments: YARN-656-1.patch, YARN-656.patch


 Memory Total is currently a sum of availableMB, allocatedMB, and 
 reservedMB.  Including reservedMB in this sum can make the total exceed the 
 capacity of the cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1582) Capacity Scheduler: add a maximum-allocation-mb setting per queue

2015-01-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297753#comment-14297753
 ] 

Thomas Graves commented on YARN-1582:
-

+1 looks good. Thanks Jason. Feel free to commit.

 Capacity Scheduler: add a maximum-allocation-mb setting per queue 
 --

 Key: YARN-1582
 URL: https://issues.apache.org/jira/browse/YARN-1582
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 3.0.0, 0.23.10, 2.2.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1582-branch-0.23.patch, YARN-1582.002.patch, 
 YARN-1582.003.patch


 We want to allow certain queues to use larger container sizes while limiting 
 other queues to smaller container sizes.  Setting it per queue will help 
 prevent abuse, help limit the impact of reservations, and allow changes in 
 the maximum container size to be rolled out more easily.
 One reason this is needed is that more application types are becoming available on 
 YARN and certain applications require more memory to run efficiently. While 
 we want to allow for that, we don't want other applications to abuse it and 
 start requesting bigger containers than what they really need.
 Note that we could have this based on application type, but that might not be 
 totally accurate either since for example you might want to allow certain 
 users on MapReduce to use larger containers, while limiting other users of 
 MapReduce to smaller containers.
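
As a hedged illustration of the setting being proposed (assuming the per-queue property ends up named yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb; the queue names here are made up), capacity-scheduler.xml could look like:

<!-- Hedged example only; the per-queue property name and queues are assumed. -->
<property>
  <!-- queue allowed to hand out big containers -->
  <name>yarn.scheduler.capacity.root.large.maximum-allocation-mb</name>
  <value>16384</value>
</property>
<property>
  <!-- everyone else capped at smaller containers -->
  <name>yarn.scheduler.capacity.root.default.maximum-allocation-mb</name>
  <value>4096</value>
</property>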



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2828) Enable auto refresh of web pages (using http parameter)

2014-11-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202033#comment-14202033
 ] 

Thomas Graves commented on YARN-2828:
-

auto refresh was removed because some pages load a lot of data and you actually 
may not want it to update.  It can make debugging harder if you are looking at 
a lot of data and the screen keeps refreshing on you.

I think the only way to bring it back is to make it optional.
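
A minimal sketch of the "make it optional" idea using the plain servlet API (not the actual YARN web framework; the parameter handling here is illustrative):

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

class OptionalRefreshSketch {
  // Only emit an auto-refresh header when the caller explicitly asks for it
  // via ?refresh=N, so pages stay static by default.  (Illustrative only.)
  static void maybeSetRefresh(HttpServletRequest req, HttpServletResponse resp) {
    String refresh = req.getParameter("refresh");
    if (refresh == null) {
      return;                                   // default: no auto refresh
    }
    try {
      int seconds = Integer.parseInt(refresh);
      if (seconds > 0) {
        resp.setHeader("Refresh", Integer.toString(seconds));
      }
    } catch (NumberFormatException ignored) {
      // bad value: silently fall back to no auto refresh
    }
  }
}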

 Enable auto refresh of web pages (using http parameter)
 ---

 Key: YARN-2828
 URL: https://issues.apache.org/jira/browse/YARN-2828
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Tim Robertson
Priority: Minor

 The MR1 Job Tracker had a useful HTTP parameter of e.g. refresh=3 that 
 could be appended to URLs which enabled a page reload.  This was very useful 
 when developing mapreduce jobs, especially to watch counters changing.  This 
 is lost in the YARN interface.
 Could be implemented as a page element (e.g. drop down or so), but I'd 
 recommend that the page not be more cluttered, and simply bring back the 
 optional refresh HTTP param.  It worked really nicely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-443) allow OS scheduling priority of NM to be different than the containers it launches

2014-10-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185578#comment-14185578
 ] 

Thomas Graves commented on YARN-443:


Can you be more specific about what is different and why it is a problem? 
The trunk patch shows that there was an existing getRunCommand() routine 
(before this change) whereas the other didn't have one before (it looks like 
it was for Windows support).

 allow OS scheduling priority of NM to be different than the containers it 
 launches
 --

 Key: YARN-443
 URL: https://issues.apache.org/jira/browse/YARN-443
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.3-alpha, 0.23.6
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 0.23.7, 2.0.4-alpha

 Attachments: YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, 
 YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, 
 YARN-443-branch-2.patch, YARN-443-branch-2.patch, YARN-443-branch-2.patch, 
 YARN-443.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch, 
 YARN-443.patch, YARN-443.patch, YARN-443.patch


 It would be nice if we could have the nodemanager run at a different OS 
 scheduling priority than the containers so that you can still communicate 
 with the nodemanager if the containers are out of control.
 On linux we could launch the nodemanager at a higher priority, but then all 
 the containers it launches would also be at that higher priority, so we need 
 a way for the container executor to launch them at a lower priority.
 I'm not sure how this applies to windows if at all.
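
For illustration, a minimal sketch of the Linux side of that idea (not the real ContainerExecutor; the class and parameter names are made up): the NM keeps its own priority and each container launch command gets prefixed with nice so the child runs lower.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class NicePrefixLaunchSketch {
  // Launch a container command at a lower OS priority than the caller by
  // prefixing it with "nice -n <delta>" (positive delta = lower priority).
  // Illustrative sketch only.
  static Process launch(List<String> containerCmd, int nicenessDelta) throws IOException {
    List<String> cmd = new ArrayList<>();
    cmd.add("nice");
    cmd.add("-n");
    cmd.add(Integer.toString(nicenessDelta));
    cmd.addAll(containerCmd);
    return new ProcessBuilder(cmd).inheritIO().start();
  }
}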



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149565#comment-14149565
 ] 

Thomas Graves commented on YARN-1769:
-

Thanks for the review Jason. I'll update the patch and remove some of the 
logging or make it truly debug.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 
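
A simplified, self-contained sketch of the improvement described above (not the real CapacityScheduler code; the types and numbers are made up): keep looking at heartbeating nodes and trade an existing reservation for a real allocation when a node has enough room.

class ReservationSwapSketch {
  static final class Node { long freeMb; Node(long f) { freeMb = f; } }
  static final class Request { long mb; boolean reserved; Request(long m, boolean r) { mb = m; reserved = r; } }

  // Try the request on an incoming node; if it fits, unreserve elsewhere and
  // allocate here, otherwise fall back to (re-)reserving as before.
  static String tryAssign(Node node, Request req) {
    if (node.freeMb >= req.mb) {
      req.reserved = false;      // swap the reservation out
      node.freeMb -= req.mb;     // real allocation on this node
      return "allocated";
    }
    req.reserved = true;
    return "reserved";
  }

  public static void main(String[] args) {
    Node node = new Node(12_288);                 // 12GB free on the heartbeating node
    Request big = new Request(10_240, true);      // 10GB ask, currently reserved elsewhere
    System.out.println(tryAssign(node, big));     // -> allocated
  }
}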



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

patch with log statements changed to debug

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

attaching the same patch to kick jenkins.


 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

Update tests to handle SystemMetricsPublisher

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

fix patch

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148098#comment-14148098
 ] 

Thomas Graves commented on YARN-1769:
-

We've been running this on a cluster for quite a while now and it's showing great 
improvements in the time to get larger containers.  I would like to put this in.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2431) NM restart: cgroup is not removed for reacquired containers

2014-09-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121796#comment-14121796
 ] 

Thomas Graves commented on YARN-2431:
-

+1. Thanks Jason! Feel free to check it in.

 NM restart: cgroup is not removed for reacquired containers
 ---

 Key: YARN-2431
 URL: https://issues.apache.org/jira/browse/YARN-2431
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-2431.patch


 The cgroup for a reacquired container is not being removed when the container 
 exits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2419) RM applications page doesn't sort application id properly

2014-08-14 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-2419:
---

 Summary: RM applications page doesn't sort application id properly
 Key: YARN-2419
 URL: https://issues.apache.org/jira/browse/YARN-2419
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Thomas Graves


The ResourceManager apps page doesn't sort the application ids properly when 
the app id rolls over from  to 1.

When it rolls over the 1+ application ids end up being many pages down by 
the 0XXX numbers.

I assume we just sort alphabetically so we would need a special sorter that 
knows about application ids.
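
A hedged sketch of such a sorter, assuming the standard application_<clusterTimestamp>_<sequence> id format (this is not the actual web UI code):

import java.util.Comparator;

class AppIdComparatorSketch implements Comparator<String> {
  // Compare the numeric fields of "application_<clusterTimestamp>_<sequence>"
  // instead of the raw strings, so application_..._10000 sorts after
  // application_..._9999 rather than before it.  (Illustrative only.)
  @Override
  public int compare(String a, String b) {
    String[] pa = a.split("_");
    String[] pb = b.split("_");
    int byCluster = Long.compare(Long.parseLong(pa[1]), Long.parseLong(pb[1]));
    return byCluster != 0
        ? byCluster
        : Long.compare(Long.parseLong(pa[2]), Long.parseLong(pb[2]));
  }
}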



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-07-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:


Attachment: YARN-1769.patch

fixed merge conflict.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2072) RM/NM UIs and webservices are missing vcore information

2014-06-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042565#comment-14042565
 ] 

Thomas Graves commented on YARN-2072:
-

+1, looks good. Thanks Nathan.

 RM/NM UIs and webservices are missing vcore information
 ---

 Key: YARN-2072
 URL: https://issues.apache.org/jira/browse/YARN-2072
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 3.0.0, 2.4.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: YARN-2072.patch, YARN-2072.patch


 Change RM and NM UIs and webservices to include virtual cores.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2072) RM/NM UIs and webservices are missing vcore information

2014-06-24 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-2072:


Issue Type: Improvement  (was: Bug)

 RM/NM UIs and webservices are missing vcore information
 ---

 Key: YARN-2072
 URL: https://issues.apache.org/jira/browse/YARN-2072
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 3.0.0, 2.4.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: YARN-2072.patch, YARN-2072.patch


 Change RM and NM UIs and webservices to include virtual cores.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2072) RM/NM UIs and webservices are missing vcore information

2014-06-18 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035901#comment-14035901
 ] 

Thomas Graves commented on YARN-2072:
-

Thanks Nathan, it mostly looks good. A few comments:

- In UserMetricsInfo, getReservedVirtualCores is returning 
allocatedVirtualCores and should be returning reservedVirtualCores.
- In MetricsOverviewTable you should capitalize the C in Vcores Reserved.

The only other thing is whether we want to do something special if the 
DominantResourceCalculator isn't being used. Right now with the 
CapacityScheduler and FifoScheduler it ends up showing each container as 
using 1 vcore.  The FairScheduler reports what the user would be using but I 
believe doesn't enforce it.  The FairScheduler reporting seems more intuitive, 
so perhaps we should change the CapacityScheduler/FIFO to do similar reporting. 
I think that would be a follow-up jira though.

Sandy, do you have any comments on this about how it shows up with 
fairscheduler?

 RM/NM UIs and webservices are missing vcore information
 ---

 Key: YARN-2072
 URL: https://issues.apache.org/jira/browse/YARN-2072
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager, webapp
Affects Versions: 3.0.0, 2.4.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: YARN-2072.patch


 Change RM and NM UIs and webservices to include virtual cores.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-18 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-2171:


Priority: Major  (was: Critical)

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-2171.patch, YARN-2171v2.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.
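
A hedged sketch of the general fix pattern for a hot read like this (illustrative only, not the actual RM code): serve the node count from a concurrent structure so allocate() heartbeats don't queue up behind the scheduler lock.

import java.util.concurrent.ConcurrentHashMap;

class NodeCountSketch {
  // Node membership is tracked in a thread-safe map updated on node
  // add/remove events; reads never take the scheduler-wide lock.
  private final ConcurrentHashMap<String, Boolean> nodes = new ConcurrentHashMap<>();

  void nodeAdded(String nodeId)   { nodes.put(nodeId, Boolean.TRUE); }
  void nodeRemoved(String nodeId) { nodes.remove(nodeId); }

  // Answer the AM heartbeat without blocking on scheduling work.
  int getNumClusterNodes() { return nodes.size(); }
}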



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-18 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-2171:


Priority: Critical  (was: Major)

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2171.patch, YARN-2171v2.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()

2014-06-18 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-2171:


Target Version/s: 2.5.0  (was: 0.23.11, 2.5.0)

 AMs block on the CapacityScheduler lock during allocate()
 -

 Key: YARN-2171
 URL: https://issues.apache.org/jira/browse/YARN-2171
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.10, 2.4.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2171.patch, YARN-2171v2.patch


 When AMs heartbeat into the RM via the allocate() call they are blocking on 
 the CapacityScheduler lock when trying to get the number of nodes in the 
 cluster via getNumClusterNodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-06-16 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:


Attachment: YARN-1769.patch

fix patch. I generated it from the wrong directory.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-06-16 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:


Attachment: YARN-1769.patch

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-06-11 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:


Attachment: YARN-1769.patch

patch to fix TestReservations after YARN-1474

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-05-27 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:


Attachment: YARN-1769.patch

upmerged to latest

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-05-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010358#comment-14010358
 ] 

Thomas Graves commented on YARN-1769:
-

TestFairScheduler is failing for other reasons.  see 
https://issues.apache.org/jira/browse/YARN-2105.

 CapacityScheduler:  Improve reservations
 

 Key: YARN-1769
 URL: https://issues.apache.org/jira/browse/YARN-1769
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
 YARN-1769.patch, YARN-1769.patch


 Currently the CapacityScheduler uses reservations in order to handle requests 
 for large containers and the fact there might not currently be enough space 
 available on a single host.
 The current algorithm for reservations is to reserve as many containers as 
 currently required and then it will start to reserve more above that after a 
 certain number of re-reservations (currently biased against larger 
 containers).  Anytime it hits the limit of number reserved it stops looking 
 at any other nodes. This results in potentially missing nodes that have 
 enough space to fulfill the request.
 The other place for improvement is currently reservations count against your 
 queue capacity.  If you have reservations you could hit the various limits 
 which would then stop you from looking further at that node.  
 The above 2 cases can cause an application requesting a larger container to 
 take a long time to get its resources.
 We could improve upon both of those by simply continuing to look at incoming 
 nodes to see if we could potentially swap out a reservation for an actual 
 allocation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort

2014-04-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972959#comment-13972959
 ] 

Thomas Graves commented on YARN-1946:
-

{quote}
We've replaced the normal AmFilter with one that doesn't proxy under ws/*, 
looking into its code
{quote}
Sorry, I don't understand what you are saying here. Are you saying your 
application has?  If it doesn't proxy, it doesn't really follow the YARN security 
rules.

You are correct that YarnConfiguration is public, but 
getProxyHostAndPort and now getProxyHostsAndPortsForAmFilter handle other 
things for you: https, RM HA, etc.

 need Public interface for WebAppUtils.getProxyHostAndPort
 -

 Key: YARN-1946
 URL: https://issues.apache.org/jira/browse/YARN-1946
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves
Priority: Critical

 ApplicationMasters are supposed to go through the ResourceManager web app 
 proxy if they have web UI's so they are properly secured.  There is currently 
 no public interface for Application Masters to conveniently get the proxy 
 host and port.  There is a function in WebAppUtils, but that class is 
 private.  
 We should provide this as a utility since any properly written AM will need 
 to do this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort

2014-04-16 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971421#comment-13971421
 ] 

Thomas Graves commented on YARN-1946:
-

Thanks for the suggestion Steve.  That is only the proxy base (/proxy/...), 
which I am actually also using:

val proxy = WebAppUtils.getProxyHostAndPort(conf)
val uriBase = "http://" + proxy +
  System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV)

Do you know if there is a way to more easily get the proxy host and port? 

If you are an application that doesn't use the Hadoop HttpServer for webapps, 
then in order to use the web app proxy you have to install the AmIpFilter 
directly and provide the PROXY_HOST and PROXY_URI_BASE.  If you are using the 
HttpServer then you can use the AmFilterInitializer, which figures out the host 
and port for you.  That is why I need the host and port.

Note that I would basically copy this code into the application trying to do 
this, but if I need to copy it then more than likely other applications would 
also, so perhaps we should make public utility functions for it.

Looking some more, I could also try to use the AmFilterInitializer by creating a 
class that implements the FilterContainer interface and then calling initFilter directly. 
This however would specialize the code, whereas using the AmIpFilter directly 
fits in with installing any normal Java servlet filter.  I would prefer not to 
do the specialized code since an application may run on multiple frameworks, 
YARN being one of them.

Also note back in hadoop 0.23 this was easier to get (publicly available) as it 
was in YarnConfiguration.getProxyHostAndPort

It also looks like my code is out of date, as this was updated recently 
to handle HA: YARN-1811

 need Public interface for WebAppUtils.getProxyHostAndPort
 -

 Key: YARN-1946
 URL: https://issues.apache.org/jira/browse/YARN-1946
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves
Priority: Critical

 ApplicationMasters are supposed to go through the ResourceManager web app 
 proxy if they have web UI's so they are properly secured.  There is currently 
 no public interface for Application Masters to conveniently get the proxy 
 host and port.  There is a function in WebAppUtils, but that class is 
 private.  
 We should provide this as a utility since any properly written AM will need 
 to do this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1931) Private API change in YARN-1824 in 2.4 broke compatibility with previous releases

2014-04-16 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971567#comment-13971567
 ] 

Thomas Graves commented on YARN-1931:
-

+1, Thanks Sandy.  I'll wait til this afternoon to commit in case Vinod has any 
further comments.



 Private API change in YARN-1824 in 2.4 broke compatibility with previous 
 releases
 -

 Key: YARN-1931
 URL: https://issues.apache.org/jira/browse/YARN-1931
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 2.4.0
Reporter: Thomas Graves
Assignee: Sandy Ryza
Priority: Blocker
 Attachments: YARN-1931-1.patch, YARN-1931-2.patch, YARN-1931.patch


 YARN-1824 broke compatibility with previous 2.x releases by changing the APIs 
 in org.apache.hadoop.yarn.util.Apps.{setEnvFromInputString,addToEnvironment}.  
 The old API should be added back in.
 This affects any ApplicationMasters that were using this API.  It also prevents 
 previously built MapReduce libraries from working with the new YARN release, 
 as MR uses this API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1942) ConverterUtils should not be Private

2014-04-15 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-1942:
---

 Summary: ConverterUtils should not be Private
 Key: YARN-1942
 URL: https://issues.apache.org/jira/browse/YARN-1942
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.4.0
Reporter: Thomas Graves


ConverterUtils has a bunch of functions that are useful to application masters.  
It should either be made public, or we should make some of the utilities in it 
public, or we should provide other external APIs for application masters to use.  
Note that distributedshell and MR are both using these interfaces.

For instance, the main use case I see right now is getting the application 
attempt id within the application master:
String containerIdStr = System.getenv(Environment.CONTAINER_ID.name());
ContainerId containerId = ConverterUtils.toContainerId(containerIdStr);
ApplicationAttemptId applicationAttemptId = containerId.getApplicationAttemptId();

I don't see any other way for the application master to get this information.  
If there is please let me know.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1942) ConverterUtils should not be Private

2014-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969695#comment-13969695
 ] 

Thomas Graves commented on YARN-1942:
-

Note that tez and spark are also using these utils.

 ConverterUtils should not be Private
 

 Key: YARN-1942
 URL: https://issues.apache.org/jira/browse/YARN-1942
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.4.0
Reporter: Thomas Graves
Priority: Critical

 ConverterUtils has a bunch of functions that are useful to application 
 masters.   It should either be made public or we make some of the utilities 
 in it public or we provide other external apis for application masters to 
 use.  Note that distributedshell and MR are both using these interfaces. 
 For instance the main use case I see right now is for getting the application 
 attempt id within the appmaster:
 String containerIdStr = System.getenv(Environment.CONTAINER_ID.name());
 ContainerId containerId = ConverterUtils.toContainerId(containerIdStr);
 ApplicationAttemptId applicationAttemptId = containerId.getApplicationAttemptId();
 I don't see any other way for the application master to get this information. 
  If there is please let me know.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1942) ConverterUtils should not be Private

2014-04-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1942:


Priority: Critical  (was: Major)

 ConverterUtils should not be Private
 

 Key: YARN-1942
 URL: https://issues.apache.org/jira/browse/YARN-1942
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.4.0
Reporter: Thomas Graves
Priority: Critical

 ConverterUtils has a bunch of functions that are useful to application 
 masters.  It should either be made public, or we should make some of its 
 utilities public, or we should provide other external APIs for application 
 masters to use.  Note that distributedshell and MR are both using these 
 interfaces. 
 For instance, the main use case I see right now is getting the application 
 attempt id within the AppMaster:
 String containerIdStr = System.getenv(Environment.CONTAINER_ID.name());
 ContainerId containerId = ConverterUtils.toContainerId(containerIdStr);
 ApplicationAttemptId applicationAttemptId = containerId.getApplicationAttemptId();
 I don't see any other way for the application master to get this information. 
 If there is, please let me know.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1942) ConverterUtils should not be Private

2014-04-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1942:


Target Version/s: 3.0.0, 2.4.1

 ConverterUtils should not be Private
 

 Key: YARN-1942
 URL: https://issues.apache.org/jira/browse/YARN-1942
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.4.0
Reporter: Thomas Graves
Priority: Critical

 ConverterUtils has a bunch of functions that are useful to application 
 masters.  It should either be made public, or we should make some of its 
 utilities public, or we should provide other external APIs for application 
 masters to use.  Note that distributedshell and MR are both using these 
 interfaces. 
 For instance, the main use case I see right now is getting the application 
 attempt id within the AppMaster:
 String containerIdStr = System.getenv(Environment.CONTAINER_ID.name());
 ContainerId containerId = ConverterUtils.toContainerId(containerIdStr);
 ApplicationAttemptId applicationAttemptId = containerId.getApplicationAttemptId();
 I don't see any other way for the application master to get this information. 
 If there is, please let me know.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort

2014-04-15 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-1946:
---

 Summary: need Public interface for WebAppUtils.getProxyHostAndPort
 Key: YARN-1946
 URL: https://issues.apache.org/jira/browse/YARN-1946
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves
Priority: Critical


ApplicationMasters that expose a web UI are supposed to go through the 
ResourceManager web app proxy so they are properly secured.  There is currently 
no public interface for ApplicationMasters to conveniently get the proxy host 
and port.  There is a function in WebAppUtils, but that class is private.

We should provide this as a utility, since any properly written AM will need to 
do this.
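
For reference, a sketch of what an AM ends up writing against the private class 
today (getProxyHostAndPort is the function the summary refers to; the exact 
signature, assumed here to take a Configuration and return "host:port", should 
be checked against the 2.4 source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.webapp.util.WebAppUtils;

public class ProxyAddressExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();

    // WebAppUtils is annotated Private, so an AM calling this is relying on an
    // interface that can change without notice -- hence this issue.
    String proxyHostAndPort = WebAppUtils.getProxyHostAndPort(conf);

    // The AM would typically hand this to its web UI / IP filter setup.
    System.out.println("RM web app proxy: " + proxyHostAndPort);
  }
}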



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1931) Private API change in YARN-1824 in 2.4 broke compatibility with previous releases

2014-04-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13970099#comment-13970099
 ] 

Thomas Graves commented on YARN-1931:
-

I also agree it makes more sense to make a new utility class.  Put these back 
for backwards compatibility, but ask people to move to the new API. 

Should this class be marked LimitedPrivate({MapReduce, Yarn})?  It's also 
missing the interface stability annotation.
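
For illustration, this is roughly what the annotations being discussed would 
look like on the class; a sketch only, since the audience values and the 
stability level chosen here are assumptions rather than what was actually 
committed:

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Visible only to the named projects; the stability annotation tells
// downstream users how much the API may change between releases.
@InterfaceAudience.LimitedPrivate({"MapReduce", "YARN"})
@InterfaceStability.Evolving
public class Apps {
  // existing utility methods ...
}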

 Private API change in YARN-1824 in 2.4 broke compatibility with previous 
 releases
 -

 Key: YARN-1931
 URL: https://issues.apache.org/jira/browse/YARN-1931
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications
Affects Versions: 2.4.0
Reporter: Thomas Graves
Assignee: Sandy Ryza
Priority: Blocker
 Attachments: YARN-1931-1.patch, YARN-1931.patch


 YARN-1824 broke compatibility with previous 2.x releases by changing the APIs 
 in org.apache.hadoop.yarn.util.Apps.{setEnvFromInputString,addToEnvironment}. 
 The old API should be added back in.
 This affects any ApplicationMasters that were using this API.  It also prevents 
 previously built MapReduce libraries from working with the new YARN release, 
 as MR uses this API. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1939) Improve the packaging of AmIpFilter

2014-04-14 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-1939:
---

 Summary: Improve the packaging of AmIpFilter
 Key: YARN-1939
 URL: https://issues.apache.org/jira/browse/YARN-1939
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves


It is recommended that applications use the AmIpFilter to properly secure any 
web UI that is specific to that application.  The AmIpFilter is packaged in 
org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an application 
to pull in yarn-server as a dependency; that isn't very user friendly for 
applications wanting to pick up the bare minimum.

We should improve the packaging so it can be pulled in independently.  We do 
need to be careful to keep it backwards compatible, at least in the 2.x 
release line.
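
To make the dependency pain concrete, this is roughly how an AM with an embedded 
Jetty UI wires the filter in today.  It is a sketch only: the init-parameter 
names for the proxy host and the proxy URI base are assumptions and should be 
checked against the AmIpFilter source for the release in use.

import java.util.EnumSet;

import javax.servlet.DispatcherType;

import org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.DefaultServlet;
import org.eclipse.jetty.servlet.FilterHolder;
import org.eclipse.jetty.servlet.ServletContextHandler;

public class AmWebUi {
  public static void main(String[] args) throws Exception {
    Server server = new Server(0);
    ServletContextHandler context = new ServletContextHandler();
    context.setContextPath("/");
    context.addServlet(DefaultServlet.class, "/*");

    // Pulling in AmIpFilter currently means depending on the whole
    // yarn-server web proxy module -- the packaging this issue wants to improve.
    FilterHolder amIpFilter = new FilterHolder(AmIpFilter.class);
    // Hypothetical parameter names; check AmIpFilter for the real constants
    // (the proxy host(s) and the proxy URI base for this application).
    amIpFilter.setInitParameter("PROXY_HOST", "rm-host.example.com");
    amIpFilter.setInitParameter("PROXY_URI_BASE",
        "http://rm-host.example.com:8088/proxy/application_1234_0001");
    context.addFilter(amIpFilter, "/*", EnumSet.of(DispatcherType.REQUEST));

    server.setHandler(context);
    server.start();
    server.join();
  }
}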



--
This message was sent by Atlassian JIRA
(v6.2#6252)

