[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2019-07-16 Thread Eric Payne (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886402#comment-16886402
 ] 

Eric Payne commented on YARN-462:
-

[~kthrapp], it has been several years. Is there still a need for this 
functionality?

> Project Parameter for Chargeback
> 
>
> Key: YARN-462
> URL: https://issues.apache.org/jira/browse/YARN-462
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 0.23.6
>Reporter: Kendall Thrapp
>Priority: Major
>
> Problem Summary
> For the purpose of chargeback and better understanding of grid usage, we need 
> to be able to associate applications with "projects", e.g. "pipeline X", 
> "property Y".  This would allow us to aggregate on this property, thereby 
> helping us compute grid resource usage for the entire "project".  Currently, 
> for a given application, two things we know about it are the user that 
> submitted it and the queue it was submitted to.  Below, I'll explain why 
> neither of these is adequate for enterprise-level chargeback and 
> understanding resource allocation needs.
> Why Not Users?
> It's not individual users that are paying the bill -- it's projects.  When one 
> of our real users submits an application on a Hadoop grid, they're presumably 
> not usually doing it for themselves.  They're doing work for some project or 
> team effort, so it's that team or project that should be "charged" for all its 
> users' applications.  Maintaining outside lists of associations between users 
> and projects is error-prone because it is time-sensitive and requires 
> ongoing maintenance.  New users join organizations, users leave, and 
> users even change projects.  Furthermore, users may split their time between 
> multiple projects, making it ambiguous which of a user's projects a 
> given application should be charged to.  Also, there can be headless users, 
> which can be even more difficult to link to a project and can be shared 
> between teams or projects.
> Why Not Queues?
> The purpose of queues is for scheduling.  Overloading the queues concept to 
> also mean who should be "charged" for an application can have a detrimental 
> effect on the primary purpose of queues.  It could be manageable in the case 
> of a very small number of projects sharing a cluster, but doesn't scale to 
> tens or hundreds of projects sharing a cluster.  If a given cluster is shared 
> between 50 projects, creating 50 separate queues will result in inefficient 
> use of the cluster resources.  Furthermore, a given project may desire more 
> than one queue for different types or priorities of applications.  
> Proposed Solution
> Rather than relying on external tools to infer through the user and/or queue 
> whom to "charge" for a given application, I propose a straightforward approach 
> where that information is explicitly supplied when the application is 
> submitted, just like we do with queues.  Let's use a charge card analogy: 
> when you buy something online, you don't just say who you are and how to ship 
> it, you also specify how you're paying for it.  Similarly, when submitting an 
> application in YARN, you could explicitly specify with whom its resource usage 
> should be associated (a project, team, cost center, etc).
> This new configuration parameter should default to being optional, so that 
> organizations not interested in chargeback or project-level resource tracking 
> can happily continue on as if it wasn't there.  However, it should be 
> configurable at the cluster level such that a given cluster could elect 
> to make it required, so that all applications would have an associated 
> project.  The value of this new parameter should be exposed via the Resource 
> Manager UI and Resource Manager REST API, so that users and tools can make 
> use of it for chargeback, utilization metrics, etc.
> I'm undecided on what to name the new parameter, as I like the flexibility in 
> the ways it could be used.  It is essentially just an additional party other 
> than user or queue that an application can be associated with, so its use is 
> not just limited to a chargeback scenario.  For example, an organization not 
> interested in chargeback could still use this parameter to communicate useful 
> information about an application (e.g. pipelineX.stageN) and aggregate like 
> applications.
> Enforcement
> Couldn't users just specify this information as a prefix for their job names? 
>  Yes, but the missing piece this would provide is enforcement.  Ideally, I'd 
> like this parameter to work very much like how the queues work.  As already 
> exists with queues, it'd be ideal if a given user couldn't just specify any 
> old value for this parameter.  It could be configurable such that a given 
> user only has permission to submit 
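
The optional/required behaviour and the queue-like enforcement described in
the issue can be sketched roughly as follows. This is only an illustration,
not YARN code: ClusterPolicy, project_required, project_acls and
check_submission are hypothetical names, and the ACL is assumed to be a plain
mapping from project to permitted users.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


class SubmissionRejected(Exception):
    """Raised when an application may not be submitted as requested."""


@dataclass
class ClusterPolicy:
    # Hypothetical cluster-level switch: False keeps the parameter optional.
    project_required: bool = False
    # Hypothetical ACL: project name -> users allowed to charge work to it.
    project_acls: Dict[str, Set[str]] = field(default_factory=dict)


def check_submission(policy: ClusterPolicy, user: str,
                     project: Optional[str] = None) -> Optional[str]:
    """Return the project to record for the application, or raise."""
    if project is None:
        if policy.project_required:
            raise SubmissionRejected("this cluster requires a project")
        return None  # clusters that ignore chargeback are unaffected
    allowed = policy.project_acls.get(project)
    if allowed is None:
        raise SubmissionRejected(f"unknown project: {project}")
    if user not in allowed:
        raise SubmissionRejected(f"{user} may not charge project {project}")
    return project
```

With the default policy a submission without a project passes through
unchanged; a cluster that sets project_required rejects it instead.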

[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2015-10-06 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945933#comment-14945933
 ] 

Ruslan Dautkhanov commented on YARN-462:


It's probably related to https://issues.apache.org/jira/browse/YARN-415 


[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2015-10-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945950#comment-14945950
 ] 

Jason Lowe commented on YARN-462:
-

It is related to YARN-415 but not implemented by it.  YARN-415 adds the ability 
to know the aggregate resource usage of an application, but it doesn't help 
associate that application with a project or business entity.  To use Kendall's 
analogy from above, YARN-415 lets us know how much an app costs but not who's 
supposed to be paying for it.  One would think that would be the user, but in 
many cluster setups a single user can run jobs for multiple projects.



[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2013-03-19 Thread Kendall Thrapp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606949#comment-13606949
 ] 

Kendall Thrapp commented on YARN-462:
-

Karthik, I think your suggestion for transparent project queues under the leaf 
queues is an interesting idea and would also meet my requirements.


[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2013-03-13 Thread Kendall Thrapp (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601245#comment-13601245
 ] 

Kendall Thrapp commented on YARN-462:
-

Thanks for the questions and feedback.  Yes, first I should clarify what I 
intended by chargeback.  I'm looking to be able to quantify cluster resource 
usage (memory, CPU, HDFS, etc.) for every application, and then roll that up to the 
project level.  This would allow us to accurately charge the customer (i.e. 
team/project) for their grid usage (either literally or just informatively).  I 
want to provide incentive for more efficient coding, as well as make it easier 
for teams to compare their resource usage across different software versions of 
their Hadoop applications, config parameter changes, etc.

I had originally hoped that hierarchical queues could serve this purpose as 
well, but have since run into several issues with this approach.  The first is 
that it doesn't scale for clusters with large numbers of projects.  I've seen 
large clusters shared between over a hundred different projects, each with 
their own teams of users.  If I recall correctly, queues can't be assigned less 
than 1% of the total capacity, so it wouldn't be possible to give each of these 
projects their own queue.  Even if we could, I suspect this could result in too 
much overhead for the scheduler and too much fragmentation of the cluster 
resources, which could result in poorer overall utilization.

The second issue is that the project-per-queue approach conflicts with how I 
see users wanting to use our queues.  In many cases I see queues being used to 
distinguish application priorities, ensuring that high-priority, time-sensitive 
jobs get the resources they need to finish on time, while big but lower-priority 
and less time-sensitive jobs are constrained by being in a smaller 
queue.  I'd expect a lot of pushback from our users for any chargeback-focused 
queue configuration that had a negative impact on job run times and meeting 
SLAs.  The idea of the project/chargeback parameter decouples the two.
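
The roll-up described in this comment, with the project kept separate from
the queue, might look roughly like the sketch below. It is an illustration
only: AppUsage and its fields are hypothetical, and the per-application
totals (memory_mb_seconds, vcore_seconds) are assumed to be available from
some per-application accounting source.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class AppUsage(NamedTuple):
    app_id: str
    user: str
    queue: str
    project: Optional[str]    # hypothetical new submission parameter
    memory_mb_seconds: int    # assumed per-application aggregate
    vcore_seconds: int        # assumed per-application aggregate


def usage_by_project(apps: Iterable[AppUsage]) -> Dict[str, Dict[str, int]]:
    """Sum per-application usage by project, ignoring user and queue."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"memory_mb_seconds": 0, "vcore_seconds": 0})
    for app in apps:
        # Applications submitted without a project are grouped together.
        key = app.project or "(unassigned)"
        totals[key]["memory_mb_seconds"] += app.memory_mb_seconds
        totals[key]["vcore_seconds"] += app.vcore_seconds
    return dict(totals)
```

Because the grouping key is the project alone, one project's jobs can run in
both a high-priority and a low-priority queue and still be billed together.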


[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2013-03-13 Thread Andy Rhee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601601#comment-13601601
 ] 

Andy Rhee commented on YARN-462:


Kendall - Again, another great idea!  Two things popped into my mind.  

1. I wonder if we also need to verify and enforce project validity on a given 
cluster against a whitelist or blacklist in the cluster config (this might 
even be tied to an external source of truth like LDAP later), or whether we 
should decouple validation and delegate it to other parts or an external 
process, e.g. queue, user, or project accounting. 
 
2. Another interesting spin-off of your idea could be flexible enforceable 
parameters or meta config.  Instead of modifying the code every time we 
have a great idea for a new parameter to enforce, it may be more cost-effective 
to allow admins to define enforceable parameters in the cluster config, so that 
we don't have to worry about what to name the new parameter or changing it later, 
IMHO :)
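
Both points above could be combined in something like the following sketch,
where admins declare the enforceable parameters and their whitelists or
blacklists in configuration rather than in code. Everything here is
hypothetical (ParamRule, validate, and the rule fields are illustrative
names, not an existing YARN facility).

```python
from typing import FrozenSet, List, Mapping, NamedTuple, Optional


class ParamRule(NamedTuple):
    """One admin-defined enforceable parameter (hypothetical schema)."""
    required: bool = False
    allowed: Optional[FrozenSet[str]] = None   # whitelist; None = any value
    denied: FrozenSet[str] = frozenset()       # blacklist


def validate(rules: Mapping[str, ParamRule],
             submitted: Mapping[str, str]) -> List[str]:
    """Return a list of violations; an empty list means the submission passes."""
    errors: List[str] = []
    for name, rule in rules.items():
        value = submitted.get(name)
        if value is None:
            if rule.required:
                errors.append(f"{name}: required")
            continue
        if rule.allowed is not None and value not in rule.allowed:
            errors.append(f"{name}: {value!r} is not whitelisted")
        elif value in rule.denied:
            errors.append(f"{name}: {value!r} is blacklisted")
    return errors
```

Adding a second enforceable parameter then means adding one more entry to the
rules mapping; the name of the parameter is just a configuration key.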


[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2013-03-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601652#comment-13601652
 ] 

Karthik Kambatla commented on YARN-462:
---

Fair points, Kendall. Thanks for the detailed explanation. As Arun said, the 
idea seems to be a very useful one, but we should be wary of adding new 
concepts to YARN. If we decide to go ahead with the chargeback parameter, I am 
concerned that we will end up duplicating a lot of scheduler code - ACLs, 
enforcement etc.

I wonder if the following would satisfy your requirements while leveraging all 
queue definition/ACLs logic and not overloading the scheduler:
- The idea is a 'project' queue that goes under the leaf queues. These 'project' 
queues are transparent to the scheduler at scheduling time, but keep track of 
the actual usage.
- e.g. root.sales.seller1.sell-coconut-project and 
root.sales.seller1.sell-pineapple-project could be two queues for seller1. At 
schedule time, the scheduler views all jobs under both projects to be under 
seller1, and we hopefully won't run into the capacity < 1% issues you are 
mentioning. Nor does it increase the scheduling latency.
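
A rough sketch of the 'project' queue idea above: the scheduler only sees the
leaf queue, while usage is recorded against the full project path. The names
scheduling_queue and UsageTracker are hypothetical, and the sketch assumes
the project is always the last dot-separated component of the queue path.

```python
from collections import defaultdict
from typing import Dict


def scheduling_queue(project_queue: str) -> str:
    """Drop the trailing project component to get the leaf queue the
    scheduler would see."""
    leaf, _, project = project_queue.rpartition(".")
    if not leaf or not project:
        raise ValueError(f"not a project queue path: {project_queue!r}")
    return leaf


class UsageTracker:
    """Accounts usage per project queue and per scheduler-visible leaf."""

    def __init__(self) -> None:
        self.by_project: Dict[str, int] = defaultdict(int)
        self.by_leaf: Dict[str, int] = defaultdict(int)

    def record(self, project_queue: str, memory_mb_seconds: int) -> None:
        self.by_project[project_queue] += memory_mb_seconds
        self.by_leaf[scheduling_queue(project_queue)] += memory_mb_seconds
```

Capacity would still be configured only on root.sales.seller1, so the number
of projects under it does not affect how finely capacity has to be divided.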

[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2013-03-09 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598147#comment-13598147
 ] 

Arun C Murthy commented on YARN-462:


Kendall - what sort of chargeback are you thinking about? Just reporting?

One of the reasons we implemented hierarchical queues is to ease management for 
cases like you've talked about here - won't it suffice to have sub-queues?

I have sympathy for the notion of chargebacks, but I'm wary of adding new 
concepts to YARN without really double-clicking into whether it's really 
necessary... let's discuss.

[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2013-03-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597818#comment-13597818
 ] 

Karthik Kambatla commented on YARN-462:
---

Interesting idea, Kendall.

Just so I understand it right, can't we do this with the way the schedulers 
(Capacity Scheduler and Fair Scheduler) currently work? If each chargeable 
entity has a corresponding queue, it is just a matter of submitting to that 
queue, no? The hierarchy should help us address user, team, etc.

I see one case where that might not work though - if entity A (user) is part of 
two other entities (teams) B and C and submits jobs to both queues. Is this the 
particular case you are referring to? Even then you should be able to create 
queues B.A and C.A and submit the jobs.
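
To make the B.A / C.A case concrete, here is a minimal sketch of how per-queue usage could be aggregated either per team (by rolling up to ancestors) or per user (by the last path component). The helper names and the usage numbers are made up for illustration; this is not scheduler code.

```python
from collections import defaultdict


def rollup_by_ancestor(leaf_usage: dict) -> dict:
    """Add each leaf queue's usage to the queue itself and every ancestor."""
    totals = defaultdict(float)
    for path, amount in leaf_usage.items():
        parts = path.split(".")
        for i in range(1, len(parts) + 1):
            totals[".".join(parts[:i])] += amount
    return dict(totals)


def total_by_last_component(leaf_usage: dict) -> dict:
    """Aggregate usage by the final path component (the user 'A' here)."""
    totals = defaultdict(float)
    for path, amount in leaf_usage.items():
        totals[path.rsplit(".", 1)[-1]] += amount
    return dict(totals)


# Hypothetical usage figures for user A working for teams B and C.
usage = {"root.B.A": 10.0, "root.C.A": 5.0}
per_queue = rollup_by_ancestor(usage)     # per-team totals under root.B, root.C
per_user = total_by_last_component(usage)  # combined total for A
```

With queues laid out this way, team B and team C are each charged only for the work A did on their behalf, and A's overall usage can still be recovered.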

Am I missing something basic here?



[jira] [Commented] (YARN-462) Project Parameter for Chargeback

2013-03-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597822#comment-13597822
 ] 

Karthik Kambatla commented on YARN-462:
---

I might have jumped the gun there with my previous comment. I was under the 
impression the chargeback would factor back into the scheduling logic. If it 
does not, it would definitely help to understand more fully the kind of 
chargebacks you are referring to, and what they are intended for.

 Project Parameter for Chargeback
 

 Key: YARN-462
 URL: https://issues.apache.org/jira/browse/YARN-462
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Affects Versions: 0.23.6
Reporter: Kendall Thrapp

 Problem Summary
 For the purpose of chargeback and better understanding of grid usage, we need 
 to be able to associate applications with "projects", e.g. "pipeline X", 
 "property Y".  This would allow us to aggregate on this property, thereby 
 helping us compute grid resource usage for the entire "project".  Currently, 
 for a given application, two things we know about it are the user that 
 submitted it and the queue it was submitted to.  Below, I'll explain why 
 neither of these is adequate for enterprise-level chargeback and 
 understanding resource allocation needs.
 Why Not Users?
 It's not individual users that are paying the bill -- it's projects.  When one 
 of our real users submits an application on a Hadoop grid, they're presumably 
 not usually doing it for themselves.  They're doing work for some project or 
 team effort, so it's that team or project that should be "charged" for all its 
 users' applications.  Maintaining outside lists of associations between users 
 and projects is error-prone because it is time-sensitive and requires 
 continued ongoing maintenance.  New users join organizations, users leave and 
 users even change projects.  Furthermore, users may split their time between 
 multiple projects, making it ambiguous as to which of a user's projects a 
 given application should be charged.  Also, there can be headless users, 
 which can be even more difficult to link to a project and can be shared 
 between teams or projects.
 Why Not Queues?
 The purpose of queues is for scheduling.  Overloading the queues concept to 
 also mean who should be charged for an application can have a detrimental 
 effect on the primary purpose of queues.  It could be manageable in the case 
 of a very small number of projects sharing a cluster, but doesn't scale to 
 tens or hundreds of projects sharing a cluster.  If a given cluster is shared 
 between 50 projects, creating 50 separate queues will result in inefficient 
 use of the cluster resources.  Furthermore, a given project may desire more 
 than one queue for different types or priorities of applications.  
 Proposed Solution
 Rather than relying on external tools to infer through the user and/or queue 
 who to charge for a given application, I propose a straightforward approach 
 where that information be explicitly supplied when the application is 
 submitted, just like we do with queues.  Let's use a charge card analogy: 
 when you buy something online, you don't just say who you are and how to ship 
 it, you also specify how you're paying for it.  Similarly, when submitting an 
 application in YARN, you could explicitly specify to whom its resource usage 
 should be associated (a project, team, cost center, etc).
 This new configuration parameter should default to being optional, so that 
 organizations not interested in chargeback or project-level resource tracking 
 can happily continue on as if it wasn't there.  However, it should be 
 configurable at the cluster-level such that a given cluster could elect 
 to make it required, so that all applications would have an associated 
 project.  The value of this new parameter should be exposed via the Resource 
 Manager UI and Resource Manager REST API, so that users and tools can make 
 use of it for chargeback, utilization metrics, etc.
 I'm undecided on what to name the new parameter, as I like the flexibility in 
 the ways it could be used.  It is essentially just an additional party other 
 than user or queue that an application can be associated with, so its use is 
 not just limited to a chargeback scenario.  For example, an organization not 
 interested in chargeback could still use this parameter to communicate useful 
 information about an application (e.g. pipelineX.stageN) and aggregate like 
 applications.
 Enforcement
 Couldn't users just specify this information as a prefix for their job names? 
  Yes, but the missing piece this could provide is enforcement.  Ideally, I'd 
 like this parameter to work very much like how the queues work.  Like already 
 exists with queues, it'd be ideal if