[jira] [Commented] (YARN-1027) Implement RMHAServiceProtocol
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729294#comment-13729294 ]

Karthik Kambatla commented on YARN-1027:

[~nemon], if you haven't started work on this already, do you mind if I take it up? I have been discussing it with Bikas (on YARN-149 and offline) and have already started working on it.

Implement RMHAServiceProtocol
Key: YARN-1027
URL: https://issues.apache.org/jira/browse/YARN-1027
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: nemon lou

Implement the existing HAServiceProtocol from Hadoop common. This protocol is the single point of interaction between the RM and HA clients/services.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-1026) Test and verify ACL based ZKRMStateStore fencing for RM State Store
[ https://issues.apache.org/jira/browse/YARN-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla reassigned YARN-1026:
Assignee: Karthik Kambatla

Test and verify ACL based ZKRMStateStore fencing for RM State Store
Key: YARN-1026
URL: https://issues.apache.org/jira/browse/YARN-1026
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Karthik Kambatla

ZooKeeper allows create/delete ACLs for the immediate children of a znode. It also has admin ACLs on a znode that allow changing the create/delete ACLs on that znode. RM instances could share the admin ACLs on the state store's root znode. When an RM transitions to active, it can use the shared admin ACLs to give itself exclusive create/delete permissions on the children of the root znode. If all ZK state store operations are atomic and involve creating or deleting a znode, this effectively fences other RM instances from modifying the store. The ACL change is only allowed when transitioning to active.
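The fencing scheme described above can be sketched as a toy in-memory model. This is not the ZooKeeper client API; the `FencedStore` class and its method names are invented for illustration. It only models the intended semantics: whichever RM most recently used the shared admin ACL to grant itself exclusive create/delete permissions can write children, and a previously active RM is rejected.

```python
# Toy model of ACL-based fencing (names invented; not the ZooKeeper API).

class FencedStore:
    """Root znode whose create/delete ACL names exactly one active RM."""

    def __init__(self, admin_ids):
        self.admin_ids = set(admin_ids)  # RMs sharing the admin ACL
        self.writer_id = None            # current holder of create/delete perms
        self.children = {}

    def become_active(self, rm_id):
        # Only holders of the shared admin ACL may change the write ACL.
        if rm_id not in self.admin_ids:
            raise PermissionError(f"{rm_id} lacks the admin ACL")
        self.writer_id = rm_id           # exclusive create/delete perms

    def create_child(self, rm_id, name, data):
        # Atomic create: rejected unless rm_id currently holds write perms,
        # which fences a previously active RM that lost the ACL.
        if rm_id != self.writer_id:
            raise PermissionError(f"{rm_id} is fenced")
        self.children[name] = data

store = FencedStore(admin_ids=["rm1", "rm2"])
store.become_active("rm1")
store.create_child("rm1", "app_0001", b"state")    # active RM writes fine
store.become_active("rm2")                         # failover: rm2 takes over
try:
    store.create_child("rm1", "app_0002", b"state")  # old active is fenced
except PermissionError as e:
    print(e)
```

Since every mutation goes through an atomic create/delete guarded by the ACL, a stale RM cannot corrupt the store even if it has not yet noticed it lost leadership.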
[jira] [Assigned] (YARN-1029) Allow embedding leader election into the RM
[ https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla reassigned YARN-1029:
Assignee: Karthik Kambatla

Allow embedding leader election into the RM
Key: YARN-1029
URL: https://issues.apache.org/jira/browse/YARN-1029
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Karthik Kambatla

It should be possible to embed the common ActiveStandbyElector into the RM so that ZooKeeper-based leader election and notification are built in. In conjunction with a ZK state store, this configuration would be a simple deployment option.
[jira] [Commented] (YARN-1027) Implement RMHAServiceProtocol
[ https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729348#comment-13729348 ]

nemon lou commented on YARN-1027:

I had also started working on this, since it was unassigned. It's OK to take it up; I will review the patch :)

Implement RMHAServiceProtocol
Key: YARN-1027
URL: https://issues.apache.org/jira/browse/YARN-1027
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729530#comment-13729530 ]

Timothy St. Clair commented on YARN-160:

I think the prudent approach would be to evaluate hwloc and its community, and determine whether it meets the internal needs of YARN. For risk mitigation, I think having a plugin abstraction layer as a fallback would also be wise. I noticed there are also Java bindings for hwloc (https://launchpad.net/jhwloc/).

nodemanagers should obtain cpu/memory values from underlying OS
Key: YARN-160
URL: https://issues.apache.org/jira/browse/YARN-160
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
Fix For: 2.1.0-beta

As mentioned in YARN-2 (*NM memory and CPU configs*), these values currently come from the NM's config; we should be able to obtain them from the OS (i.e., on Linux from /proc/meminfo and /proc/cpuinfo). As this is highly OS-dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (an amount of mem/cpu not to be made available as a YARN resource); this would allow reserving mem/cpu for the OS and other services outside of YARN containers.
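The kind of probing the description proposes can be sketched as follows. This is an illustrative sketch, not NodeManager code: the sample /proc/meminfo content is hardcoded, the function name is invented, and the offset parameter models the proposed reservation for the OS and other daemons.

```python
# Hedged sketch: derive a node's YARN memory resource from
# /proc/meminfo-style text, minus a configured offset reserved for the OS.

SAMPLE_MEMINFO = """\
MemTotal:       16335948 kB
MemFree:         2085632 kB
Buffers:          243340 kB
"""

def yarn_memory_mb(meminfo_text, offset_mb):
    """Total memory reported by the OS, minus a reserved offset, in MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total_kb = int(line.split()[1])   # /proc/meminfo values are in kB
            return max(total_kb // 1024 - offset_mb, 0)
    raise ValueError("MemTotal not found")

# Reserve 2 GB for the OS; the rest would be offered to YARN containers.
print(yarn_memory_mb(SAMPLE_MEMINFO, offset_mb=2048))  # -> 13905
```

A real implementation would sit behind the OS-abstraction interface the description calls for, with a Windows counterpart reading its own APIs instead of /proc.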
[jira] [Commented] (YARN-1025) NodeManager does not propagate java.library.path to launched child containers on Windows
[ https://issues.apache.org/jira/browse/YARN-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729549#comment-13729549 ]

Kihwal Lee commented on YARN-1025:

On Linux, use of LD_LIBRARY_PATH is better because it's easier to manipulate (e.g. path munging) and offers better search and error handling. When java.library.path is set, the JVM tries to load the first match. If that fails, the failure is permanent, i.e. no further search is done. This is unacceptable if the search paths contain libraries for multiple architectures (e.g. 32-bit and 64-bit). When LD_LIBRARY_PATH is used exclusively, the system loader is in charge, and it does a much better job. I believe the behavior is similar on Windows with PATH.

NodeManager does not propagate java.library.path to launched child containers on Windows
Key: YARN-1025
URL: https://issues.apache.org/jira/browse/YARN-1025
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Chris Nauroth

Neither the NodeManager process itself nor the child container processes that it launches have the correct setting for java.library.path on Windows. This prevents the processes from loading native code from hadoop.dll. The native code is required for correct functioning on Windows (not optional), so this can ultimately cause failures in MapReduce jobs.
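The two search behaviors contrasted in the comment can be modeled with a toy sketch. This is not JVM or loader internals; the functions and the in-memory "filesystem" are invented to illustrate the difference: a first-match search stops (permanently) at the first name hit, even if that copy is the wrong architecture, while a fallback search keeps trying later directories.

```python
# Toy model of the library-search semantics described above (invented
# names; not actual JVM/loader code).

def load_first_match(paths, libs, name):
    """java.library.path-style: stop at the first directory containing
    `name`; if that copy is unusable, no further search happens."""
    for d in paths:
        if name in libs.get(d, {}):
            return libs[d][name]          # may be the wrong architecture
    return None

def load_with_fallback(paths, libs, name, arch):
    """LD_LIBRARY_PATH-style system loader: keep trying later
    directories until a copy that actually loads (matching arch)."""
    for d in paths:
        if libs.get(d, {}).get(name) == arch:
            return arch
    return None

# /lib32 appears first on the path but holds only the 32-bit libfoo.
libs = {"/lib32": {"libfoo": "32bit"}, "/lib64": {"libfoo": "64bit"}}
paths = ["/lib32", "/lib64"]
print(load_first_match(paths, libs, "libfoo"))             # 32-bit copy
print(load_with_fallback(paths, libs, "libfoo", "64bit"))  # 64-bit copy
```

For a 64-bit JVM, the first-match strategy would fail permanently on the 32-bit copy, which is exactly the objection to relying on java.library.path for mixed-architecture paths.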
[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Yan updated YARN-1021:
Description:

The Yarn Scheduler is a fertile area of interest, with different implementations, e.g., the Fifo, Capacity, and Fair schedulers. Meanwhile, several optimizations have been made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features and drives scheduling decisions by many factors, such as fairness, capacity guarantees, resource availability, etc. It is very important to evaluate a scheduler algorithm well before deploying it in a production cluster. Unfortunately, evaluating a scheduling algorithm is currently non-trivial: evaluating in a real cluster is time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator that can predict how well a scheduler algorithm performs for some specific workload would be quite useful.

We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation.

The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters, handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including:

* Resource usage for the whole cluster and each queue, which can be utilized to configure the cluster's and each queue's capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantees, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.

The simulator will provide real-time charts showing the behavior of the scheduler and its performance.

Yarn Scheduler Load Simulator
Key: YARN-1021
URL: https://issues.apache.org/jira/browse/YARN-1021
Project:
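The core idea in the description, driving a real scheduler with simulated heartbeat events from a single process while timing each operation, can be sketched in miniature. Everything here is invented for illustration (this is not the actual simulator's code): `ToyScheduler` stands in for the real scheduler, and `simulate` plays the role of the event dispatcher plus the timing wrapper.

```python
# Minimal sketch of an in-process heartbeat-driven scheduler simulation
# (names invented; not the YARN-1021 implementation).
import heapq
import time

class ToyScheduler:
    """Stand-in for the real scheduler being exercised."""
    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)

    def handle_heartbeat(self, node_id):
        # Allocate one pending task per heartbeat, FIFO order.
        return (node_id, self.pending.pop(0)) if self.pending else None

def simulate(scheduler, nodes, interval, end_time):
    """Dispatch simulated NM heartbeats from a virtual clock; record the
    wall-clock cost of each scheduler operation, like the wrapper would."""
    events = [(interval, n) for n in nodes]   # (virtual time, node)
    heapq.heapify(events)
    allocations, costs = [], []
    while events:
        t, node = heapq.heappop(events)
        if t > end_time:
            break
        start = time.perf_counter()
        result = scheduler.handle_heartbeat(node)
        costs.append(time.perf_counter() - start)  # per-operation time cost
        if result:
            allocations.append((t, result))
        heapq.heappush(events, (t + interval, node))  # next heartbeat
    return allocations, costs

allocs, costs = simulate(ToyScheduler(["t1", "t2", "t3"]),
                         nodes=["nm1", "nm2"], interval=1, end_time=3)
print([a for _, a in allocs])
```

Because heartbeats are ordinary in-process events on a virtual clock, the network factor disappears and a large cluster's event rate can be replayed on one machine, which is the simulator's central trick.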
[jira] [Updated] (YARN-975) Adding HDFS implementation for grouped reading and writing interfaces of history storage
[ https://issues.apache.org/jira/browse/YARN-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-975:
Attachment: YARN-975.4.patch

Updated the patch according to YARN-1007.

Adding HDFS implementation for grouped reading and writing interfaces of history storage
Key: YARN-975
URL: https://issues.apache.org/jira/browse/YARN-975
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Attachments: YARN-975.1.patch, YARN-975.2.patch, YARN-975.3.patch, YARN-975.4.patch

An HDFS implementation should be a standard persistence strategy of the history storage.
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729724#comment-13729724 ]

Arun C Murthy commented on YARN-1024:

bq. If I were to package my simulator and give it to other people on other clusters, it would still be true that it spins one CPU. Its runtime, however, would vary depending on the horsepower.

I don't see the conflict. If you don't care about predictable runtime, you could still say "I want to run on 1 virtual core." Given the above non-requirement on predictability, whether it's 1 (virtual) core out of 16 physical cores or 1024 virtual cores is immaterial, isn't it? And yes, you still get only 1 physical core, since the virtual core is mapped to a single physical core. The point of specifying a virtual core is that you get predictable performance when you migrate your application between clusters, among other goodness. What am I missing here?

Define a virtual core unambigiously
Key: YARN-1024
URL: https://issues.apache.org/jira/browse/YARN-1024
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Arun C Murthy
Assignee: Arun C Murthy

We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g., here is Amazon EC2's definition of an ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it

Essentially, we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use the ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*
[jira] [Commented] (YARN-975) Adding HDFS implementation for grouped reading and writing interfaces of history storage
[ https://issues.apache.org/jira/browse/YARN-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729730#comment-13729730 ]

Hadoop QA commented on YARN-975:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12596159/YARN-975.4.patch against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1656//console

This message is automatically generated.

Adding HDFS implementation for grouped reading and writing interfaces of history storage
Key: YARN-975
URL: https://issues.apache.org/jira/browse/YARN-975
[jira] [Updated] (YARN-696) Enable multiple states to to be specified in Resource Manager apps REST call
[ https://issues.apache.org/jira/browse/YARN-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trevor Lorimer updated YARN-696:
Attachment: YARN-696.diff

Enable multiple states to to be specified in Resource Manager apps REST call
Key: YARN-696
URL: https://issues.apache.org/jira/browse/YARN-696
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.0.4-alpha
Reporter: Trevor Lorimer
Assignee: Trevor Lorimer
Priority: Trivial
Attachments: YARN-696.diff

Within the YARN Resource Manager REST API, the GET call that returns all applications can be filtered by a single State query parameter (http://rm http address:port/ws/v1/cluster/apps). There are 8 possible states (New, Submitted, Accepted, Running, Finishing, Finished, Failed, Killed). If no state parameter is specified, all states are returned; however, if a subset of states is required, then multiple REST calls are needed (a max. of 7). The proposal is to be able to specify multiple states in a single REST call.
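The proposed extension can be sketched as a small parsing/filtering routine. The `states` parameter name and the helper are invented for illustration (the actual RM webapp code may differ): a comma-separated list is validated against the known states, with a missing parameter meaning no filtering.

```python
# Hedged sketch of accepting multiple states in one REST call
# (parameter and function names invented; not the RM webapp code).

VALID_STATES = {"NEW", "SUBMITTED", "ACCEPTED", "RUNNING",
                "FINISHING", "FINISHED", "FAILED", "KILLED"}

def parse_states_param(raw):
    """Split e.g. ?states=running,finished into a validated set of
    states; an empty or missing parameter means all states."""
    if not raw:
        return set(VALID_STATES)
    states = {s.strip().upper() for s in raw.split(",") if s.strip()}
    bad = states - VALID_STATES
    if bad:
        raise ValueError(f"invalid state(s): {sorted(bad)}")
    return states

# Filter a toy app list the way the single GET call would.
apps = [("app_1", "RUNNING"), ("app_2", "FINISHED"), ("app_3", "KILLED")]
wanted = parse_states_param("running,finished")
print([a for a, s in apps if s in wanted])
```

One call such as `GET /ws/v1/cluster/apps?states=running,finished` would then replace up to 7 separate single-state requests.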
[jira] [Updated] (YARN-696) Enable multiple states to to be specified in Resource Manager apps REST call
[ https://issues.apache.org/jira/browse/YARN-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trevor Lorimer updated YARN-696:
Attachment: (was: 0001-YARN-696.patch)

Enable multiple states to to be specified in Resource Manager apps REST call
Key: YARN-696
URL: https://issues.apache.org/jira/browse/YARN-696
[jira] [Updated] (YARN-953) [YARN-321] Change ResourceManager to use HistoryStorage to log history data
[ https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-953:
Attachment: YARN-953.3.patch

Updated the patch against the latest branch code, and added the code to init/start and stop the writer in case it is a service.

[YARN-321] Change ResourceManager to use HistoryStorage to log history data
Key: YARN-953
URL: https://issues.apache.org/jira/browse/YARN-953
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Zhijie Shen
Attachments: YARN-953.1.patch, YARN-953.2.patch, YARN-953.3.patch
[jira] [Updated] (YARN-696) Enable multiple states to to be specified in Resource Manager apps REST call
[ https://issues.apache.org/jira/browse/YARN-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trevor Lorimer updated YARN-696:
Attachment: YARN-696.diff

Enable multiple states to to be specified in Resource Manager apps REST call
Key: YARN-696
URL: https://issues.apache.org/jira/browse/YARN-696
[jira] [Updated] (YARN-696) Enable multiple states to to be specified in Resource Manager apps REST call
[ https://issues.apache.org/jira/browse/YARN-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trevor Lorimer updated YARN-696:
Attachment: (was: YARN-696.diff)

Enable multiple states to to be specified in Resource Manager apps REST call
Key: YARN-696
URL: https://issues.apache.org/jira/browse/YARN-696
[jira] [Commented] (YARN-953) [YARN-321] Change ResourceManager to use HistoryStorage to log history data
[ https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729785#comment-13729785 ]

Hadoop QA commented on YARN-953:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12596169/YARN-953.3.patch against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1658//console

This message is automatically generated.

[YARN-321] Change ResourceManager to use HistoryStorage to log history data
Key: YARN-953
URL: https://issues.apache.org/jira/browse/YARN-953
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729801#comment-13729801 ]

Zhijie Shen commented on YARN-978:

+1 LGTM. The patch should apply cleanly to trunk.

[YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
Key: YARN-978
URL: https://issues.apache.org/jira/browse/YARN-978
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Xuan Gong
Fix For: YARN-321
Attachments: YARN-978-1.patch, YARN-978.2.patch, YARN-978.3.patch

We don't have an ApplicationAttemptReport and its Protobuf implementation. Adding that.

Thanks,
Mayank
[jira] [Commented] (YARN-696) Enable multiple states to to be specified in Resource Manager apps REST call
[ https://issues.apache.org/jira/browse/YARN-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729825#comment-13729825 ]

Hadoop QA commented on YARN-696:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12596170/YARN-696.diff against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1657//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1657//console

This message is automatically generated.

Enable multiple states to to be specified in Resource Manager apps REST call
Key: YARN-696
URL: https://issues.apache.org/jira/browse/YARN-696
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729833#comment-13729833 ]

Steve Loughran commented on YARN-1024:

I was the one trying to convince Sandy that a uniform core metric is dangerous; it's like when a MIP was a VAX-equivalent million instructions.

# Different parts have different performance in terms of FPU and memory I/O bandwidth, even if the integer perf is the same (hence people like to get Intel parts over AMD parts in EC2 allocations).
# There's also the hyperthreading issue: is an HT core the equivalent of a real core? (No, but Linux treats them the same, AFAIK.)
# Over time, as 2007 gets further away, the metric becomes less relevant.
# EC2 also factors in RAM (e.g. m1.small has the same CPU as m1.medium, only less RAM, yet AWS considers medium as having 2x the ECUs).

One thing I was arguing against in YARN-972 is allocating fractions of a real core: if I say 1 core, I get a single core, irrespective of performance. If ECUs are used and I ask for 1 ECU, does that mean I get 0.50 of a bigger core, or a free upgrade? I'm happy if I ask for 8 ECUs and get a guarantee of not being on a CPU with fewer than 8 ECUs, making it a minimum requirement on the CPU perf.

Define a virtual core unambigiously
Key: YARN-1024
URL: https://issues.apache.org/jira/browse/YARN-1024
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729832#comment-13729832 ]

Mayank Bansal commented on YARN-978:

+1

Thanks,
Mayank

[YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
Key: YARN-978
URL: https://issues.apache.org/jira/browse/YARN-978
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729839#comment-13729839 ]

Sandy Ryza commented on YARN-1024:

If I am used to running my single-threaded task on a fast core (say, rated at 250 YVCs), and then I migrate it to another cluster with slower cores (say, rated at 150 YVCs) and still request 250 YVCs, my task will run no faster than if I had requested 150 YVCs. I won't get predictable performance, and, from a scheduling perspective, I'd be better off requesting 150 YVCs on the slower cluster.

In a single pcore-to-vcore world, if I know that my task is CPU-bound and uses X threads, I know that each vcore I ask for, up to X vcores, will predictably improve its performance, whatever cluster I am running on. In a world where different cores have different YVC ratings, I don't get a clear sense of when I should increase the YVCs I request, and the advantage of doing so depends mostly on the cluster I am running on. A virtual core definition based on processing power masks the fact that two 1.5 GHz cores mean something very different from three 1.0 GHz cores, and it makes it very hard to reason about how many virtual cores to request.

Define a virtual core unambigiously
Key: YARN-1024
URL: https://issues.apache.org/jira/browse/YARN-1024
[jira] [Created] (YARN-1030) Adding AHS as service of RM
Zhijie Shen created YARN-1030: - Summary: Adding AHS as service of RM Key: YARN-1030 URL: https://issues.apache.org/jira/browse/YARN-1030 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen
[jira] [Created] (YARN-1031) JQuery UI components reference external css in branch-23
Jonathan Eagles created YARN-1031: - Summary: JQuery UI components reference external css in branch-23 Key: YARN-1031 URL: https://issues.apache.org/jira/browse/YARN-1031 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.9 Reporter: Jonathan Eagles Assignee: Jonathan Eagles
[jira] [Commented] (YARN-1031) JQuery UI components reference external css in branch-23
[ https://issues.apache.org/jira/browse/YARN-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729865#comment-13729865 ] Jonathan Eagles commented on YARN-1031: --- This issue is not present in branch-2 or trunk. I'm treating this as a minor bug fix rather than bringing the entire set of jQuery themes back into the source base in branch-0.23.
[jira] [Updated] (YARN-1031) JQuery UI components reference external css in branch-23
[ https://issues.apache.org/jira/browse/YARN-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-1031: -- Attachment: YARN-1031-branch-0.23.patch
[jira] [Commented] (YARN-1018) prereq check for AMRMClient.ContainerRequest relaxLocality flag wrong
[ https://issues.apache.org/jira/browse/YARN-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729890#comment-13729890 ] Steve Loughran commented on YARN-1018: -- OK, but there's a risk that an empty array has come from some feature (like the list of past containers), and that if the list is empty then that's because there were no past containers. If the request is rejected when the node list is empty, then you may end up coding {code} boolean strict = nodes.length != 0; new AMRMClient.ContainerRequest(capability, nodes, null, 0, !strict); {code} prereq check for AMRMClient.ContainerRequest relaxLocality flag wrong - Key: YARN-1018 URL: https://issues.apache.org/jira/browse/YARN-1018 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Steve Loughran Priority: Minor Trying to create a container request with no racks/nodes and no relaxed priority fails {code} new AMRMClient.ContainerRequest(capability, null, null, 0, false); {code} expected: a container request. actual: stack trace saying I can't relax node locality.
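Steve's workaround above reduces to deriving the relaxLocality flag from the node list. A standalone sketch of that guard logic (the Hadoop AMRMClient classes are omitted so this compiles on its own; the helper name is hypothetical):

```java
public class RelaxLocalityGuard {
    // Derive the relaxLocality flag the way Steve's snippet does: locality is
    // strict only when specific nodes were actually named in the request.
    static boolean relaxLocalityFor(String[] nodes) {
        boolean strict = nodes != null && nodes.length != 0;
        return !strict;
    }

    public static void main(String[] args) {
        // With named nodes: strict locality, so relaxLocality is false.
        System.out.println(relaxLocalityFor(new String[] {"host1"}));
        // With an empty node list (e.g. no past containers): locality must be relaxed.
        System.out.println(relaxLocalityFor(new String[0]));
    }
}
```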
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729895#comment-13729895 ] Jason Lowe commented on YARN-1024: -- Agree that the example posed by [~sandyr] shows that a single unit in the request cannot properly convey the ask. Chatted briefly about this offline with [~revans2] and [~nroberts] and we think in general there needs to be a way to show the parallelism needed along with some performance guarantee from those threads. That basically leads us to a path where in the generalized case we're asking for a list of vcore units, where the number of entries in the list represents the desired hardware parallelism and the value of each entry represents the performance needed for that execution thread. Using this with Sandy's example, asking for a single unit of 250 YVCs means it would not be allocated on the node with three cores each rated at 150 YVCs because none of the cores meets the single-threaded performance needed by the container. If another job came along and asked for three cores each at 100 YVCs, that could still run on a node that only has a single core rated at 500 YVCs because that core likely has enough horsepower to multitask the three threads and get them each the required performance. I understand where [~ste...@apache.org] is coming from re: dangers of developing one unit to rule them all, but I also think there needs to be *some* way to convey performance requirements. Sandy's example shows that just because a job ran fine with one core on some box doesn't mean the job is going to run fine with one core on another. We will not be able to develop a metric that will cover all the hardware architecture differences, but if a metric works in the vast majority of cases then I think that's a net win over no metric. The APIs are already set for 2.1, and I believe the common case will be jobs where a single thread dominates the overall CPU request of the container. 
In that sense, we can map the existing API call to a single vcore ask and add another API where the ask can be a list/array of vcore asks. This could get complicated in the scheduler for an architecture where the effective vcore rating of the processors is not homogeneous (it brings up the spectre of processor-pinning and per-processor scheduling), but I don't think this will be a common architecture.
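Jason's list-of-vcore-units idea can be sketched as a node feasibility check (all names here are hypothetical, and this is a deliberate simplification that ignores multitasking overhead): the fastest core must cover the largest single-thread ask, and the node's total capacity must cover the aggregate ask.

```java
import java.util.stream.IntStream;

public class VcoreVectorFit {
    // A request is an array with one entry per desired execution thread; each
    // entry is the per-thread performance needed, in YVCs. A node is described
    // by the YVC ratings of its physical cores.
    static boolean fits(int[] ask, int[] coreRatings) {
        int maxAsk = IntStream.of(ask).max().orElse(0);
        int maxCore = IntStream.of(coreRatings).max().orElse(0);
        int sumAsk = IntStream.of(ask).sum();
        int sumCore = IntStream.of(coreRatings).sum();
        // The fastest core must meet the largest single-thread ask, and the
        // node overall must have enough horsepower for all threads combined.
        return maxAsk <= maxCore && sumAsk <= sumCore;
    }

    public static void main(String[] args) {
        // Jason's examples: a single 250-YVC ask is refused by a node with
        // three 150-YVC cores, while three 100-YVC asks fit on one 500-YVC core.
        System.out.println(fits(new int[] {250}, new int[] {150, 150, 150}));
        System.out.println(fits(new int[] {100, 100, 100}, new int[] {500}));
    }
}
```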
[jira] [Commented] (YARN-1031) JQuery UI components reference external css in branch-23
[ https://issues.apache.org/jira/browse/YARN-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729922#comment-13729922 ] Jason Lowe commented on YARN-1031: -- +1, lgtm.
[jira] [Created] (YARN-1032) NPE in RackResolve
Lohit Vijayarenu created YARN-1032: -- Summary: NPE in RackResolve Key: YARN-1032 URL: https://issues.apache.org/jira/browse/YARN-1032 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.5-alpha Environment: linux Reporter: Lohit Vijayarenu Priority: Minor We found a case where our rack resolve script was not returning a rack due to a problem with resolving the host address. This exception was seen in RackResolver.java as an NPE, ultimately caught in RMContainerAllocator. {noformat} 2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.lang.NullPointerException at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99) at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243) at java.lang.Thread.run(Thread.java:722) {noformat}
[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM
[ https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729969#comment-13729969 ] Aaron T. Myers commented on YARN-1029: -- Just to be completely explicit, this is being presented as an alternative to using a separate ZKFC daemon? FWIW, in HDFS we deliberately opted not to do this so that the ZKFC could be completely logically separate from the NN, and so that the ZKFC could one day be made to monitor garbage collections and potentially not trigger a failover if one of those were going on. We have yet to get to the latter. Allow embedding leader election into the RM --- Key: YARN-1029 URL: https://issues.apache.org/jira/browse/YARN-1029 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla It should be possible to embed the common ActiveStandbyElector into the RM such that ZooKeeper based leader election and notification is built in. In conjunction with a ZK state store, this configuration will be a simple deployment option.
[jira] [Commented] (YARN-1032) NPE in RackResolve
[ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729977#comment-13729977 ] Lohit Vijayarenu commented on YARN-1032: Once we hit the exception in RackResolver, since it is not caught and no default rack is returned, we end up not releasing containers that could not be assigned in RMContainerAllocator.java {noformat} assignContainers(allocatedContainers); // release container if we could not assign it it = allocatedContainers.iterator(); while (it.hasNext()) { Container allocated = it.next(); LOG.info("Releasing unassigned and invalid container " + allocated + ". RM may have assignment issues"); containerNotAssigned(allocated); } {noformat} The AM would no longer ask for new containers since it thinks the containers are assigned, and the RM assumes the containers are allocated to the AM. The job ends up hanging forever without making any progress. Fixing container release might be part of another JIRA; at the minimum we need to catch the exception and return the default rack in case of failure.
[jira] [Updated] (YARN-1032) NPE in RackResolve
[ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lohit Vijayarenu updated YARN-1032: --- Attachment: YARN-1032.1.patch Simple patch to catch the NPE and return the default rack. Since it is catching an NPE, I did not try to come up with a test case. Let me know if this looks good.
[jira] [Resolved] (YARN-1031) JQuery UI components reference external css in branch-23
[ https://issues.apache.org/jira/browse/YARN-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles resolved YARN-1031. --- Resolution: Fixed Fix Version/s: 0.23.10 Thanks for the review, Jason.
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730074#comment-13730074 ] Arun C Murthy commented on YARN-1024: - bq. If I am used to running my single-threaded task on a fast core (let's say rated at 250 YVCs), and then I migrate it to another cluster with slower cores (let's say rated at 150 YVCs), and still request 250 YVCs, my task will run no faster than if I had requested it with 150 YVCs. [~sandyr] That is why you'd set a max-vcores in CS/FS of 150. This prevents users from falling into that trap. So, that should solve it - correct?
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730077#comment-13730077 ] Arun C Murthy commented on YARN-1024: - [~jlowe] Yep, it does make sense to talk about a more explicit 'vector of cores' model as we've discussed in the past - that said, I agree it's too early.
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730080#comment-13730080 ] Arun C Murthy commented on YARN-1024: - Overall, yes, there are certainly issues with a strict definition of a vcore, etc., but we need to do *just enough* for now - not solve all possible permutations. Basic requirements are simplicity, predictability and consistency - in that order.
[jira] [Commented] (YARN-696) Enable multiple states to be specified in Resource Manager apps REST call
[ https://issues.apache.org/jira/browse/YARN-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730095#comment-13730095 ] Trevor Lorimer commented on YARN-696: - In this patch I changed state to states and enabled comma-separated state queries. However, I could not find a way to create tests where I can be sure multiple applications with different states exist at a specific time. Are there any examples where an application can be created with a predefined static state? Enable multiple states to be specified in Resource Manager apps REST call Key: YARN-696 URL: https://issues.apache.org/jira/browse/YARN-696 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.4-alpha Reporter: Trevor Lorimer Assignee: Trevor Lorimer Priority: Trivial Attachments: YARN-696.diff Within the YARN Resource Manager REST API, the GET call which returns all Applications can be filtered by a single State query parameter (http://rm http address:port/ws/v1/cluster/apps). There are 8 possible states (New, Submitted, Accepted, Running, Finishing, Finished, Failed, Killed); if no state parameter is specified all states are returned, however if a sub-set of states is required then multiple REST calls are required (max. of 7). The proposal is to be able to specify multiple states in a single REST call.
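The comma-separated states query Trevor describes could be parsed on the server side roughly as follows (a sketch, not the actual patch; the enum and method names are hypothetical):

```java
import java.util.EnumSet;
import java.util.Locale;

public class StatesParam {
    enum State { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHING, FINISHED, FAILED, KILLED }

    // Split the "states" query parameter on commas and collect the requested
    // states; an unknown state name raises IllegalArgumentException via valueOf.
    static EnumSet<State> parseStates(String param) {
        EnumSet<State> set = EnumSet.noneOf(State.class);
        for (String s : param.split(",")) {
            set.add(State.valueOf(s.trim().toUpperCase(Locale.ROOT)));
        }
        return set;
    }

    public static void main(String[] args) {
        // e.g. GET /ws/v1/cluster/apps?states=running,finished
        System.out.println(parseStates("running,finished"));
    }
}
```

The filter then keeps an application if its state is contained in the parsed set, so a single REST call covers any subset of the 8 states.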
[jira] [Commented] (YARN-1032) NPE in RackResolve
[ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730153#comment-13730153 ] Zhijie Shen commented on YARN-1032: --- {code} rNameList == null {code} Looking into the DNSToSwitchMapping doc, resolve() does not seem to return null. Probably, you want to check {code} rNameList.size() == 0 {code} Please add a test case in TestRackResolver.
[jira] [Commented] (YARN-1032) NPE in RackResolve
[ https://issues.apache.org/jira/browse/YARN-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730167#comment-13730167 ] Lohit Vijayarenu commented on YARN-1032: [~zjshen] Yes, the documentation does not mention returning null from resolve(), but if you look into RawScriptBasedMapping::resolve(), failure to resolve a rack can return null in at least two places, hence the null check. Thanks for pointing out TestRackResolver; I will try to add a test case.
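Lohit's fix direction, combined with Zhijie's empty-list concern, can be sketched as a defensive fallback (names and structure hypothetical; the real change would live in RackResolver.coreResolve, and "/default-rack" mirrors Hadoop's conventional default):

```java
import java.util.List;

public class DefaultRackFallback {
    static final String DEFAULT_RACK = "/default-rack";

    // Fall back to the default rack whenever the topology mapping gives us
    // nothing usable: a null list, an empty list, or a null first entry.
    static String resolve(List<String> rNameList) {
        if (rNameList == null || rNameList.isEmpty() || rNameList.get(0) == null) {
            return DEFAULT_RACK;
        }
        return rNameList.get(0);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));             // mapping script failed
        System.out.println(resolve(List.of("/rackA"))); // normal resolution
    }
}
```

With this guard in place the NPE never propagates to RMContainerAllocator, so the unassigned containers are released normally instead of the job hanging.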
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730195#comment-13730195 ] Sandy Ryza commented on YARN-1024: -- Jason, Steve, and Arun, you bring up good points that I think have helped me understand some of my assumptions. I agree that simplicity, predictability, and consistency are our most important requirements. I agree with Jason that at least two values - processing power per core and # of cores - are required to fully express a request, and that, in spite of this, we should not use both and that a single value is better than nothing. We have a tradeoff between * A definition that offers some predictability between clusters, but only makes sense for requests for a single physical core or less per container. * A definition that offers predictability only on homogeneous hardware, but that functions sensibly for requests for both more and less than a single physical core. I thought that one of the exciting things about allowing requests for CPU would be that YARN would be able to better accommodate multi-threaded CPU-intensive frameworks like MPI and Storm. Predictability between clusters seems to matter a lot less to me. A ton of other factors interfere with this kind of predictability. The speed that hardware permits a task to read from disk or over the network can have just as large an impact on the processing power it consumes as whatever the task is doing. I don't believe that we will be able to attain predictability to the degree that it will provide much value.
[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM
[ https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730209#comment-13730209 ] Bikas Saha commented on YARN-1029: -- Yes, that's correct. I am aware of the HDFS discussions. ZKFC is definitely going to be part of RM failover and supported. Given the RM's lower memory consumption and sane values of ZK timeouts, the GC problem may not be severe in the RM's case. On the other hand, with RM state also being stored in ZK, having an embedded FC may considerably simplify deployment and maintenance of RM failover. So it's not a bad option to have.
[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM
[ https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730224#comment-13730224 ] Aaron T. Myers commented on YARN-1029: -- Sounds good to me. I think we should seriously consider moving the ZKFC functionality into the NN as well, since in practice I don't think it has bought us much of anything and it definitely complicates the deployment. But that's another discussion for another day. Thanks, Bikas. Allow embedding leader election into the RM --- Key: YARN-1029 URL: https://issues.apache.org/jira/browse/YARN-1029 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Karthik Kambatla It should be possible to embed the common ActiveStandbyElector into the RM such that ZooKeeper-based leader election and notification is built in. In conjunction with a ZK state store, this configuration will be a simple deployment option.
[jira] [Commented] (YARN-1024) Define a virtual core unambigiously
[ https://issues.apache.org/jira/browse/YARN-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730301#comment-13730301 ] Eli Collins commented on YARN-1024: --- I agree we need to define the meaning of a virtual core unambiguously (otherwise we won't be able to support two different frameworks on the same cluster that may have differing ideas of what a vcore is). I also agree with Phil that there are essentially two high-level use cases: 1. Jobs that want to express how much CPU capacity the job needs. Real world example - a distcp job wants to express it needs 100 containers but only a fraction of a CPU for each since it will spend most of its time blocking on IO. 2. Services - ie long-lived frameworks (ie support 2-level scheduling) - that want to request cores on many machines on a cluster and want to express CPU-level parallelism and aggregate demand (because they will schedule fine-grain requests w/in their long-lived containers). Eg a framework should be able to ask for two containers on a host, each with one core, so it can get two containers that can execute in parallel on a full core. This is assuming we plan to support long-running services in Yarn (YARN-896), which is hopefully not controversial. Real world example is HBase which may want 2 guaranteed cores per host on a given set of hosts. Seems like there are two high-level approaches: 1. Get rid of vcores. If we define 1vcore=1pcore (1vcore=1vcpu for virtual environments) and support fractional cores (YARN-972) then services can ask for 1 or more vcores knowing they're getting real cores and jobs just ask for what fraction of a vcore they think they need. This is really abandoning the concept of a virtual core because it's actually expressing a physical requirement (like memory, we assume Yarn is not dramatically over-committing the host). 
We can handle heterogeneous CPUs via attributes (as discussed in other Yarn jiras) since most clusters in my experience don't have wildly different processors (eg 1 or 2 generations is common), and attributes are sufficient to express policies like all my cores should have equal/comparable performance. 2. Keep going with vcores as a CPU unit of measurement. If we define 1vcore=1ECU (works 1:1 for virtual environments) then services need to understand the power of a core so they can ask for that many vcores - essentially they are just undoing the virtualization. YARN would need to make sure two containers each with 1 pcore's worth of vcores does in fact give you two cores (just like hypervisors schedule vcpus for the same VM on different pcores to ensure parallelism), but there would be no guarantee that two containers on the same host each w/ one vcore would run in parallel. Jobs that want fractional cores would just express 1vcore per container and work their way up based on experience running on the cluster (or also undo the virtualization by calculating vcore/pcore if they know what fraction of a pcore they want). Heterogeneous CPUs do not fall out naturally (still need attributes) since there's no guarantee the difference between two CPUs can be described as roughly 1 or more vcores (eg 2.4 vs 2.0 GHz at 1 ECU), however there's no need for fractional vcores. 
I think either is reasonable and can be made to work, though I think #1 is preferable because: - Some frameworks want to express containers in physical resources (this is consistent with how YARN handles memory) - You can support jobs that don't want a full core via fractional cores (or slightly over-committing cores) - You can support heterogeneous cores via attributes (I want equivalent containers) - vcores are optional anyway (only used in DRF) and therefore only need to be expressed if you care about physical cores because you need to reserve them or say you want a fraction of one Either way I think vcore is the wrong name because in #1 1vcore=1pcore so there's no virtualization and in #2 1 vcore is not a virtualization of a core (10 vcores does not give me 10 levels of parallelism), it's _just a unit_ (like an ECU). Define a virtual core unambigiously --- Key: YARN-1024 URL: https://issues.apache.org/jira/browse/YARN-1024 Project: Hadoop YARN Issue Type: Improvement Reporter: Arun C Murthy Assignee: Arun C Murthy We need to clearly define the meaning of a virtual core unambiguously so that it's easy to migrate applications between clusters. For e.g. here is Amazon EC2 definition of ECU: http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it Essentially we need to clearly define a YARN Virtual Core (YVC). Equivalently, we can use ECU itself: *One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.*
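The two proposals in the comment above can be contrasted with a small sketch. All numbers and function names here are illustrative assumptions, not YARN APIs: the distcp and HBase figures come from the examples in the comment.

```python
# Illustrative sketch of the two proposed vcore definitions.

# Approach 1: 1 vcore = 1 physical core, with fractional requests allowed,
# so a request directly expresses a physical CPU share.
def aggregate_vcores_physical(fraction_of_core: float, containers: int) -> float:
    return fraction_of_core * containers

# Approach 2: 1 vcore = 1 ECU; a service must "undo" the virtualization by
# knowing how many ECUs one physical core on this cluster is worth.
def vcores_for_pcores_ecu(pcores_wanted: int, ecus_per_pcore: float) -> float:
    return pcores_wanted * ecus_per_pcore

# distcp-style job under approach 1: 100 containers, each needing about a
# tenth of a core, for roughly 10 cores of aggregate demand.
distcp_demand = aggregate_vcores_physical(0.1, 100)

# HBase-style service under approach 2: 2 guaranteed physical cores on a
# host whose cores are worth 2 ECU each, so it must request 4 vcores.
hbase_request = vcores_for_pcores_ecu(2, 2.0)
```

The asymmetry is the comment's point: under approach 1 the request is the physical requirement, while under approach 2 the caller must know the cluster's ECU-per-core ratio to get parallelism guarantees.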
[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730316#comment-13730316 ] Bikas Saha commented on YARN-1021: -- The idea and goals are very interesting. It would be great if there were a design description to initiate a discussion. Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm very well before we deploy it in a production cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm performs for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads in a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. 
To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real-time metrics while executing, including: * Resource usage for the whole cluster and each queue, which can be used to configure cluster and queue capacity. * The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual job turnaround time, throughput, fairness, capacity guarantee, etc). * Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc), which can be utilized by Hadoop developers to find code hot spots and scalability limits. The simulator will provide real-time charts showing the behavior of the scheduler and its performance.
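The "scheduler wrapper" idea in the description above can be sketched as a thin proxy that times each scheduler operation. This is a hypothetical sketch: the class names and the fake scheduler are invented for illustration, and the real simulator (being a YARN component) would do this in Java around the actual scheduler interface.

```python
import time

# Sketch of a scheduler wrapper: intercept every scheduler operation and
# record its time cost, without changing the scheduler's behavior.
class SchedulerWrapper:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.timings = {}  # operation name -> list of elapsed seconds

    def __getattr__(self, name):
        # Forward unknown attributes to the wrapped scheduler, timing calls.
        op = getattr(self.scheduler, name)
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return op(*args, **kwargs)
            finally:
                self.timings.setdefault(name, []).append(
                    time.perf_counter() - start)
        return timed

# Stand-in for a real scheduler, purely for demonstration.
class FakeScheduler:
    def allocate(self, request):
        return f"container-for-{request}"

wrapped = SchedulerWrapper(FakeScheduler())
result = wrapped.allocate("app-1")
```

After a simulated run, `wrapped.timings` holds per-operation latency samples, which is exactly the kind of data the description proposes feeding into real-time charts.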