+1 (binding)

Thanks Daniel for helping to backport this. I also ran various performance test cases, including the unit test perf runs mentioned and the SLS tests.
In SLS tests, I found that the performance difference between branch-3.0 and the resource-types branch is minimal. I ran test scenarios with 8k nodes and 4k nodes, and saw no performance regressions when I used 2 resource types. I could get around 2800 container allocations per second on my machine with 8k nodes.

Other than this, I have also gone through the branch code and trunk. I could see that all major changes related to recent performance improvements are pulled in.

- Sunil

On Sat, Oct 28, 2017 at 8:20 PM Daniel Templeton <dan...@cloudera.com> wrote:

> As promised, here are the updated performance numbers.
>
> Performance reporting is always a tricky business. I'll do my best here
> to fairly represent the state of things. We've run a number of
> performance tests. Those tests include TestCapacitySchedulerPerf, SLS,
> and actual cluster testing.
>
> The summary is that in most scenarios, the resource-types branch is very
> close to branch-3.0 in performance. There are some large scale SLS
> tests that show a performance drop, but we have not been able to
> replicate those findings on an actual cluster. Additional cluster
> testing is still in progress.
>
> = TestCapacitySchedulerPerf =
> This unit test, added with YARN-7136, does a tight loop over the
> scheduler's handling of node update events. The net effect is similar
> to running 100 apps through 1 queue in a 2-node cluster. I also
> modified it to run with fair scheduler and configured it with assign
> multiple enabled and set to the max containers supported by the cluster.
>
> - Capacity scheduler -
> Performance of resource-types v/s branch-3.0: 1.0 (no change)
> Performance of resource-types v/s trunk: 1.16 (16% *better*)
> - Fair scheduler -
> Performance of resource-types + YARN-7374 v/s branch-3.0: 1.25 (25%
> *better*)
> Performance of resource-types + YARN-7374 v/s trunk: 1.04 (4% *better*)
>
> These results seem a little optimistic when compared with the SLS
> results, but at worst they provide evidence that the resource types
> changes do not have a significant negative impact.
>
> Wangda and Sunil did some independent testing with this unit test and
> found no significant difference between branch-3.0 and resource-types.
>
> = SLS =
> For SLS, we tested a wide range of scenarios with different node, app,
> task, and queue counts. We ran these tests for capacity and fair
> scheduler.
>
> The net result is that for the majority of the scenarios we tested, the
> resource-types branch performance was within 95% to 105% of branch-3.0
> performance. We looked at the numbers for only the allocation time and
> node update event processing time, as the other numbers returned by SLS
> are not relevant here. I'm not reporting specific numbers because of
> the volume of tests run, and because reporting any kind of aggregate
> result would be inherently skewed by the mix of tests we chose to run,
> and hence would be misleading.
>
> There were a few large node count+large queue count+large app count
> scenarios where resource-types showed a larger performance degradation
> versus branch-3.0 when comparing mean node update time over the entire
> run. Mean is a lossy metric here, as we're trying to summarize an
> entire time series in a single number, but it's about the best we're
> gonna do. While these results aren't encouraging, bear in mind that
> they are specifically for the time to process a node update, which does
> not necessarily translate directly into overall cluster performance.
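
To illustrate why the mean is a lossy summary of a time series, here is a minimal Python sketch. The sample values and the `summarize` helper are hypothetical, made up for illustration, and are not taken from the SLS runs or from any SLS API. It only shows that two runs with the same mean node update time can behave very differently.

```python
import statistics


def summarize(samples_ms):
    """Return mean, median, and nearest-rank 95th percentile of the samples."""
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integer arithmetic.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Hypothetical node update times in milliseconds (not measured data).
steady = [10.0] * 100                 # every update takes the same time
spiky = [5.0] * 90 + [55.0] * 10      # mostly fast, with a burst of slow updates

for name, series in (("steady", steady), ("spiky", spiky)):
    print(name, summarize(series))
```

Both hypothetical series have the same mean, yet their medians and 95th percentiles differ, which is the kind of detail a single mean over an entire run cannot show.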
>
> Wangda and Sunil did some independent testing with SLS and found no
> significant difference between branch-3.0 and resource-types.
>
> = Cluster Testing =
> Because of the large SLS scenarios that showed a performance
> degradation, we have done performance testing on actual clusters. These
> tests are still ongoing, but thus far the results have shown no
> discernible difference in overall throughput between branch-3.0 and
> resource-types. Overall throughput for both branches falls into
> the same range.
>
> Daniel
>
> On 10/24/17 10:56 AM, Daniel Templeton wrote:
> > I'd like to formally start the voting process for merging the
> > resource-types branch into branch-3.0. The resource-types branch is a
> > selective backport of JIRAs that were already merged into trunk in a
> > previous merge vote for YARN-3926 (resource types) [1]. For a full
> > explanation of the feature, benefits, and risks, see the previous
> > DISCUSS thread [2]. The vote will be 7 days, ending Tuesday Oct 31 at
> > 11:00AM PDT.
> >
> > In summary, resource types adds the ability to declaratively configure
> > new resource types in addition to CPU and memory and request them when
> > submitting resource requests. The resource-types branch currently
> > represents 32 patches from trunk drawn from the resource types
> > umbrella JIRAs: YARN-3926 [3] and YARN-7069 [4].
> >
> > Key points:
> > * If no additional resource types are configured, the user experience
> > with YARN remains unchanged.
> > * Performance is the primary risk. We have been closely watching the
> > performance impact of adding resource types, and according to current
> > measurements the impact is trivial.
> > * This merge vote is for resource types excluding the resource
> > profiles feature which was included in the original merge vote [1].
> > * Documentation is available in trunk via YARN-7056 [5] with
> > improvements pending review in YARN-7369 [6].
> >
> > Refreshed performance numbers on the resource-types branch are
> > pending, and I'll post them to this thread as soon as they're ready.
> >
> > Thanks!
> > Daniel
> >
> > [1] http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201708.mbox/%3ccad++ecm6xss4_kxp4audf85_rgg4pzxkuox7u2vp8tfzmy4...@mail.gmail.com%3E
> > [2] http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201710.mbox/%3Caa2bcc6d-9d88-459d-63f4-5bb43e31f4f4%40cloudera.com%3E
> > [3] https://issues.apache.org/jira/browse/YARN-3926
> > [4] https://issues.apache.org/jira/browse/YARN-7069
> > [5] https://issues.apache.org/jira/browse/YARN-7056
> > [6] https://issues.apache.org/jira/browse/YARN-7369
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org