Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk closed pull request #48501: [SPARK-49490][SQL] Add benchmarks for initCap URL: https://github.com/apache/spark/pull/48501 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on PR #48501:
URL: https://github.com/apache/spark/pull/48501#issuecomment-2490340634

+1, LGTM. Merging to master.
Thank you, @mrk-andreev and @stevomitric @uros-db for review.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1851026518

## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ##
@@ -185,6 +185,48 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
     }
     benchmark.run(relativeTime = true)
   }
+
+  def benchmarkInitCap(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    type collationType = Int
+    type InitCapEstimator = (UTF8String, collationType) => Unit
+    def skipCollationTypeFilter: Any => Boolean = _ => true
+    def createBenchmark(
+        implName: String,
+        impl: InitCapEstimator,
+        collationTypeFilter: String => Boolean): Unit = {
+      val benchmark = new Benchmark(
+        s"collation unit benchmarks - initCap using impl ${implName}",

Review Comment:
Fixed

## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ##
@@ -185,6 +185,48 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
     }
     benchmark.run(relativeTime = true)
   }
+
+  def benchmarkInitCap(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    type collationType = Int
+    type InitCapEstimator = (UTF8String, collationType) => Unit
+    def skipCollationTypeFilter: Any => Boolean = _ => true
+    def createBenchmark(
+        implName: String,
+        impl: InitCapEstimator,
+        collationTypeFilter: String => Boolean): Unit = {
+      val benchmark = new Benchmark(
+        s"collation unit benchmarks - initCap using impl ${implName}",
+        utf8Strings.size * 10,
+        warmupTime = 10.seconds,
+        output = output)
+      collationTypes.filter(collationTypeFilter).foreach { collationType => {
+        val collationId = CollationFactory.collationNameToId(collationType)
+        benchmark.addCase(s"$collationType") { _ =>

Review Comment:
Fixed
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1851026876

## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ##
@@ -185,6 +185,48 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
     }
     benchmark.run(relativeTime = true)
   }
+
+  def benchmarkInitCap(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    type collationType = Int

Review Comment:
Thank you. Fixed
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1850057061

## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ##
@@ -185,6 +185,48 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
     }
     benchmark.run(relativeTime = true)
   }
+
+  def benchmarkInitCap(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    type collationType = Int
+    type InitCapEstimator = (UTF8String, collationType) => Unit
+    def skipCollationTypeFilter: Any => Boolean = _ => true
+    def createBenchmark(
+        implName: String,
+        impl: InitCapEstimator,
+        collationTypeFilter: String => Boolean): Unit = {
+      val benchmark = new Benchmark(
+        s"collation unit benchmarks - initCap using impl ${implName}",

Review Comment:
nit: the enclosing braces are redundant:
```suggestion
        s"collation unit benchmarks - initCap using $implName",
```

## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ##
@@ -185,6 +185,48 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
     }
     benchmark.run(relativeTime = true)
   }
+
+  def benchmarkInitCap(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    type collationType = Int

Review Comment:
This is a collation id, and type names should begin with an upper-case letter.

## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ##
@@ -185,6 +185,48 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
     }
     benchmark.run(relativeTime = true)
   }
+
+  def benchmarkInitCap(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    type collationType = Int
+    type InitCapEstimator = (UTF8String, collationType) => Unit
+    def skipCollationTypeFilter: Any => Boolean = _ => true
+    def createBenchmark(
+        implName: String,
+        impl: InitCapEstimator,
+        collationTypeFilter: String => Boolean): Unit = {
+      val benchmark = new Benchmark(
+        s"collation unit benchmarks - initCap using impl ${implName}",
+        utf8Strings.size * 10,
+        warmupTime = 10.seconds,
+        output = output)
+      collationTypes.filter(collationTypeFilter).foreach { collationType => {
+        val collationId = CollationFactory.collationNameToId(collationType)
+        benchmark.addCase(s"$collationType") { _ =>

Review Comment:
nit:
```suggestion
        benchmark.addCase(collationType) { _ =>
```
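For readers following the three nits in this review, here is a minimal standalone sketch of why each suggestion leaves behavior unchanged. The object name, the sample values, and the alias name `CollationId` are assumptions made for illustration; the thread does not show which name the author finally chose.

```scala
object InterpolationNits {
  // Hypothetical sample values, used only for illustration.
  val implName: String = "ICU"
  val collationType: String = "UNICODE_CI"

  // Braces are only required around expressions, or when the identifier is
  // immediately followed by characters that could extend its name.
  val withBraces: String = s"initCap using impl ${implName}"
  val withoutBraces: String = s"initCap using impl $implName"

  // Interpolating a lone String yields an equal String, so the value can be
  // passed directly instead of being wrapped in s"...".
  val interpolated: String = s"$collationType"

  // Assumed alias name: it says what the Int is (a collation id) and starts
  // with an upper-case letter, as type names conventionally do in Scala.
  type CollationId = Int
  val sampleId: CollationId = 0

  def main(args: Array[String]): Unit = {
    println(withBraces == withoutBraces)   // both spellings build the same text
    println(interpolated == collationType) // the wrapper adds nothing
  }
}
```

Both comparisons hold, which is why the suggestions are purely cosmetic.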
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501:
URL: https://github.com/apache/spark/pull/48501#issuecomment-2486766722

Hi @MaxGekk, @stevomitric,

Does this PR need any additional changes? Are there any blockers we should address? Let me know how I can help to move it forward!
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501:
URL: https://github.com/apache/spark/pull/48501#issuecomment-2480651570

cc: @MaxGekk

# Related work

This is not related to my code changes but rather to the benchmarks we are modifying. It might be worth starting a separate thread on the dev mailing list or creating an additional ticket in Jira, which I would be happy to handle.

## Blackhole

I would like to point out that the current implementation of org.apache.spark.benchmark.Benchmark::addCase does not use any form of Blackhole ([Blackhole in JMH](https://github.com/openjdk/jmh/blob/master/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#L155)), which could lead to dead-code elimination. However, I have not observed this issue in the existing tests. This is likely because the complexity and side effects of the code being benchmarked prevent such elimination.

Would it be a good idea to consider adding this as a feature in the future?

### Context

`org.apache.spark.benchmark.Benchmark::addCase`

```
def addCase(name: String, numIters: Int = 0)(f: Int => Unit): Unit = {
  addTimerCase(name, numIters) { timer =>
    timer.startTiming()
    f(timer.iteration)
    timer.stopTiming()
  }
}
```

## Async-profiler

I suggest adding [Async Profiler](https://github.com/async-profiler/async-profiler), a low-overhead sampling profiler, to all benchmark runs. This will help us identify the causes of performance degradation.

Would it also be worth considering adding this as a feature in the future?
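To make the Blackhole idea concrete, here is a minimal sketch of a result sink. Everything in it (`BlackholeSketch`, `consume`, `runCase`) is hypothetical and is not part of org.apache.spark.benchmark.Benchmark; it is also far simpler than JMH's Blackhole. The assumption is only that writing each result to a volatile field gives the JIT a visible side effect, so the computation that produced the result is less likely to be removed as dead code.

```scala
object BlackholeSketch {
  // Hypothetical sink. A volatile write is a side effect the JIT has to keep,
  // so the value written to it must actually be computed.
  @volatile private var sink: Any = null

  def consume(value: Any): Unit = {
    sink = value
  }

  // Hypothetical case runner: the benchmarked function returns its result and
  // the harness consumes it instead of silently discarding it.
  def runCase[T](numRows: Int)(f: Int => T): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < numRows) {
      consume(f(i))
      i += 1
    }
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    val elapsedNs = runCase(100000)(i => Integer.toString(i).toUpperCase)
    println(s"elapsed: $elapsedNs ns")
  }
}
```

The design choice here is to change the function type from `Int => Unit` to `Int => T`, so that the benchmark author cannot forget to hand the result over; whether that fits Spark's existing benchmarks is an open question for the proposed follow-up thread.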
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501:
URL: https://github.com/apache/spark/pull/48501#issuecomment-2480647831

# Related work

This is not related to my code changes but rather to the benchmarks we are modifying. It might be worth starting a separate thread on the dev mailing list or creating an additional ticket in Jira, which I would be happy to handle.

## Blackhole

I would like to point out that the current implementation of org.apache.spark.benchmark.Benchmark::addCase does not use any form of Blackhole ([Blackhole in JMH](https://github.com/openjdk/jmh/blob/master/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#L155)), which could lead to dead-code elimination. However, I have not observed this issue in the existing tests. This is likely because the complexity and side effects of the code being benchmarked prevent such elimination.

Would it be a good idea to consider adding this as a feature in the future?

### Context

`org.apache.spark.benchmark.Benchmark::addCase`

```
def addCase(name: String, numIters: Int = 0)(f: Int => Unit): Unit = {
  addTimerCase(name, numIters) { timer =>
    timer.startTiming()
    f(timer.iteration)
    timer.stopTiming()
  }
}
```

## Async-profiler

I suggest adding [Async Profiler](https://github.com/async-profiler/async-profiler), a low-overhead sampling profiler, to all benchmark runs. This will help us identify the causes of performance degradation.

Would it also be worth considering adding this as a feature in the future?
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501:
URL: https://github.com/apache/spark/pull/48501#issuecomment-2480474218

I would like to point out that the current implementation of org.apache.spark.benchmark.Benchmark::addCase does not use any form of Blackhole ([Blackhole in JMH](https://github.com/openjdk/jmh/blob/master/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#L155)), which could lead to dead-code elimination. However, I have not observed this issue in the existing tests. This is likely because the complexity and side effects of the code being benchmarked prevent such elimination.

Would it be a good idea to consider adding this as a feature in the future?

---

# Context

`org.apache.spark.benchmark.Benchmark::addCase`

```
def addCase(name: String, numIters: Int = 0)(f: Int => Unit): Unit = {
  addTimerCase(name, numIters) { timer =>
    timer.startTiming()
    f(timer.iteration)
    timer.stopTiming()
  }
}
```
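The shape of the quoted `addCase` can be mimicked in a self-contained way to show where a result would get lost. This is only a sketch: the `Timer` class and `timeCase` below are hypothetical stand-ins written for this note, not Spark's actual classes, and they assume nothing beyond what the quoted snippet shows (start timing, call `f` with the iteration number, stop timing).

```scala
object AddCaseSketch {
  // Hypothetical stand-in for the timer handed to the benchmarked block.
  final class Timer(val iteration: Int) {
    private var startedAt: Long = 0L
    private var accumulated: Long = 0L
    def startTiming(): Unit = { startedAt = System.nanoTime() }
    def stopTiming(): Unit = { accumulated += System.nanoTime() - startedAt }
    def totalTime: Long = accumulated
  }

  // Same shape as the quoted addCase body: only the call to f is timed.
  // Because f returns Unit, whatever f computes is dropped unless f itself
  // stores it somewhere observable.
  def timeCase(iteration: Int)(f: Int => Unit): Long = {
    val timer = new Timer(iteration)
    timer.startTiming()
    f(timer.iteration)
    timer.stopTiming()
    timer.totalTime
  }

  def main(args: Array[String]): Unit = {
    val ns = timeCase(0) { _ =>
      // The result of this expression is discarded by the Unit return type.
      "hello world".toUpperCase
      ()
    }
    println(s"elapsed: $ns ns")
  }
}
```

The `Int => Unit` signature is the point of interest: it is what leaves the decision about keeping results alive to each individual benchmark body.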
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1844404602

## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ##
@@ -1,54 +1,88 @@
-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 --
-UTF8_BINARY 1353 1357 5 0.1 13532.2 1.0X
-UTF8_LCASE 2601 2602 2 0.0 26008.0 1.9X
-UNICODE 16745 16756 16 0.0 167450.9 12.4X
-UNICODE_CI 16590 16627 52 0.0 165904.8 12.3X
+UTF8_BINARY 2220 2223 5 0.0 22197.0 1.0X
+UTF8_LCASE 4949 4950 2 0.0 49488.1 2.2X
+UNICODE 28172 28198 36 0.0 281721.0 12.7X
+UNICODE_CI 28233 28308 106 0.0 282328.2 12.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 ---
-UTF8_BINARY 1746 1746 0 0.1 17462.6 1.0X
-UTF8_LCASE 2629 2630 1 0.0 26294.8 1.5X
-UNICODE 16744 16744 0 0.0 167438.6 9.6X
-UNICODE_CI 16518 16521 4 0.0 165180.2 9.5X
+UTF8_BINARY 2731 2733 2 0.0 27313.6 1.0X
+UTF8_LCASE 4611 4619 11 0.0 46111.4 1.7X
+UNICODE 28149 28211 88 0.0 281486.8 10.3X
+UNICODE_CI 27535 27597 89 0.0 275348.4 10.1X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
-UTF8_BINARY 2808 2808 1 0.0 28076.2 1.0X
-UTF8_LCASE 5409 5410 0 0.0 54093.0 1.9X
-UNICODE 67930 67957 38 0.0 679296.7 24.2X
-UNICODE_CI 56004 56005 1 0.0 560044.2 19.9X
+UTF8_BINARY 4603 4618 22 0.0 46031.3 1.0X
+UTF8_LCASE 9510 9518 11 0.0 95097.7 2.1X
+UNICODE 135718 135786 97 0.0 1357176.2 29.5X
+UNICODE_CI 113715 113819 148 0.0 1137145.8 24.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
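As a reading aid for the "Relative time" column, the sketch below recomputes it from the "Per Row(ns)" values of the new equalsFunction rows. It assumes that the column is each case's per-row time divided by that of the first case (the 1.0X row); the object and method names are made up for this note. Under that assumption the ratios come out at roughly 2.23, 12.69 and 12.72, which round to the 2.2X, 12.7X and 12.7X shown in the diff.

```scala
object RelativeTimeSketch {
  // Per Row(ns) values copied from the "+" rows of the equalsFunction table.
  val perRowNs: Seq[(String, Double)] = Seq(
    "UTF8_BINARY" -> 22197.0,
    "UTF8_LCASE" -> 49488.1,
    "UNICODE" -> 281721.0,
    "UNICODE_CI" -> 282328.2)

  // Assumption: the first case is the baseline, matching the 1.0X row.
  def relative(rows: Seq[(String, Double)]): Seq[(String, Double)] = {
    val baseline = rows.head._2
    rows.map { case (name, ns) => name -> ns / baseline }
  }

  def main(args: Array[String]): Unit = {
    relative(perRowNs).foreach { case (name, ratio) =>
      println(f"$name%-12s $ratio%.1fX")
    }
  }
}
```

Because the old and new rows were measured on different hardware (AMD EPYC on Azure versus Intel Xeon on AWS, per the diff header), only the relative column is meaningfully comparable between the "-" and "+" rows.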
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1844407932

## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ##
@@ -1,54 +1,88 @@
-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 --
-UTF8_BINARY 1349 1349 0 0.1 13485.4 1.0X
-UTF8_LCASE 3559 3561 3 0.0 35594.3 2.6X
-UNICODE 17580 17589 12 0.0 175803.6 13.0X
-UNICODE_CI 17210 17212 2 0.0 172100.2 12.8X
+UTF8_BINARY 2220 2223 5 0.0 22197.0 1.0X
+UTF8_LCASE 4949 4950 2 0.0 49488.1 2.2X
+UNICODE 28172 28198 36 0.0 281721.0 12.7X
+UNICODE_CI 28233 28308 106 0.0 282328.2 12.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 ---
-UTF8_BINARY 1740 1741 1 0.1 17398.8 1.0X
-UTF8_LCASE 2630 2632 3 0.0 26301.0 1.5X
-UNICODE 16732 16743 16 0.0 167319.7 9.6X
-UNICODE_CI 16482 16492 14 0.0 164819.7 9.5X
+UTF8_BINARY 2731 2733 2 0.0 27313.6 1.0X
+UTF8_LCASE 4611 4619 11 0.0 46111.4 1.7X
+UNICODE 28149 28211 88 0.0 281486.8 10.3X
+UNICODE_CI 27535 27597 89 0.0 275348.4 10.1X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
-UTF8_BINARY 2808 2808 0 0.0 28082.3 1.0X
-UTF8_LCASE 5412 5413 1 0.0 54123.5 1.9X
-UNICODE 70755 70787 44 0.0 707553.4 25.2X
-UNICODE_CI 57639 57669 43 0.0 576390.0 20.5X
+UTF8_BINARY 4603 4618 22 0.0 46031.3 1.0X
+UTF8_LCASE 9510 9518 11 0.0 95097.7 2.1X
+UNICODE 135718 135786 97 0.0 1357176.2 29.5X
+UNICODE_CI 113715 113819 148 0.0 1137145.8 24.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
stevomitric commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1843598536

## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ##
@@ -1,54 +1,88 @@
-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 --
-UTF8_BINARY 1353 1357 5 0.1 13532.2 1.0X
-UTF8_LCASE 2601 2602 2 0.0 26008.0 1.9X
-UNICODE 16745 16756 16 0.0 167450.9 12.4X
-UNICODE_CI 16590 16627 52 0.0 165904.8 12.3X
+UTF8_BINARY 2220 2223 5 0.0 22197.0 1.0X
+UTF8_LCASE 4949 4950 2 0.0 49488.1 2.2X
+UNICODE 28172 28198 36 0.0 281721.0 12.7X
+UNICODE_CI 28233 28308 106 0.0 282328.2 12.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 ---
-UTF8_BINARY 1746 1746 0 0.1 17462.6 1.0X
-UTF8_LCASE 2629 2630 1 0.0 26294.8 1.5X
-UNICODE 16744 16744 0 0.0 167438.6 9.6X
-UNICODE_CI 16518 16521 4 0.0 165180.2 9.5X
+UTF8_BINARY 2731 2733 2 0.0 27313.6 1.0X
+UTF8_LCASE 4611 4619 11 0.0 46111.4 1.7X
+UNICODE 28149 28211 88 0.0 281486.8 10.3X
+UNICODE_CI 27535 27597 89 0.0 275348.4 10.1X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
-UTF8_BINARY 2808 2808 1 0.0 28076.2 1.0X
-UTF8_LCASE 5409 5410 0 0.0 54093.0 1.9X
-UNICODE 67930 67957 38 0.0 679296.7 24.2X
-UNICODE_CI 56004 56005 1 0.0 560044.2 19.9X
+UTF8_BINARY 4603 4618 22 0.0 46031.3 1.0X
+UTF8_LCASE 9510 9518 11 0.0 95097.7 2.1X
+UNICODE 135718 135786 97 0.0 1357176.2 29.5X
+UNICODE_CI 113715 113819 148 0.0 1137145.8 24.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1843013932

## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ##
@@ -1,54 +1,88 @@
-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 --
-UTF8_BINARY 1349 1349 0 0.1 13485.4 1.0X
-UTF8_LCASE 3559 3561 3 0.0 35594.3 2.6X
-UNICODE 17580 17589 12 0.0 175803.6 13.0X
-UNICODE_CI 17210 17212 2 0.0 172100.2 12.8X
+UTF8_BINARY 2220 2223 5 0.0 22197.0 1.0X
+UTF8_LCASE 4949 4950 2 0.0 49488.1 2.2X
+UNICODE 28172 28198 36 0.0 281721.0 12.7X
+UNICODE_CI 28233 28308 106 0.0 282328.2 12.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 ---
-UTF8_BINARY 1740 1741 1 0.1 17398.8 1.0X
-UTF8_LCASE 2630 2632 3 0.0 26301.0 1.5X
-UNICODE 16732 16743 16 0.0 167319.7 9.6X
-UNICODE_CI 16482 16492 14 0.0 164819.7 9.5X
+UTF8_BINARY 2731 2733 2 0.0 27313.6 1.0X
+UTF8_LCASE 4611 4619 11 0.0 46111.4 1.7X
+UNICODE 28149 28211 88 0.0 281486.8 10.3X
+UNICODE_CI 27535 27597 89 0.0 275348.4 10.1X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
-UTF8_BINARY 2808 2808 0 0.0 28082.3 1.0X
-UTF8_LCASE 5412 5413 1 0.0 54123.5 1.9X
-UNICODE 70755 70787 44 0.0 707553.4 25.2X
-UNICODE_CI 57639 57669 43 0.0 576390.0 20.5X
+UTF8_BINARY 4603 4618 22 0.0 46031.3 1.0X
+UTF8_LCASE 9510 9518 11 0.0 95097.7 2.1X
+UNICODE 135718 135786 97 0.0 1357176.2 29.5X
+UNICODE_CI 113715 113819 148 0.0 1137145.8 24.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1840658353

## sql/core/benchmarks/CollationBenchmark-jdk21-results.txt: ##
@@ -1,54 +1,88 @@
-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 --
-UTF8_BINARY 1349 1349 0 0.1 13485.4 1.0X
-UTF8_LCASE 3559 3561 3 0.0 35594.3 2.6X
-UNICODE 17580 17589 12 0.0 175803.6 13.0X
-UNICODE_CI 17210 17212 2 0.0 172100.2 12.8X
+UTF8_BINARY 2220 2223 5 0.0 22197.0 1.0X
+UTF8_LCASE 4949 4950 2 0.0 49488.1 2.2X
+UNICODE 28172 28198 36 0.0 281721.0 12.7X
+UNICODE_CI 28233 28308 106 0.0 282328.2 12.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
 ---
-UTF8_BINARY 1740 1741 1 0.1 17398.8 1.0X
-UTF8_LCASE 2630 2632 3 0.0 26301.0 1.5X
-UNICODE 16732 16743 16 0.0 167319.7 9.6X
-UNICODE_CI 16482 16492 14 0.0 164819.7 9.5X
+UTF8_BINARY 2731 2733 2 0.0 27313.6 1.0X
+UTF8_LCASE 4611 4619 11 0.0 46111.4 1.7X
+UNICODE 28149 28211 88 0.0 281486.8 10.3X
+UNICODE_CI 27535 27597 89 0.0 275348.4 10.1X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
-UTF8_BINARY 2808 2808 0 0.0 28082.3 1.0X
-UTF8_LCASE 5412 5413 1 0.0 54123.5 1.9X
-UNICODE 70755 70787 44 0.0 707553.4 25.2X
-UNICODE_CI 57639 57669 43 0.0 576390.0 20.5X
+UTF8_BINARY 4603 4618 22 0.0 46031.3 1.0X
+UTF8_LCASE 9510 9518 11 0.0 95097.7 2.1X
+UNICODE 135718 135786 97 0.0 1357176.2 29.5X
+UNICODE_CI 113715 113819 148 0.0 1137145.8 24.7X

-OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-aws
+Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on code in PR #48501:
URL: https://github.com/apache/spark/pull/48501#discussion_r1839787864

## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ##
@@ -185,6 +185,49 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
     }
     benchmark.run(relativeTime = true)
   }
+
+  def benchmarkInitCap(
+      collationTypes: Seq[String],
+      utf8Strings: Seq[UTF8String]): Unit = {
+    type collationType = Int
+    type InitCapEstimator = (UTF8String, collationType) => Unit
+    def skipCollationTypeFilter: Any => Boolean = _ => true
+    def createBenchmark(
+      implName: String,
+      impl: InitCapEstimator,
+      collationTypeFilter: String => Boolean): Unit = {

Review Comment:
Could you fix the indentation here?
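For context on the indentation request, here is a small compilable sketch of the layout that Spark code generally follows for multi-line method signatures: parameters on continuation lines are indented four spaces relative to `def`, and the body two spaces. The object name, the simplified `String`-based estimator type, and the collation list are assumptions made so the example stands alone; they are not the code from the PR.

```scala
object IndentationSketch {
  // Simplified stand-in: String instead of UTF8String, for self-containment.
  type InitCapEstimator = (String, Int) => Unit

  // Parameters: 4 spaces deeper than `def`. Body: 2 spaces deeper than `def`.
  def createBenchmark(
      implName: String,
      impl: InitCapEstimator,
      collationTypeFilter: String => Boolean): Unit = {
    // Hypothetical collation names, used only to exercise the filter.
    val collationTypes = Seq("UTF8_BINARY", "UNICODE")
    collationTypes.filter(collationTypeFilter).foreach { name =>
      impl(name, 0)
    }
  }

  def main(args: Array[String]): Unit = {
    createBenchmark("demo", (name, id) => println(s"$name -> $id"), _ => true)
  }
}
```

The extra two spaces on the parameter lines keep the signature visually distinct from the first statement of the body.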
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2471613278 > @mrk-andreev Could you integrate your benchmark into CollationBenchmark, please, as @uros-db pointed out https://github.com/apache/spark/pull/48501#pullrequestreview-2385543767. Otherwise we might forget to re-run your benchmark while benchmarking collation-related code. @MaxGekk, done.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2471612662 > @mrk-andreev Could you integrate your benchmark into CollationBenchmark, please, as @uros-db pointed out https://github.com/apache/spark/pull/48501#pullrequestreview-2385543767. Otherwise we might forget to re-run your benchmark while benchmarking collation-related code. Done.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2463069933 @mrk-andreev Could you integrate your benchmark into `CollationBenchmark`, please, as @uros-db pointed out https://github.com/apache/spark/pull/48501#pullrequestreview-2385543767. Otherwise we might forget to re-run your benchmark while benchmarking collation-related code.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501: URL: https://github.com/apache/spark/pull/48501#discussion_r1827021066 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala: ## @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.catalyst.util.CollationSupport.InitCap +import org.apache.spark.unsafe.types.UTF8String + +/** + * A benchmark that compares the performance of different ways to evaluate SQL initcap expressions. + * + * Specifically, this class compares the execICU, execBinaryICU, execBinary, execLowercase + * approaches. This class compares for string of different lengths with different words count. + * + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class + *--jars , + * 2. build/sbt "sql/Test/runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain " + * Results will be written to "benchmarks/InitCapBenchmark-results.txt". 
+ * }}} + */ +object InitCapBenchmark extends BenchmarkBase { + override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { +def generateString(wordsCount: Int, wordLen: Int, firstLetterUpper: Boolean): UTF8String = { + val sb = new StringBuilder(wordsCount * wordLen + wordLen) + for (_ <- 0 until wordsCount) { +for (pos <- 0 until wordLen) { + if (pos == 0 && firstLetterUpper) { +sb.append("X") + } else { +sb.append("x") + } +} +sb.append(" ") + } + UTF8String.fromString(sb.toString()) +} + +def addCases(benchmark: Benchmark, + text: UTF8String): Unit = { + // collation that contains collator + val collationId = CollationFactory.collationNameToId("he_ISR") Review Comment: Updated - `InitCapBenchmark-results.txt` - `InitCapBenchmark-jdk21-results.txt`
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501: URL: https://github.com/apache/spark/pull/48501#discussion_r1821582236 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala: ## @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.catalyst.util.CollationSupport.InitCap +import org.apache.spark.unsafe.types.UTF8String + +/** + * A benchmark that compares the performance of different ways to evaluate SQL initcap expressions. + * + * Specifically, this class compares the execICU, execBinaryICU, execBinary, execLowercase + * approaches. This class compares for string of different lengths with different words count. + * + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class + *--jars , + * 2. build/sbt "sql/Test/runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain " + * Results will be written to "benchmarks/InitCapBenchmark-results.txt". 
+ * }}} + */ +object InitCapBenchmark extends BenchmarkBase { + override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { +def generateString(wordsCount: Int, wordLen: Int, firstLetterUpper: Boolean): UTF8String = { + val sb = new StringBuilder(wordsCount * wordLen + wordLen) + for (_ <- 0 until wordsCount) { +for (pos <- 0 until wordLen) { + if (pos == 0 && firstLetterUpper) { +sb.append("X") + } else { +sb.append("x") + } +} +sb.append(" ") + } + UTF8String.fromString(sb.toString()) +} + +def addCases(benchmark: Benchmark, + text: UTF8String): Unit = { Review Comment: Fixed
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501: URL: https://github.com/apache/spark/pull/48501#discussion_r1821583528 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala: ## @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.catalyst.util.CollationSupport.InitCap +import org.apache.spark.unsafe.types.UTF8String + +/** + * A benchmark that compares the performance of different ways to evaluate SQL initcap expressions. + * + * Specifically, this class compares the execICU, execBinaryICU, execBinary, execLowercase + * approaches. This class compares for string of different lengths with different words count. + * + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class + *--jars , + * 2. build/sbt "sql/Test/runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain " + * Results will be written to "benchmarks/InitCapBenchmark-results.txt". 
+ * }}} + */ +object InitCapBenchmark extends BenchmarkBase { + override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { +def generateString(wordsCount: Int, wordLen: Int, firstLetterUpper: Boolean): UTF8String = { + val sb = new StringBuilder(wordsCount * wordLen + wordLen) + for (_ <- 0 until wordsCount) { +for (pos <- 0 until wordLen) { + if (pos == 0 && firstLetterUpper) { +sb.append("X") + } else { +sb.append("x") + } +} +sb.append(" ") + } + UTF8String.fromString(sb.toString()) +} + +def addCases(benchmark: Benchmark, + text: UTF8String): Unit = { + // collation that contains collator + val collationId = CollationFactory.collationNameToId("he_ISR") Review Comment: Extended with ```java for (collationName <- List("he_ISR", "UNICODE", "UNICODE_CI")) { val collationId = CollationFactory.collationNameToId(collationName) assert(CollationFactory.fetchCollation(collationId).collator != null) val caseName = s"execICU[collationName=${collationName}]" benchmark.addCase(caseName)(_ => InitCap.execICU(text, collationId)) } ``` The primary requirement for `collationId` in `InitCap.execICU` is that `CollationFactory.fetchCollation(collationId).collator` must not be null; otherwise, the function will throw an NPE. ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala: ## @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.catalyst.util.CollationSupport.InitCap +import org.apache.spark.unsafe.types.UTF8String + +/** + * A benchmark that compares the performance of different ways to evaluate SQL initcap expressions. + * + * Specifically, this class compares the execICU, execBinaryICU, execBinary, execLowercase + * approaches. This class compares for string of different lengths with different words count. + * + * To run this benchmark: + * {{{ + *
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2445410112 TODO: Update benchmark outputs. However, I'll wait for additional comments since reevaluation may take some time.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501: URL: https://github.com/apache/spark/pull/48501#discussion_r1821583528 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala: ## @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.catalyst.util.CollationSupport.InitCap +import org.apache.spark.unsafe.types.UTF8String + +/** + * A benchmark that compares the performance of different ways to evaluate SQL initcap expressions. + * + * Specifically, this class compares the execICU, execBinaryICU, execBinary, execLowercase + * approaches. This class compares for string of different lengths with different words count. + * + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class + *--jars , + * 2. build/sbt "sql/Test/runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain " + * Results will be written to "benchmarks/InitCapBenchmark-results.txt". 
+ * }}} + */ +object InitCapBenchmark extends BenchmarkBase { + override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { +def generateString(wordsCount: Int, wordLen: Int, firstLetterUpper: Boolean): UTF8String = { + val sb = new StringBuilder(wordsCount * wordLen + wordLen) + for (_ <- 0 until wordsCount) { +for (pos <- 0 until wordLen) { + if (pos == 0 && firstLetterUpper) { +sb.append("X") + } else { +sb.append("x") + } +} +sb.append(" ") + } + UTF8String.fromString(sb.toString()) +} + +def addCases(benchmark: Benchmark, + text: UTF8String): Unit = { + // collation that contains collator + val collationId = CollationFactory.collationNameToId("he_ISR") Review Comment: Extended with ``` for (collationName <- List("he_ISR", "UNICODE", "UNICODE_CI")) { val collationId = CollationFactory.collationNameToId(collationName) assert(CollationFactory.fetchCollation(collationId).collator != null) val caseName = s"execICU[collationName=${collationName}]" benchmark.addCase(caseName)(_ => InitCap.execICU(text, collationId)) } ``` The primary requirement for collationId in InitCap.execICU is that CollationFactory.fetchCollation(collationId).collator must not be null; otherwise, the function will throw an NPE.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on code in PR #48501: URL: https://github.com/apache/spark/pull/48501#discussion_r1817929449 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala: ## @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.catalyst.util.CollationSupport.InitCap +import org.apache.spark.unsafe.types.UTF8String + +/** + * A benchmark that compares the performance of different ways to evaluate SQL initcap expressions. + * + * Specifically, this class compares the execICU, execBinaryICU, execBinary, execLowercase + * approaches. This class compares for string of different lengths with different words count. + * + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class + *--jars , + * 2. build/sbt "sql/Test/runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain " + * Results will be written to "benchmarks/InitCapBenchmark-results.txt". 
+ * }}} + */ +object InitCapBenchmark extends BenchmarkBase { + override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { +def generateString(wordsCount: Int, wordLen: Int, firstLetterUpper: Boolean): UTF8String = { + val sb = new StringBuilder(wordsCount * wordLen + wordLen) + for (_ <- 0 until wordsCount) { +for (pos <- 0 until wordLen) { + if (pos == 0 && firstLetterUpper) { +sb.append("X") + } else { +sb.append("x") + } +} +sb.append(" ") + } + UTF8String.fromString(sb.toString()) +} + +def addCases(benchmark: Benchmark, + text: UTF8String): Unit = { + // collation that contains collator + val collationId = CollationFactory.collationNameToId("he_ISR") Review Comment: Could you benchmark more collations, see https://github.com/apache/spark/blob/9909817aef9198fd88058b4c8fec292de2797b8d/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala#L27 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala: ## @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.sql.execution.benchmark + +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.catalyst.util.CollationFactory +import org.apache.spark.sql.catalyst.util.CollationSupport.InitCap +import org.apache.spark.unsafe.types.UTF8String + +/** + * A benchmark that compares the performance of different ways to evaluate SQL initcap expressions. + * + * Specifically, this class compares the execICU, execBinaryICU, execBinary, execLowercase + * approaches. This class compares for string of different lengths with different words count. + * + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class + *--jars , + * 2. build/sbt "sql/Test/runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain " + * Results will be written to "benchmarks/InitCapBenchmark-results.txt". + * }}} + */ +object InitCapBenchmark extends BenchmarkBase { + override def runBenchmarkSuite(mainArgs: Arr
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2429155377 @uros-db @mihailom-db @viktorluc-db Could you review this PR, please.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on code in PR #48501: URL: https://github.com/apache/spark/pull/48501#discussion_r1809129420 ## sql/core/benchmarks/InitCapBenchmark-results.txt: ## @@ -0,0 +1,168 @@ + +[wc=1, wl=1, capitalized=true] + + +OpenJDK 64-Bit Server VM 17.0.11+10-LTS on Linux 5.15.0-122-generic +Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz +InitCap evaluation [wc=1, wl=1, capitalized=true]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +- +execICU 0 0 0 371177345.1 0.0 1.0X Review Comment: I adjusted the word count for my `Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz`, but encountered issues with local evaluation. This led to a remote evaluation on an `Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz`, where the performance was noticeably less impressive.
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
MaxGekk commented on code in PR #48501: URL: https://github.com/apache/spark/pull/48501#discussion_r1807893787 ## sql/core/benchmarks/InitCapBenchmark-results.txt: ## @@ -0,0 +1,168 @@ + +[wc=1, wl=1, capitalized=true] + + +OpenJDK 64-Bit Server VM 17.0.11+10-LTS on Linux 5.15.0-122-generic +Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz +InitCap evaluation [wc=1, wl=1, capitalized=true]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative +- +execICU 0 0 0 371177345.1 0.0 1.0X Review Comment: Let's bump the number of iterations to see seconds in Best/Avg Time.
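One way to act on this suggestion, sketched under assumptions: when a single call finishes in well under a millisecond, repeat it a fixed number of times inside each measured case so the reported time leaves the sub-millisecond range. The sketch uses plain `System.nanoTime` and a stand-in workload; `initCapLike`, `timeBatchMs`, and `batchSize` are hypothetical names and are neither Spark's `Benchmark` class nor its `InitCap` implementation.

```scala
object BatchedTiming {
  // Stand-in workload (hypothetical, not Spark's InitCap):
  // upper-case the first letter of each space-separated word, lower-case the rest.
  def initCapLike(s: String): String =
    s.split(" ")
      .map(w => if (w.isEmpty) w else w.head.toUpper.toString + w.tail.toLowerCase)
      .mkString(" ")

  // Evaluates `body` batchSize times and returns the total elapsed wall-clock time in ms.
  def timeBatchMs(batchSize: Int)(body: => Any): Double = {
    var sink = 0 // fold results into a value so the work is not trivially discarded
    val start = System.nanoTime()
    var i = 0
    while (i < batchSize) {
      sink ^= body.hashCode()
      i += 1
    }
    val elapsedNs = System.nanoTime() - start
    if (sink == 42) println(sink) // keeps `sink` observable
    elapsedNs / 1e6
  }

  def main(args: Array[String]): Unit = {
    // 1000 words of 16 lower-case letters, similar in shape to wc=1000, wl=16
    val text = Seq.fill(1000)("x" * 16).mkString(" ")
    val single = timeBatchMs(1)(initCapLike(text))
    val batched = timeBatchMs(1000)(initCapLike(text))
    println(f"1 call: $single%.3f ms, 1000 calls: $batched%.3f ms")
  }
}
```

Batching keeps the comparison between cases unchanged as long as every case uses the same `batchSize`; only the absolute numbers scale.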
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2423747579 > Let's place the benchmark at the SQL level so far. Done > Can we include the benchmark result files too? Done
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
HyukjinKwon commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2418241753 Can we include the benchmark result files too? See also "Testing with GitHub Actions workflow" at https://spark.apache.org/developer-tools.html
Re: [PR] [SPARK-49490][SQL] Add benchmarks for initCap [spark]
mrk-andreev commented on PR #48501: URL: https://github.com/apache/spark/pull/48501#issuecomment-2417414974

Results of local run: [InitCapBenchmark-local.txt](https://github.com/user-attachments/files/17399973/InitCapBenchmark-local.txt)

## Sample

```
Running benchmark: InitCap evaluation [wc=1000, wl=16, capitalized=false]
  Running case: execICU
  Stopped after 8978 iterations, 2000 ms
  Running case: execBinaryICU
  Stopped after 6235 iterations, 2000 ms
  Running case: execBinary
  Stopped after 28374 iterations, 2000 ms
  Running case: execLowercase
  Stopped after 8839 iterations, 2000 ms

OpenJDK 64-Bit Server VM 17.0.2+8-86 on Linux 5.15.0-122-generic
Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
InitCap evaluation [wc=1000, wl=16, capitalized=false]:   Best Time(ms)   Avg Time(ms)      Stdev(ms)      Rate(M/s)    Per Row(ns)       Relative
--------------------------------------------------------------------------------------------------------------------------------------------------
execICU                                                               0              0              0       432768.3            0.0           1.0X
execBinaryICU                                                         0              0              0       285450.1            0.0           0.7X
execBinary                                                            0              0              0      1494256.8            0.0           3.5X
execLowercase                                                         0              0              0       415082.4            0.0           1.0X
```

## Open questions

1. Should we place the benchmark code in the same package, 'unsafe', or at the 'SQL level'? If it's in 'unsafe', should we extract the shared code for benchmarks into a shared library?
2. The benchmark output expects each measurement to be at least 1 ms, but this isn't the case here. Should we align the rounding to the first non-zero digit after the decimal point?
3. How detailed do we expect the benchmarks to be? Do we want different axes of variation, or should we stick to defaults like parameters?
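The zeros in the sample are what one would expect if the time columns are printed as whole milliseconds. As a rough sketch of how the columns relate, assume that `Per Row(ns)` is the best time divided by the row count, that `Rate(M/s)` is millions of rows per second, and that the "Relative time" column of the CollationBenchmark tables quoted elsewhere in this thread divides each case's best time by the first case's. The helper names `perRowNs`, `rateMPerSec`, and `relativeTime` are hypothetical and not part of Spark's `Benchmark` API, and the row count of 100,000 is inferred from the quoted `Per Row(ns)` values rather than stated in the thread.

```scala
object BenchmarkColumns {
  // Nanoseconds spent per row, given the best run time in milliseconds.
  def perRowNs(bestMs: Double, rows: Long): Double = bestMs * 1e6 / rows

  // Throughput in millions of rows per second.
  def rateMPerSec(bestMs: Double, rows: Long): Double = rows / (bestMs * 1e3)

  // "Relative time": larger means slower than the baseline case.
  def relativeTime(bestMs: Double, baselineBestMs: Double): Double = bestMs / baselineBestMs

  def main(args: Array[String]): Unit = {
    val rows = 100000L // assumption inferred from the quoted Per Row values
    // UTF8_BINARY and UTF8_LCASE best times from the equalsFunction table (2220 ms, 4949 ms)
    println(f"${perRowNs(2220, rows)}%.1f ns per row")
    println(f"${rateMPerSec(2220, rows)}%.3f M rows/s")
    println(f"${relativeTime(4949, 2220)}%.1fX")
    // A sub-millisecond measurement printed with zero decimals loses all information.
    println(f"${0.4}%.0f ms")
  }
}
```

Under these assumptions a case has to run for at least about a millisecond before the Best and Avg columns carry any signal, which is why bumping the iteration count is the simpler fix compared with changing the rounding.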