[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22845 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229217228

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
    @@ -16,30 +16,31 @@
      */
     package org.apache.spark.sql.execution.datasources.csv

    -import org.apache.spark.SparkConf
     import org.apache.spark.benchmark.Benchmark
    -import org.apache.spark.sql.{Column, Row, SparkSession}
    -import org.apache.spark.sql.catalyst.plans.SQLHelper
    +import org.apache.spark.sql.{Column, Row}
    +import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
     import org.apache.spark.sql.functions.lit
     import org.apache.spark.sql.types._

     /**
      * Benchmark to measure CSV read/write performance.
    - * To run this:
    - *  spark-submit --class --jars
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class --jars ,
    + *
    + *   2. build/sbt "sql/test:runMain "
    + *   3. generate result:
    + *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
    + *      Results will be written to "benchmarks/CSVBenchmarks-results.txt".
    + * }}}
      */
    -object CSVBenchmarks extends SQLHelper {
    -  val conf = new SparkConf()
    -
    -  val spark = SparkSession.builder
    -    .master("local[1]")
    -    .appName("benchmark-csv-datasource")
    -    .config(conf)
    -    .getOrCreate()
    +
    +object CSVBenchmarks extends SqlBasedBenchmark {
    --- End diff --

    @heary-cao . Could you rename the files?
    - `CSVBenchmarks.scala` -> `CSVBenchmark.scala`
    - `CSVBenchmarks-results.txt` -> `CSVBenchmark-results.txt`
    - [Line 35](https://github.com/apache/spark/pull/22845/files#diff-985fa5181f2aec4df39324995590ea83R35) should be changed together from `benchmarks/CSVBenchmarks-results.txt` to `benchmarks/CSVBenchmark-results.txt`.
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229216166

    --- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
    @@ -0,0 +1,27 @@
    +
    +Benchmark to measure CSV read/write performance
    +
    +
    +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +
    +One quoted string                           64733 / 64839          0.0     1294653.1       1.0X
    +
    +OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
    +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
    +Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +
    +Select 1000 columns                       185609 / 189735          0.0      185608.6       1.0X
    +Select 100 columns                          50195 / 51808          0.0       50194.8       3.7X
    +Select one column                           39266 / 39293          0.0       39265.6       4.7X
    +count()                                     10959 / 11000          0.1       10958.5      16.9X
    --- End diff --

    In this case, the ratio change seems to be due to the improvement on `count()`. cc @HyukjinKwon .
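The ratios discussed in the comment above can be checked against the Per Row(ns) column. The following is a minimal sketch, assuming that Relative is the first case's per-row time divided by each case's per-row time (an assumption that happens to match the figures quoted in the diff). `RelativeColumnSketch` and `relative` are hypothetical names for illustration and are not part of Spark's `Benchmark` class.

```scala
import java.util.Locale

// Hypothetical helper, not part of Spark: recomputes the "Relative" column
// from the "Per Row(ns)" values quoted in the diff above.
object RelativeColumnSketch {
  // Per-row times (ns) copied from the "Wide rows with 1000 columns" table.
  val perRowNs: Seq[(String, Double)] = Seq(
    "Select 1000 columns" -> 185608.6,
    "Select 100 columns" -> 50194.8,
    "Select one column" -> 39265.6,
    "count()" -> 10958.5
  )

  // Assumption: the first case is the baseline (1.0X) and every other case
  // is reported as baseline time divided by its own time.
  def relative(cases: Seq[(String, Double)]): Seq[(String, String)] = {
    val baseline = cases.head._2
    cases.map { case (name, ns) =>
      // Locale.ROOT keeps the decimal separator as '.' on every machine.
      name -> String.format(Locale.ROOT, "%.1fX", Double.box(baseline / ns))
    }
  }

  def main(args: Array[String]): Unit =
    relative(perRowNs).foreach { case (name, rel) => println(s"$name: $rel") }
}
```

Because every ratio is taken against the first row, a faster `count()` raises its own Relative value even when the other cases are unchanged.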
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229201094

    --- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
    @@ -0,0 +1,27 @@
    +
    +Benchmark to measure CSV read/write performance
    +
    +
    +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
    +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
    --- End diff --

    I made a PR to you. Could you review and merge https://github.com/heary-cao/spark/pull/2 ?
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229199547

    --- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
    @@ -0,0 +1,27 @@
    +
    +Benchmark to measure CSV read/write performance
    +
    +
    +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
    +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
    --- End diff --

    This seems to be a limitation in the Spark benchmark code itself.
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229199434

    --- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
    @@ -0,0 +1,27 @@
    +
    +Benchmark to measure CSV read/write performance
    +
    +
    +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
    +Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
    --- End diff --

    Actually, `GHz` is missing here. So, it's hard to figure out what CPU is used here.
    ```
    Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz [Family 6 Model 94 Stepping 3]
    Intel(R) Core(TM) i7-6700T CPU @ 2.80GHz [Family 6 Model 94 Stepping 3]
    Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz [Family 6 Model 94 Stepping 3]
    Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz [Family 6 Model 94 Stepping 3]
    ```
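The ambiguity raised in the comment above can be made concrete with a small lookup. This is only an illustrative sketch built from the four CPUs listed in that comment. `CpuSignatureSketch`, `Cpu`, and `candidates` are hypothetical names, and none of this is Spark code.

```scala
// Hypothetical illustration, not Spark code: the four CPUs listed in the
// review comment share one family/model/stepping signature, so the
// signature alone cannot identify the CPU or its clock speed.
object CpuSignatureSketch {
  final case class Cpu(brand: String, signature: String)

  // Entries copied from the list in the review comment.
  val listed: Seq[Cpu] = Seq(
    Cpu("Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz", "Family 6 Model 94 Stepping 3"),
    Cpu("Intel(R) Core(TM) i7-6700T CPU @ 2.80GHz", "Family 6 Model 94 Stepping 3"),
    Cpu("Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz", "Family 6 Model 94 Stepping 3"),
    Cpu("Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz", "Family 6 Model 94 Stepping 3")
  )

  // Returns every listed brand string that matches the given signature.
  def candidates(signature: String): Seq[String] =
    listed.filter(_.signature == signature).map(_.brand)

  def main(args: Array[String]): Unit =
    candidates("Family 6 Model 94 Stepping 3").foreach(println)
}
```

All four entries match the signature reported in the results file, which is why a header without the brand string and `GHz` value is hard to compare against other runs.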
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229190198

    --- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
    @@ -0,0 +1,27 @@
    +
    +Benchmark to measure CSV read/write performance
    +
    +
    +OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
    --- End diff --

    Wow. Did you run this on Windows 7?
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229019879

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
    @@ -137,22 +124,15 @@ object CSVBenchmarks extends SQLHelper {
             ds.count()
           }

    -      /*
    -      Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    -
    -      Count a dataset with 10 columns:      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -
    -
    -      Select 10 columns + count()              12598 / 12740          0.8        1259.8       1.0X
    -      Select 1 column + count()                  7960 / 8175          1.3         796.0       1.6X
    -      count()                                    2332 / 2386          4.3         233.2       5.4X
    -      */
           benchmark.run()
         }
       }

    -  def main(args: Array[String]): Unit = {
    -    quotedValuesBenchmark(rowsNum = 50 * 1000, numIters = 3)
    -    multiColumnsBenchmark(rowsNum = 1000 * 1000)
    -    countBenchmark(10 * 1000 * 1000)
    +  override def runBenchmarkSuite(): Unit = {
    --- End diff --

    +1 for @yucai 's comment.
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r229011040

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
    @@ -137,22 +124,15 @@ object CSVBenchmarks extends SQLHelper {
             ds.count()
           }

    -      /*
    -      Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    -
    -      Count a dataset with 10 columns:      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -
    -
    -      Select 10 columns + count()              12598 / 12740          0.8        1259.8       1.0X
    -      Select 1 column + count()                  7960 / 8175          1.3         796.0       1.6X
    -      count()                                    2332 / 2386          4.3         233.2       5.4X
    -      */
           benchmark.run()
         }
       }

    -  def main(args: Array[String]): Unit = {
    -    quotedValuesBenchmark(rowsNum = 50 * 1000, numIters = 3)
    -    multiColumnsBenchmark(rowsNum = 1000 * 1000)
    -    countBenchmark(10 * 1000 * 1000)
    +  override def runBenchmarkSuite(): Unit = {
    --- End diff --

    #22872 has updated `runBenchmarkSuite`'s signature.
    ```suggestion
      override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    ```
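For readers without the Spark sources at hand, the shape of the suggested change can be sketched as follows. This is a minimal sketch, not the PR's code: `BenchmarkBaseSketch` is a hypothetical stand-in for Spark's benchmark base class, and the assumption that `main` forwards its arguments to `runBenchmarkSuite` is made only for illustration. The three case names are taken from the removed `main` method in the diff above.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's benchmark base class. Only the
// runBenchmarkSuite(mainArgs) signature comes from the review comment.
trait BenchmarkBaseSketch {
  def runBenchmarkSuite(mainArgs: Array[String]): Unit

  // Assumption: the base class owns main and forwards the arguments.
  def main(args: Array[String]): Unit = runBenchmarkSuite(args)
}

object CsvBenchmarkSketch extends BenchmarkBaseSketch {
  // Records which cases ran so the flow can be inspected without Spark.
  val ran: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Placeholders for the real benchmark cases named in the diff.
  private def quotedValuesBenchmark(rowsNum: Int, numIters: Int): Unit =
    ran += s"quotedValuesBenchmark($rowsNum, $numIters)"
  private def multiColumnsBenchmark(rowsNum: Int): Unit =
    ran += s"multiColumnsBenchmark($rowsNum)"
  private def countBenchmark(rowsNum: Int): Unit =
    ran += s"countBenchmark($rowsNum)"

  // The object no longer defines main; it overrides the suite hook instead.
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    quotedValuesBenchmark(rowsNum = 50 * 1000, numIters = 3)
    multiColumnsBenchmark(rowsNum = 1000 * 1000)
    countBenchmark(10 * 1000 * 1000)
  }
}
```

Calling `CsvBenchmarkSketch.main(Array.empty)` runs the three placeholder cases in order, mirroring how the benchmark object would be launched through the inherited entry point.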
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r228831997

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
    @@ -16,30 +16,30 @@
      */
     package org.apache.spark.sql.execution.datasources.csv

    -import org.apache.spark.SparkConf
     import org.apache.spark.benchmark.Benchmark
    -import org.apache.spark.sql.{Column, Row, SparkSession}
    -import org.apache.spark.sql.catalyst.plans.SQLHelper
    +import org.apache.spark.sql.{Column, Row}
    +import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
     import org.apache.spark.sql.functions.lit
     import org.apache.spark.sql.types._

     /**
      * Benchmark to measure CSV read/write performance.
    - * To run this:
    - *  spark-submit --class --jars
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class --jars
    --- End diff --

    Also update the usage in description:
    ```console
    bin/spark-submit --class org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
    ```
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22845#discussion_r228831306

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
    @@ -16,30 +16,30 @@
      */
     package org.apache.spark.sql.execution.datasources.csv

    -import org.apache.spark.SparkConf
     import org.apache.spark.benchmark.Benchmark
    -import org.apache.spark.sql.{Column, Row, SparkSession}
    -import org.apache.spark.sql.catalyst.plans.SQLHelper
    +import org.apache.spark.sql.{Column, Row}
    +import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
     import org.apache.spark.sql.functions.lit
     import org.apache.spark.sql.types._

     /**
      * Benchmark to measure CSV read/write performance.
    - * To run this:
    - *  spark-submit --class --jars
    + * To run this benchmark:
    + * {{{
    + *   1. without sbt:
    + *      bin/spark-submit --class --jars
    --- End diff --

    Please update `without sbt` usage to:
    ```
    bin/spark-submit --class --jars ,
    ```
[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...
GitHub user heary-cao opened a pull request:

    https://github.com/apache/spark/pull/22845

    [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use main method

    ## What changes were proposed in this pull request?

    use spark-submit:
    bin/spark-submit --class org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

    Generate benchmark result:
    SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"

    ## How was this patch tested?

    manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/heary-cao/spark CSVBenchmarks

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22845

commit 9ddb8476544fa34b15fbe15387e1b4983d4d76d4
Author: caoxuewen
Date:   2018-10-26T04:07:48Z

    Refactor CSVBenchmarks to use main method