[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22845


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229217228
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
@@ -16,30 +16,31 @@
  */
 package org.apache.spark.sql.execution.datasources.csv
 
-import org.apache.spark.SparkConf
 import org.apache.spark.benchmark.Benchmark
-import org.apache.spark.sql.{Column, Row, SparkSession}
-import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.{Column, Row}
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
 import org.apache.spark.sql.functions.lit
 import org.apache.spark.sql.types._
 
 /**
  * Benchmark to measure CSV read/write performance.
- * To run this:
- *  spark-submit --class  --jars 
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars ,
+ *
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result:
+ *  SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *  Results will be written to "benchmarks/CSVBenchmarks-results.txt".
+ * }}}
  */
-object CSVBenchmarks extends SQLHelper {
-  val conf = new SparkConf()
-
-  val spark = SparkSession.builder
-    .master("local[1]")
-    .appName("benchmark-csv-datasource")
-    .config(conf)
-    .getOrCreate()
+
+object CSVBenchmarks extends SqlBasedBenchmark {
--- End diff --

@heary-cao, could you rename the files?
- `CSVBenchmarks.scala` -> `CSVBenchmark.scala`
- `CSVBenchmarks-results.txt` -> `CSVBenchmark-results.txt`
- [Line 35](https://github.com/apache/spark/pull/22845/files#diff-985fa5181f2aec4df39324995590ea83R35) should be updated at the same time, from `benchmarks/CSVBenchmarks-results.txt` to `benchmarks/CSVBenchmark-results.txt`.
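
The doc comment has to change with the rename because the results path tracks the benchmark's name. A minimal sketch of that coupling, assuming the path is built from the object's class name (an illustration only, not Spark's actual code); `ResultNaming`, `resultPath`, and the empty stand-in object `CSVBenchmark` are hypothetical:

```scala
object ResultNaming {
  // Hypothetical helper: build "benchmarks/<Name>-results.txt" from an object's class.
  // Scala compiles `object Foo` to a class named `Foo$`, so the trailing `$` is dropped,
  // and any package or enclosing-class prefix is removed.
  def resultPath(benchmark: AnyRef): String = {
    val name = benchmark.getClass.getName.stripSuffix("$").split("[.$]").last
    s"benchmarks/$name-results.txt"
  }
}

// Stand-in for the renamed benchmark object; it is empty on purpose.
object CSVBenchmark

object ResultNamingDemo {
  def main(args: Array[String]): Unit = {
    // prints benchmarks/CSVBenchmark-results.txt
    println(ResultNaming.resultPath(CSVBenchmark))
  }
}
```

Under this assumption, renaming the object alone changes the generated file name, so a doc comment that still says `CSVBenchmarks-results.txt` would point at a file that is no longer written.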





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229216166
  
--- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
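@@ -0,0 +1,27 @@
+================================================================================================
+Benchmark to measure CSV read/write performance
+================================================================================================
+
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+One quoted string                           64733 / 64839          0.0     1294653.1       1.0X
+
+OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+Select 1000 columns                       185609 / 189735          0.0      185608.6       1.0X
+Select 100 columns                          50195 / 51808          0.0       50194.8       3.7X
+Select one column                           39266 / 39293          0.0       39265.6       4.7X
+count()                                     10959 / 11000          0.1       10958.5      16.9X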
--- End diff --

In this case, the change in the ratios seems to be due to the improvement in `count()`. cc @HyukjinKwon.
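
The `Relative` column can be checked by hand. A small sketch, assuming each value is the first row's best time divided by that row's best time (an assumption, though it matches the numbers quoted above); `RelativeCheck` and its members are hypothetical names:

```scala
object RelativeCheck {
  // Best times in ms, copied from the "Wide rows with 1000 columns" table.
  val bestMs: Seq[(String, Double)] = Seq(
    "Select 1000 columns" -> 185609.0,
    "Select 100 columns" -> 50195.0,
    "Select one column" -> 39266.0,
    "count()" -> 10959.0
  )

  // Relative speed of each case against the first (baseline) case.
  def relative(rows: Seq[(String, Double)]): Seq[(String, Double)] = {
    val baseline = rows.head._2
    rows.map { case (name, ms) => name -> baseline / ms }
  }

  def main(args: Array[String]): Unit = {
    // Print one line per case, with the ratio rounded to one decimal place.
    relative(bestMs).foreach { case (name, r) => println(f"$name%-22s $r%.1fX") }
  }
}
```

Rounded to one decimal, the ratios are 1.0, 3.7, 4.7, and 16.9, matching the table. Because every ratio is taken against the first row, a faster `count()` raises only its own ratio.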





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229201094
  
--- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
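@@ -0,0 +1,27 @@
+================================================================================================
+Benchmark to measure CSV read/write performance
+================================================================================================
+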
+OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
+Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
--- End diff --

I opened a PR against your fork. Could you review and merge https://github.com/heary-cao/spark/pull/2?





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229199547
  
--- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
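@@ -0,0 +1,27 @@
+================================================================================================
+Benchmark to measure CSV read/write performance
+================================================================================================
+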
+OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
+Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
--- End diff --

This seems to be a limitation of the Spark benchmark code itself.





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229199434
  
--- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
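@@ -0,0 +1,27 @@
+================================================================================================
+Benchmark to measure CSV read/write performance
+================================================================================================
+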
+OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
+Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
--- End diff --

Actually, `GHz` is missing here, so it's hard to figure out which CPU was used. Several CPUs share this identifier:
```
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz [Family 6 Model 94 Stepping 3]
Intel(R) Core(TM) i7-6700T CPU @ 2.80GHz [Family 6 Model 94 Stepping 3]
Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz [Family 6 Model 94 Stepping 3]
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz [Family 6 Model 94 Stepping 3]
```





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229190198
  
--- Diff: sql/core/benchmarks/CSVBenchmarks-results.txt ---
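@@ -0,0 +1,27 @@
+================================================================================================
+Benchmark to measure CSV read/write performance
+================================================================================================
+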
+OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
--- End diff --

Wow. Did you run this on Windows 7?





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229019879
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
@@ -137,22 +124,15 @@ object CSVBenchmarks extends SQLHelper {
 ds.count()
   }
 
-  /*
-  Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
-
-  Count a dataset with 10 columns:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-  ------------------------------------------------------------------------------------------------
-  Select 10 columns + count()                 12598 / 12740          0.8        1259.8       1.0X
-  Select 1 column + count()                     7960 / 8175          1.3         796.0       1.6X
-  count()                                       2332 / 2386          4.3         233.2       5.4X
-  */
   benchmark.run()
 }
   }
 
-  def main(args: Array[String]): Unit = {
-    quotedValuesBenchmark(rowsNum = 50 * 1000, numIters = 3)
-    multiColumnsBenchmark(rowsNum = 1000 * 1000)
-    countBenchmark(10 * 1000 * 1000)
+  override def runBenchmarkSuite(): Unit = {
--- End diff --

+1 for @yucai's comment.





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-29 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r229011040
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
@@ -137,22 +124,15 @@ object CSVBenchmarks extends SQLHelper {
 ds.count()
   }
 
-  /*
-  Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
-
-  Count a dataset with 10 columns:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-  ------------------------------------------------------------------------------------------------
-  Select 10 columns + count()                 12598 / 12740          0.8        1259.8       1.0X
-  Select 1 column + count()                     7960 / 8175          1.3         796.0       1.6X
-  count()                                       2332 / 2386          4.3         233.2       5.4X
-  */
   benchmark.run()
 }
   }
 
-  def main(args: Array[String]): Unit = {
-    quotedValuesBenchmark(rowsNum = 50 * 1000, numIters = 3)
-    multiColumnsBenchmark(rowsNum = 1000 * 1000)
-    countBenchmark(10 * 1000 * 1000)
+  override def runBenchmarkSuite(): Unit = {
--- End diff --

#22872 has updated `runBenchmarkSuite`'s signature.
```suggestion
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
```
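
To see why the override has to change: once the base class declares `runBenchmarkSuite` with a `mainArgs` parameter, `override def runBenchmarkSuite(): Unit` no longer matches any inherited member and fails to compile. A minimal sketch, using a hypothetical `SuiteBase` trait as a stand-in for the real benchmark base class (whose contents are not shown in this thread); `CsvSuiteSketch` and `ran` are also illustrative names:

```scala
// Hypothetical stand-in for the benchmark base class after the signature change.
trait SuiteBase {
  // Subclasses implement the suite and receive the command-line arguments.
  def runBenchmarkSuite(mainArgs: Array[String]): Unit

  def main(args: Array[String]): Unit = runBenchmarkSuite(args)
}

object CsvSuiteSketch extends SuiteBase {
  // Records which cases ran, so the example is observable without Spark.
  val ran = scala.collection.mutable.ArrayBuffer.empty[String]

  // The parameter list must match the base declaration; mainArgs is unused here.
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    ran += "quotedValuesBenchmark"
    ran += "multiColumnsBenchmark"
    ran += "countBenchmark"
  }
}
```

Calling `CsvSuiteSketch.main(Array.empty)` runs the three cases in order. The case names are taken from the removed `main` method in the diff above.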





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-29 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r228831997
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
@@ -16,30 +16,30 @@
  */
 package org.apache.spark.sql.execution.datasources.csv
 
-import org.apache.spark.SparkConf
 import org.apache.spark.benchmark.Benchmark
-import org.apache.spark.sql.{Column, Row, SparkSession}
-import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.{Column, Row}
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
 import org.apache.spark.sql.functions.lit
 import org.apache.spark.sql.types._
 
 /**
  * Benchmark to measure CSV read/write performance.
- * To run this:
- *  spark-submit --class  --jars 
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars  

--- End diff --

Also update the usage in the PR description:
```console
bin/spark-submit --class org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks \
  --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar \
  ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
```





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-29 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22845#discussion_r228831306
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala ---
@@ -16,30 +16,30 @@
  */
 package org.apache.spark.sql.execution.datasources.csv
 
-import org.apache.spark.SparkConf
 import org.apache.spark.benchmark.Benchmark
-import org.apache.spark.sql.{Column, Row, SparkSession}
-import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.{Column, Row}
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
 import org.apache.spark.sql.functions.lit
 import org.apache.spark.sql.types._
 
 /**
  * Benchmark to measure CSV read/write performance.
- * To run this:
- *  spark-submit --class  --jars 
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars  

--- End diff --

Please update the `without sbt` usage to:
```
bin/spark-submit --class  --jars , 
```





[GitHub] spark pull request #22845: [SPARK-25848][SQL][TEST] Refactor CSVBenchmarks t...

2018-10-25 Thread heary-cao
GitHub user heary-cao opened a pull request:

https://github.com/apache/spark/pull/22845

[SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use main method

## What changes were proposed in this pull request?

Use spark-submit:
bin/spark-submit --class org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"

## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/heary-cao/spark CSVBenchmarks

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22845


commit 9ddb8476544fa34b15fbe15387e1b4983d4d76d4
Author: caoxuewen 
Date:   2018-10-26T04:07:48Z

Refactor CSVBenchmarks to use main method



