[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523672856



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SubExprEliminationBenchmark.scala
##
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * The benchmarks aims to measure performance of the queries where there are subexpression
+ * elimination or not.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars ,
+ * 
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result:
+ *  SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *  Results will be written to "benchmarks/SubExprEliminationBenchmark-results.txt".
+ * }}}
+ */
+object SubExprEliminationBenchmark extends SqlBasedBenchmark {
+  import spark.implicits._
+
+  def withFromJson(rowsNum: Int, numIters: Int): Unit = {
+    val benchmark = new Benchmark("from_json as subExpr", rowsNum, output = output)
+
+    withTempPath { path =>
+      prepareDataInfo(benchmark)
+      val numCols = 1000
+      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)
+
+      val cols = (0 until numCols).map { idx =>
+        from_json('value, schema).getField(s"col$idx")
+      }
+
+      benchmark.addCase("subexpressionElimination off, codegen on", numIters) { _ =>
+        withSQLConf(
+          SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key -> "false",
+          SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
+          SQLConf.CODEGEN_FACTORY_MODE.key -> "CODEGEN_ONLY",
+          SQLConf.JSON_EXPRESSION_OPTIMIZATION.key -> "false") {
+          val df = spark.read
+            .text(path.getAbsolutePath)
+            .select(cols: _*)
+          df.collect()
+        }
+      }
+
+      benchmark.addCase("subexpressionElimination off, codegen off", numIters) { _ =>
+        withSQLConf(
+          SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key -> "false",
+          SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false",
+          SQLConf.CODEGEN_FACTORY_MODE.key -> "NO_CODEGEN",
+          SQLConf.JSON_EXPRESSION_OPTIMIZATION.key -> "false") {
+          val df = spark.read
+            .text(path.getAbsolutePath)
+            .select(cols: _*)
+          df.collect()
+        }
+      }
+
+      // We only benchmark subexpression performance under codegen/non-codegen, so disabling
+      // json optimization.

Review comment:
   Oh, this seems to be moved together to line 52.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523672856



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SubExprEliminationBenchmark.scala
##
@@ -0,0 +1,118 @@
+      // We only benchmark subexpression performance under codegen/non-codegen, so disabling
+      // json optimization.

Review comment:
   Oh, it seems that we need to move this comment block to line 52 because it's a global comment for all four runs.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523672856



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SubExprEliminationBenchmark.scala
##
@@ -0,0 +1,118 @@
+      // We only benchmark subexpression performance under codegen/non-codegen, so disabling
+      // json optimization.

Review comment:
   Oh, it seems that we need to move this comment block to line 52.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523670660



##
File path: sql/core/benchmarks/SubExprEliminationBenchmark-jdk11-results.txt
##
@@ -0,0 +1,15 @@
+
+Benchmark for performance of subexpression elimination
+
+
+Preparing data for benchmarking ...
+OpenJDK 64-Bit Server VM 11.0.9+11 on Mac OS X 10.15.6
+Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
+from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+subexpressionElimination off, codegen on           26809          27731         898         0.0   268094225.4       1.0X
+subexpressionElimination off, codegen off          25117          26612        1357         0.0   251166638.4       1.1X
+subexpressionElimination on, codegen on             2582           2906         282         0.0    25819408.7      10.4X

Review comment:
   Wow. It's faster in Java11?








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523487480



##
File path: sql/core/benchmarks/SubExprEliminationBenchmark-results.txt
##
@@ -7,9 +7,9 @@ OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-subexpressionElimination on, codegen on             2303           2543         238         0.0    23029833.1       1.0X
-subexpressionElimination on, codegen off           23107          23520         427         0.0   231069443.3       0.1X
-subexpressionElimination off, codegen on           23363          23848         421         0.0   233634044.9       0.1X
-subexpressionElimination off, codegen off          22997          23355         438         0.0   229974135.0       0.1X
+subexpressionElimination off, codegen on           24841          25365         803         0.0   248412787.5       1.0X
+subexpressionElimination off, codegen off          25344          26205         941         0.0   253442656.5       1.0X
+subexpressionElimination on, codegen on             2883           3019         119         0.0    28833086.8       8.6X

Review comment:
   Nice. It's clearly `8.6x`. :)
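
The `8.6x` figure can be checked against the Best Time(ms) column of the updated table. The sketch below assumes that the Relative column is the first case's best time divided by each case's best time; `RelativeColumn` and `relative` are illustrative names, not part of the PR or of Spark.

```scala
// Sketch only: recompute the "Relative" column from "Best Time(ms)".
// Assumption: Relative = (best time of the first case) / (best time of this case).
object RelativeColumn {
  // Best times in milliseconds, copied from the updated results table.
  val bestMs: Seq[(String, Double)] = Seq(
    "subexpressionElimination off, codegen on" -> 24841.0,
    "subexpressionElimination off, codegen off" -> 25344.0,
    "subexpressionElimination on, codegen on" -> 2883.0
  )

  // Ratio of the baseline time to a case's time; larger means faster than the baseline.
  def relative(baselineMs: Double, caseMs: Double): Double = baselineMs / caseMs

  def main(args: Array[String]): Unit = {
    val baseline = bestMs.head._2
    bestMs.foreach { case (name, ms) =>
      println(f"$name%-45s ${relative(baseline, ms)}%.1fX")
    }
  }
}
```

With these inputs, 24841 / 2883 is roughly 8.6, which matches the last row of the table.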








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523477798



##
File path: sql/core/benchmarks/SubExprEliminationBenchmark-results.txt
##
@@ -0,0 +1,15 @@
+
+Benchmark for performance of subexpression elimination
+
+
+Preparing data for benchmarking ...
+OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
+Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
+from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+subexpressionElimination on, codegen on             2303           2543         238         0.0    23029833.1       1.0X
+subexpressionElimination on, codegen off           23107          23520         427         0.0   231069443.3       0.1X
+subexpressionElimination off, codegen on           23363          23848         421         0.0   233634044.9       0.1X
+subexpressionElimination off, codegen off          22997          23355         438         0.0   229974135.0       0.1X

Review comment:
   If we are going to merge this first, we need to merge a subset of this PR. Only the last two.
   ```
   subexpressionElimination off, codegen on           23363          23848         421         0.0   233634044.9       0.1X
   subexpressionElimination off, codegen off          22997          23355         438         0.0   229974135.0       0.1X
   ```








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523468628



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SubExprEliminationBenchmark.scala
##
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * The benchmarks aims to measure performance of the queries where there are subexpression
+ * elimination or not.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars ,
+ * 
+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result:
+ *  SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *  Results will be written to "benchmarks/SubExprEliminationBenchmark-results.txt".
+ * }}}
+ */
+
+object SubExprEliminationBenchmark extends SqlBasedBenchmark {
+  import spark.implicits._
+
+  def withFromJson(rowsNum: Int, numIters: Int): Unit = {
+    val benchmark = new Benchmark("from_json as subExpr", rowsNum, output = output)
+
+    withTempPath { path =>
+      prepareDataInfo(benchmark)
+      val numCols = 1000
+      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)
+
+      val cols = (0 until numCols).map { idx =>
+        from_json('value, schema).getField(s"col$idx")
+      }
+
+      // We only benchmark subexpression performance under codegen/non-codegen, so disabling
+      // json optimization.
+      benchmark.addCase("subexpressionElimination on, codegen on", numIters) { _ =>
+        withSQLConf(
+            SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key -> "true",
+            SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
+            SQLConf.CODEGEN_FACTORY_MODE.key -> "CODEGEN_ONLY",
+            SQLConf.JSON_EXPRESSION_OPTIMIZATION.key -> "false") {
+          val df = spark.read
+            .text(path.getAbsolutePath)
+            .select(cols: _*)
+          df.collect()
+        }
+      }
+
+      benchmark.addCase("subexpressionElimination on, codegen off", numIters) { _ =>
+        withSQLConf(
+          SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key -> "true",
+          SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false",
+          SQLConf.CODEGEN_FACTORY_MODE.key -> "NO_CODEGEN",
+          SQLConf.JSON_EXPRESSION_OPTIMIZATION.key -> "false") {
+          val df = spark.read
+            .text(path.getAbsolutePath)
+            .select(cols: _*)
+          df.collect()
+        }
+      }
+
+      benchmark.addCase("subexpressionElimination off, codegen on", numIters) { _ =>

Review comment:
   Could you move lines 82 ~ 106 to before line 54?
   Then this will be the baseline, and we can easily say that your improvement is `xx times`.
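
To illustrate why the ordering matters, here is a minimal sketch under the assumption that the Relative column is always computed against the first case added to the benchmark. `BaselineOrder`, `BenchCase` and `relativeToFirst` are hypothetical names used only for this illustration; the best times are taken from the first results file quoted earlier in this thread.

```scala
// Sketch only: how the order of cases changes the reported "Relative" values.
// Assumption: every case is compared against the first case in the list.
object BaselineOrder {
  final case class BenchCase(name: String, bestMs: Double)

  // Relative speed of each case against the first one (the baseline).
  def relativeToFirst(cases: Seq[BenchCase]): Seq[(String, Double)] = {
    val baselineMs = cases.head.bestMs
    cases.map(c => c.name -> (baselineMs / c.bestMs))
  }

  def main(args: Array[String]): Unit = {
    val onOn = BenchCase("subexpressionElimination on, codegen on", 2303.0)
    val offOn = BenchCase("subexpressionElimination off, codegen on", 23363.0)
    val offOff = BenchCase("subexpressionElimination off, codegen off", 22997.0)

    // Optimized case first: the unoptimized cases show up as roughly 0.1X.
    relativeToFirst(Seq(onOn, offOn, offOff)).foreach(println)

    // Unoptimized case first: the optimized case shows up as roughly 10X.
    relativeToFirst(Seq(offOn, offOff, onOn)).foreach(println)
  }
}
```

Putting the unoptimized case first turns the improvement into a number greater than one, which is easier to quote than `0.1X` for everything else.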








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30379: [SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination

2020-11-14 Thread GitBox


dongjoon-hyun commented on a change in pull request #30379:
URL: https://github.com/apache/spark/pull/30379#discussion_r523468297



##
File path: sql/core/benchmarks/SubExprEliminationBenchmark-results.txt
##
@@ -0,0 +1,15 @@
+
+Benchmark for performance of subexpression elimination
+
+
+Preparing data for benchmarking ...
+OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6

Review comment:
   Could you run the benchmark with Java11 once more?

##
File path: sql/core/benchmarks/SubExprEliminationBenchmark-results.txt
##
@@ -0,0 +1,15 @@
+
+Benchmark for performance of subexpression elimination
+
+
+Preparing data for benchmarking ...
+OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6

Review comment:
   Could you run the benchmark with Java11 once more to have both result files?




