[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21288


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-17 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195949711
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    // Since `spark.master` always exists, override this value
+    .set("spark.master", "local[1]")
--- End diff --

I'm afraid that other developers might misunderstand how to use this:
```
spark-submit --master local[1] --class <this class> <spark sql test jar>
spark-submit --master local[*] --class <this class> <spark sql test jar>
```

In both cases, the benchmark always uses `local[1]`. Or are you suggesting 
a different point of view?
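
As a minimal sketch of the precedence at play (illustrative only, not code 
from this PR):
```scala
import org.apache.spark.SparkConf

// spark-submit exports --master as the system property `spark.master` before
// user code runs, so `new SparkConf()` already picks it up...
val conf = new SparkConf()
  .set("spark.master", "local[1]")  // ...but an unconditional set() replaces it
println(conf.get("spark.master"))   // prints "local[1]" whatever --master was
```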



---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195948346
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    // Since `spark.master` always exists, override this value
+    .set("spark.master", "local[1]")
--- End diff --

What I mean is adding `--master local[1]` at line 34, too.


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-17 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195946683
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    // Since `spark.master` always exists, override this value
+    .set("spark.master", "local[1]")
--- End diff --

btw, I updated the description. Thanks!


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-17 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195946600
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    // Since `spark.master` always exists, override this value
+    .set("spark.master", "local[1]")
--- End diff --

In the current PR, we cannot set `spark.master` via command-line options. 
Are you suggesting we drop `.set("spark.master", "local[1]")` and always set 
`spark.master` via the command-line options for this benchmark?
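
A minimal sketch of that alternative (illustrative only, not the code in this 
PR):
```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// With no `spark.master` set in code, the master must come from the command
// line (e.g. `spark-submit --master local[*] ...`); otherwise getOrCreate()
// fails with "A master URL must be set in your configuration".
val conf = new SparkConf().setAppName("FilterPushdownBenchmark")
val spark = SparkSession.builder().config(conf).getOrCreate()
```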


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195940979
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,442 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    // Since `spark.master` always exists, override this value
+    .set("spark.master", "local[1]")
--- End diff --

Could you update `m4.2xlarge` in the PR description and add `spark.master` 
at line 34, too?
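
Presumably the usage comment would then read as follows (my reading of the 
suggestion; the placeholders are illustrative):
```scala
/**
 * Benchmark to measure read performance with Filter pushdown.
 * To run this:
 *  spark-submit --master local[1] --class <this class> <spark sql test jar>
 */
```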


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-13 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195305634
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

Thank you for updating, @maropu .


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-13 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195304544
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

The result in v2.3.1: 
https://gist.github.com/maropu/88627246b7143ede5ab73c7183ab2128

So that is not a regression; I probably ran the benchmark on the wrong branch 
or commit.
I re-ran the benchmark on the current master and updated the PR.

How-to-run: I created a new `m4.2xlarge` instance, fetched this PR, rebased it 
onto master, and ran the benchmark.


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-06-13 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r195262751
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

I have time today, so I'll check v2.3.


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r191650442
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

Is it a regression?


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r191650406
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

How about 2.3?


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-29 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r191620766
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

That might be it, but the change feels too big... I suspect I made some 
mistakes in the last benchmark runs (though I haven't found the cause yet).


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r191610297
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

I have not tried it yet, but is it related to the recent change we made in 
the parquet reader?


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r191283013
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

Yeah, I think so, but I'm not sure. Even though I ran it multiple times, I 
couldn't reproduce the old performance values...


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-28 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r191280132
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,211 +132,214 @@ object FilterPushdownBenchmark {
 }
 
 /*
+OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64
 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8452 / 8504          1.9         537.3       1.0X
-Parquet Vectorized (Pushdown)                  274 /  281         57.3          17.4      30.8X
-Native ORC Vectorized                         8167 / 8185          1.9         519.3       1.0X
-Native ORC Vectorized (Pushdown)               365 /  379         43.1          23.2      23.1X
+Parquet Vectorized                            2961 / 3123          5.3         188.3       1.0X
+Parquet Vectorized (Pushdown)                 3057 / 3121          5.1         194.4       1.0X
--- End diff --

The difference is huge. What happened?


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-27 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r191109472
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    .setIfMissing("spark.master", "local[1]")
+    .setIfMissing("spark.driver.memory", "3g")
+    .setIfMissing("spark.executor.memory", "3g")
+    .setIfMissing("orc.compression", "snappy")
+    .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(
+      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+    import spark.implicits._
+    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+    val valueCol = if (useStringForValue) {
+      monotonically_increasing_id().cast("string")
+    } else {
+      monotonically_increasing_id()
+    }
+    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+      .withColumn("value", valueCol)
+      .sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def prepareStringDictTable(
+      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+    val selectExpr = (0 to width).map {
+      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+      case i => s"CAST(rand() AS STRING) c$i"
+    }
+    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").orc(dir)
+    spark.read.orc(dir).createOrReplaceTempView("orcTable")
+  }
+
+  private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").parquet(dir)
+    spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(
+      values: Int,
+      title: String,
+      whereExpr: String,
+      selectExpr: String = "*"): Unit = {
+    val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+    Seq(false, true).foreach { pushDownEnabled =>
+      val name 

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r190382044
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    .setIfMissing("spark.master", "local[1]")
+    .setIfMissing("spark.driver.memory", "3g")
+    .setIfMissing("spark.executor.memory", "3g")
+    .setIfMissing("orc.compression", "snappy")
+    .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(
+      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+    import spark.implicits._
+    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+    val valueCol = if (useStringForValue) {
+      monotonically_increasing_id().cast("string")
+    } else {
+      monotonically_increasing_id()
+    }
+    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+      .withColumn("value", valueCol)
+      .sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def prepareStringDictTable(
+      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+    val selectExpr = (0 to width).map {
+      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+      case i => s"CAST(rand() AS STRING) c$i"
+    }
+    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").orc(dir)
+    spark.read.orc(dir).createOrReplaceTempView("orcTable")
+  }
+
+  private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").parquet(dir)
+    spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(
+      values: Int,
+      title: String,
+      whereExpr: String,
+      selectExpr: String = "*"): Unit = {
+    val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+    Seq(false, true).foreach { pushDownEnabled =>
+      val 

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-21 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189780637
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    .setIfMissing("spark.master", "local[1]")
+    .setIfMissing("spark.driver.memory", "3g")
+    .setIfMissing("spark.executor.memory", "3g")
+    .setIfMissing("orc.compression", "snappy")
+    .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(
+      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+    import spark.implicits._
+    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+    val valueCol = if (useStringForValue) {
+      monotonically_increasing_id().cast("string")
+    } else {
+      monotonically_increasing_id()
+    }
+    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+      .withColumn("value", valueCol)
+      .sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def prepareStringDictTable(
+      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+    val selectExpr = (0 to width).map {
+      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+      case i => s"CAST(rand() AS STRING) c$i"
+    }
+    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").orc(dir)
+    spark.read.orc(dir).createOrReplaceTempView("orcTable")
+  }
+
+  private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").parquet(dir)
+    spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(
+      values: Int,
+      title: String,
+      whereExpr: String,
+      selectExpr: String = "*"): Unit = {
+    val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+    Seq(false, true).foreach { pushDownEnabled =>
+      val name 

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189639582
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    .setIfMissing("spark.master", "local[1]")
+    .setIfMissing("spark.driver.memory", "3g")
+    .setIfMissing("spark.executor.memory", "3g")
+    .setIfMissing("orc.compression", "snappy")
+    .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(
+      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+    import spark.implicits._
+    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+    val valueCol = if (useStringForValue) {
+      monotonically_increasing_id().cast("string")
+    } else {
+      monotonically_increasing_id()
+    }
+    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+      .withColumn("value", valueCol)
+      .sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def prepareStringDictTable(
+      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+    val selectExpr = (0 to width).map {
+      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+      case i => s"CAST(rand() AS STRING) c$i"
+    }
+    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").orc(dir)
+    spark.read.orc(dir).createOrReplaceTempView("orcTable")
+  }
+
+  private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").parquet(dir)
+    spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(
+      values: Int,
+      title: String,
+      whereExpr: String,
+      selectExpr: String = "*"): Unit = {
+    val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+    Seq(false, true).foreach { pushDownEnabled =>
+      val 

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189638131
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    .setIfMissing("spark.master", "local[1]")
+    .setIfMissing("spark.driver.memory", "3g")
+    .setIfMissing("spark.executor.memory", "3g")
+    .setIfMissing("orc.compression", "snappy")
+    .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(
+      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+    import spark.implicits._
+    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+    val valueCol = if (useStringForValue) {
+      monotonically_increasing_id().cast("string")
+    } else {
+      monotonically_increasing_id()
+    }
+    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+      .withColumn("value", valueCol)
+      .sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def prepareStringDictTable(
+      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+    val selectExpr = (0 to width).map {
+      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+      case i => s"CAST(rand() AS STRING) c$i"
+    }
+    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").orc(dir)
+    spark.read.orc(dir).createOrReplaceTempView("orcTable")
+  }
+
+  private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").parquet(dir)
+    spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(
+      values: Int,
+      title: String,
+      whereExpr: String,
+      selectExpr: String = "*"): Unit = {
+    val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+    Seq(false, true).foreach { pushDownEnabled =>
+      

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189635143
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    .setIfMissing("spark.master", "local[1]")
+    .setIfMissing("spark.driver.memory", "3g")
+    .setIfMissing("spark.executor.memory", "3g")
+    .setIfMissing("orc.compression", "snappy")
+    .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(
+      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+    import spark.implicits._
+    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+    val valueCol = if (useStringForValue) {
+      monotonically_increasing_id().cast("string")
+    } else {
+      monotonically_increasing_id()
+    }
+    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+      .withColumn("value", valueCol)
+      .sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def prepareStringDictTable(
+      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+    val selectExpr = (0 to width).map {
+      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+      case i => s"CAST(rand() AS STRING) c$i"
+    }
+    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").orc(dir)
+    spark.read.orc(dir).createOrReplaceTempView("orcTable")
+  }
+
+  private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").parquet(dir)
+    spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(
+      values: Int,
+      title: String,
+      whereExpr: String,
+      selectExpr: String = "*"): Unit = {
+    val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+    Seq(false, true).foreach { pushDownEnabled =>
+      val 

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-20 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189490682
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
  */
 object FilterPushdownBenchmark {
   val conf = new SparkConf()
-  conf.set("orc.compression", "snappy")
-  conf.set("spark.sql.parquet.compression.codec", "snappy")
+    .setMaster("local[1]")
--- End diff --

ok


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189175140
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
  */
 object FilterPushdownBenchmark {
   val conf = new SparkConf()
-  conf.set("orc.compression", "snappy")
-  conf.set("spark.sql.parquet.compression.codec", "snappy")
+    .setMaster("local[1]")
--- End diff --

I think you can do `.setIfMissing("spark.master", "local[1]")`; that way, 
perhaps we could get this to run on different backends, too.
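
A minimal sketch of the suggested behavior (illustrative only):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FilterPushdownBenchmark")
  // Only a default: a master passed via `spark-submit --master yarn` (for
  // example) is kept, so the benchmark could run on other backends.
  .setIfMissing("spark.master", "local[1]")
```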


---




[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189158065
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
 }
 
 /*
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
-Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
-Select 0 row (id IS NULL):               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7882 / 7957          2.0         501.1       1.0X
-Parquet Vectorized (Pushdown)                   55 /   60        285.2           3.5     142.9X
-Native ORC Vectorized                         5592 / 5627          2.8         355.5       1.4X
-Native ORC Vectorized (Pushdown)                66 /   70        237.2           4.2     118.9X
-
-Select 0 row (7864320 < id < 7864320):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7884 / 7909          2.0         501.2       1.0X
-Parquet Vectorized (Pushdown)                  739 /  752         21.3          47.0      10.7X
-Native ORC Vectorized                         5614 / 5646          2.8         356.9       1.4X
-Native ORC Vectorized (Pushdown)                81 /   83        195.2           5.1      97.8X
-
-Select 1 row (id = 7864320):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7905 / 8027          2.0         502.6       1.0X
-Parquet Vectorized (Pushdown)                  740 /  766         21.2          47.1      10.7X
-Native ORC Vectorized                         5684 / 5738          2.8         361.4       1.4X
-Native ORC Vectorized (Pushdown)                78 /   81        202.4           4.9     101.7X
-
-Select 1 row (id <=> 7864320):           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7928 / 7993          2.0         504.1       1.0X
-Parquet Vectorized (Pushdown)                  747 /  772         21.0          47.5      10.6X
-Native ORC Vectorized                         5728 / 5753          2.7         364.2       1.4X
-Native ORC Vectorized (Pushdown)                76 /   78        207.9           4.8     104.8X
-
-Select 1 row (7864320 <= id <= 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7939 / 8021          2.0         504.8       1.0X
-Parquet Vectorized (Pushdown)                  746 /  770         21.1          47.4      10.6X
-Native ORC Vectorized                         5690 / 5734          2.8         361.7       1.4X
-Native ORC Vectorized (Pushdown)                76 /   79        206.7           4.8     104.3X
-
-Select 1 row (7864319 < id < 7864321):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7972 / 8019          2.0         506.9       1.0X
-Parquet Vectorized (Pushdown)                  742 /  764         21.2          47.2      10.7X
-Native ORC Vectorized                         5704 / 5743          2.8         362.6       1.4X
-Native ORC Vectorized (Pushdown)                76 /   78        207.9           4.8     105.4X
-
-Select 10% rows (id < 1572864):          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8733 / 8808          1.8         555.2       1.0X
-Parquet Vectorized (Pushdown)                 2213 / 2267          7.1         140.7       3.9X
-Native ORC Vectorized                         6420 / 6463          2.4         408.2       1.4X
-Native ORC Vectorized (Pushdown)              1313 / 1331         12.0          83.5       6.7X
-
-Select 50% rows (id < 7864320):

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-17 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189120527
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
 }
 
 /*
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
-Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
-Select 0 row (id IS NULL):               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7882 / 7957          2.0         501.1       1.0X
-Parquet Vectorized (Pushdown)                   55 /   60        285.2           3.5     142.9X
-Native ORC Vectorized                         5592 / 5627          2.8         355.5       1.4X
-Native ORC Vectorized (Pushdown)                66 /   70        237.2           4.2     118.9X
-
-Select 0 row (7864320 < id < 7864320):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7884 / 7909          2.0         501.2       1.0X
-Parquet Vectorized (Pushdown)                  739 /  752         21.3          47.0      10.7X
-Native ORC Vectorized                         5614 / 5646          2.8         356.9       1.4X
-Native ORC Vectorized (Pushdown)                81 /   83        195.2           5.1      97.8X
-
-Select 1 row (id = 7864320):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7905 / 8027          2.0         502.6       1.0X
-Parquet Vectorized (Pushdown)                  740 /  766         21.2          47.1      10.7X
-Native ORC Vectorized                         5684 / 5738          2.8         361.4       1.4X
-Native ORC Vectorized (Pushdown)                78 /   81        202.4           4.9     101.7X
-
-Select 1 row (id <=> 7864320):           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7928 / 7993          2.0         504.1       1.0X
-Parquet Vectorized (Pushdown)                  747 /  772         21.0          47.5      10.6X
-Native ORC Vectorized                         5728 / 5753          2.7         364.2       1.4X
-Native ORC Vectorized (Pushdown)                76 /   78        207.9           4.8     104.8X
-
-Select 1 row (7864320 <= id <= 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7939 / 8021          2.0         504.8       1.0X
-Parquet Vectorized (Pushdown)                  746 /  770         21.1          47.4      10.6X
-Native ORC Vectorized                         5690 / 5734          2.8         361.7       1.4X
-Native ORC Vectorized (Pushdown)                76 /   79        206.7           4.8     104.3X
-
-Select 1 row (7864319 < id < 7864321):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7972 / 8019          2.0         506.9       1.0X
-Parquet Vectorized (Pushdown)                  742 /  764         21.2          47.2      10.7X
-Native ORC Vectorized                         5704 / 5743          2.8         362.6       1.4X
-Native ORC Vectorized (Pushdown)                76 /   78        207.9           4.8     105.4X
-
-Select 10% rows (id < 1572864):          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8733 / 8808          1.8         555.2       1.0X
-Parquet Vectorized (Pushdown)                 2213 / 2267          7.1         140.7       3.9X
-Native ORC Vectorized                         6420 / 6463          2.4         408.2       1.4X
-Native ORC Vectorized (Pushdown)              1313 / 1331         12.0          83.5       6.7X
-
-Select 50% rows (id < 7864320):

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189019277
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
 }
 
 /*
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
-Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
-Select 0 row (id IS NULL):               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7882 / 7957          2.0         501.1       1.0X
-Parquet Vectorized (Pushdown)                   55 /   60        285.2           3.5     142.9X
-Native ORC Vectorized                         5592 / 5627          2.8         355.5       1.4X
-Native ORC Vectorized (Pushdown)                66 /   70        237.2           4.2     118.9X
-
-Select 0 row (7864320 < id < 7864320):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7884 / 7909          2.0         501.2       1.0X
-Parquet Vectorized (Pushdown)                  739 /  752         21.3          47.0      10.7X
-Native ORC Vectorized                         5614 / 5646          2.8         356.9       1.4X
-Native ORC Vectorized (Pushdown)                81 /   83        195.2           5.1      97.8X
-
-Select 1 row (id = 7864320):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7905 / 8027          2.0         502.6       1.0X
-Parquet Vectorized (Pushdown)                  740 /  766         21.2          47.1      10.7X
-Native ORC Vectorized                         5684 / 5738          2.8         361.4       1.4X
-Native ORC Vectorized (Pushdown)                78 /   81        202.4           4.9     101.7X
-
-Select 1 row (id <=> 7864320):           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7928 / 7993          2.0         504.1       1.0X
-Parquet Vectorized (Pushdown)                  747 /  772         21.0          47.5      10.6X
-Native ORC Vectorized                         5728 / 5753          2.7         364.2       1.4X
-Native ORC Vectorized (Pushdown)                76 /   78        207.9           4.8     104.8X
-
-Select 1 row (7864320 <= id <= 7864320): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7939 / 8021          2.0         504.8       1.0X
-Parquet Vectorized (Pushdown)                  746 /  770         21.1          47.4      10.6X
-Native ORC Vectorized                         5690 / 5734          2.8         361.7       1.4X
-Native ORC Vectorized (Pushdown)                76 /   79        206.7           4.8     104.3X
-
-Select 1 row (7864319 < id < 7864321):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            7972 / 8019          2.0         506.9       1.0X
-Parquet Vectorized (Pushdown)                  742 /  764         21.2          47.2      10.7X
-Native ORC Vectorized                         5704 / 5743          2.8         362.6       1.4X
-Native ORC Vectorized (Pushdown)                76 /   78        207.9           4.8     105.4X
-
-Select 10% rows (id < 1572864):          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                            8733 / 8808          1.8         555.2       1.0X
-Parquet Vectorized (Pushdown)                 2213 / 2267          7.1         140.7       3.9X
-Native ORC Vectorized                         6420 / 6463          2.4         408.2       1.4X
-Native ORC Vectorized (Pushdown)              1313 / 1331         12.0          83.5       6.7X
-
-Select 50% rows (id < 7864320):

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189018667
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -105,138 +128,306 @@ object FilterPushdownBenchmark {
 }
 
 /*
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2
-Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
-
-Select 0 row (id IS NULL):               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                             7882 / 7957          2.0         501.1       1.0X
-Parquet Vectorized (Pushdown)                    55 /   60        285.2           3.5     142.9X
-Native ORC Vectorized                          5592 / 5627          2.8         355.5       1.4X
-Native ORC Vectorized (Pushdown)                 66 /   70        237.2           4.2     118.9X
-
-Select 0 row (7864320 < id < 7864320):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                             7884 / 7909          2.0         501.2       1.0X
-Parquet Vectorized (Pushdown)                   739 /  752         21.3          47.0      10.7X
-Native ORC Vectorized                          5614 / 5646          2.8         356.9       1.4X
-Native ORC Vectorized (Pushdown)                 81 /   83        195.2           5.1      97.8X
-
-Select 1 row (id = 7864320):              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                             7905 / 8027          2.0         502.6       1.0X
-Parquet Vectorized (Pushdown)                   740 /  766         21.2          47.1      10.7X
-Native ORC Vectorized                          5684 / 5738          2.8         361.4       1.4X
-Native ORC Vectorized (Pushdown)                 78 /   81        202.4           4.9     101.7X
-
-Select 1 row (id <=> 7864320):            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                             7928 / 7993          2.0         504.1       1.0X
-Parquet Vectorized (Pushdown)                   747 /  772         21.0          47.5      10.6X
-Native ORC Vectorized                          5728 / 5753          2.7         364.2       1.4X
-Native ORC Vectorized (Pushdown)                 76 /   78        207.9           4.8     104.8X
-
-Select 1 row (7864320 <= id <= 7864320):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                             7939 / 8021          2.0         504.8       1.0X
-Parquet Vectorized (Pushdown)                   746 /  770         21.1          47.4      10.6X
-Native ORC Vectorized                          5690 / 5734          2.8         361.7       1.4X
-Native ORC Vectorized (Pushdown)                 76 /   79        206.7           4.8     104.3X
-
-Select 1 row (7864319 < id < 7864321):    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                             7972 / 8019          2.0         506.9       1.0X
-Parquet Vectorized (Pushdown)                   742 /  764         21.2          47.2      10.7X
-Native ORC Vectorized                          5704 / 5743          2.8         362.6       1.4X
-Native ORC Vectorized (Pushdown)                 76 /   78        207.9           4.8     105.4X
-
-Select 10% rows (id < 1572864):           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
-Parquet Vectorized                             8733 / 8808          1.8         555.2       1.0X
-Parquet Vectorized (Pushdown)                  2213 / 2267          7.1         140.7       3.9X
-Native ORC Vectorized                          6420 / 6463          2.4         408.2       1.4X
-Native ORC Vectorized (Pushdown)               1313 / 1331         12.0          83.5       6.7X
-
-Select 50% rows (id < 7864320):

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-13 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r187822729
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
  */
 object FilterPushdownBenchmark {
   val conf = new SparkConf()
-  conf.set("orc.compression", "snappy")
-  conf.set("spark.sql.parquet.compression.codec", "snappy")
+.setMaster("local[1]")
+.setAppName("FilterPushdownBenchmark")
+.set("spark.driver.memory", "3g")
--- End diff --

Aha, ok. Looks good to me.
I just added this to match the other benchmark code, e.g., `TPCDSQueryBenchmark`.
If there is no problem, I'll fix the other places in a follow-up.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r187764083
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
  */
 object FilterPushdownBenchmark {
   val conf = new SparkConf()
-  conf.set("orc.compression", "snappy")
-  conf.set("spark.sql.parquet.compression.codec", "snappy")
+.setMaster("local[1]")
+.setAppName("FilterPushdownBenchmark")
+.set("spark.driver.memory", "3g")
--- End diff --

For these and the master, change to `setIfMissing()`? I think it would be great if these could be set via config.
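
For reference, a minimal sketch of that pattern, using the values from this diff; `setIfMissing` only writes a key that is not already set, so `--master` and `--conf` options passed to `spark-submit` would take precedence:

```
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch only: these act as defaults and lose to anything the user
// already set, e.g. spark-submit --master local[*] --conf spark.driver.memory=8g
val conf = new SparkConf()
  .setAppName("FilterPushdownBenchmark")
  .setIfMissing("spark.master", "local[1]")
  .setIfMissing("spark.driver.memory", "3g")

val spark = SparkSession.builder().config(conf).getOrCreate()
```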


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-09 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21288

[SPARK-24206][SQL] Improve FilterPushdownBenchmark benchmark code

## What changes were proposed in this pull request?
This PR adds string pushdown benchmark code to `FilterPushdownBenchmark` and updates the performance results.
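
As background for readers of the archive, here is a rough sketch of how a case in this kind of harness is wired up with Spark's internal `Benchmark` utility; the table name, predicate plumbing, and config toggling below are illustrative, not the exact code in this PR:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Benchmark

// Illustrative: time the same scan with pushdown disabled and enabled so
// the "(Pushdown)" rows in the report can be compared against the baseline.
def filterPushDownBenchmark(
    spark: SparkSession, numRows: Int, title: String, whereExpr: String): Unit = {
  val benchmark = new Benchmark(title, numRows)
  Seq(false, true).foreach { pushDownEnabled =>
    val name = s"Parquet Vectorized${if (pushDownEnabled) " (Pushdown)" else ""}"
    benchmark.addCase(name) { _ =>
      spark.conf.set("spark.sql.parquet.filterPushdown", pushDownEnabled.toString)
      spark.sql(s"SELECT * FROM parquetTable WHERE $whereExpr").collect()
    }
  }
  benchmark.run()
}
```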

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark UpdateParquetBenchmark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21288.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21288


commit 223bf2008abfe5fd41c3b5e741dc525ab3864977
Author: Takeshi Yamamuro 
Date:   2018-05-03T00:17:21Z

Fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org