[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21288
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195949711 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,442 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +// Since `spark.master` always exists, overrides this value +.set("spark.master", "local[1]") --- End diff -- I'm afraid that other developers might misunderstand how to use this:
```
spark-submit --master local[1] --class
spark-submit --master local[*] --class
```
In both cases, the benchmark always uses `local[1]`. Or do you have a different point of view?
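A minimal sketch of the precedence point being discussed here; the wrapper object and `main` method are illustrative only, but `SparkConf.set`, `SparkConf.setIfMissing`, and the `spark.master` key are the real Spark API:

```scala
import org.apache.spark.SparkConf

// Illustrative only: shows why `--master` given to spark-submit is ignored
// when the benchmark hard-codes `.set("spark.master", "local[1]")`.
object MasterPrecedenceSketch {
  def main(args: Array[String]): Unit = {
    // `new SparkConf()` loads `spark.*` system properties, including the
    // `spark.master` value that `spark-submit --master ...` passes along.
    val conf = new SparkConf()
      .setAppName("FilterPushdownBenchmark")
      .set("spark.master", "local[1]") // unconditional: always wins over --master

    // Prints "local[1]" no matter what --master was given on the command line;
    // with `.setIfMissing("spark.master", "local[1]")` instead, a submitted
    // --master value would be kept.
    println(conf.get("spark.master"))
  }
}
```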
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195948346 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,442 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +// Since `spark.master` always exists, overrides this value +.set("spark.master", "local[1]") --- End diff -- What I mean is adding `--master local[1]` at line 34, too.
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195946683 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,442 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +// Since `spark.master` always exists, overrides this value +.set("spark.master", "local[1]") --- End diff -- btw, I updated the description. Thanks!
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195946600 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,442 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +// Since `spark.master` always exists, overrides this value +.set("spark.master", "local[1]") --- End diff -- In the current pr, we cannot override `spark.master` via command-line options. Do you suggest we drop `.set("spark.master", "local[1]")` and always set `spark.master` through the command-line options for this benchmark?
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195940979 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,442 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +// Since `spark.master` always exists, overrides this value +.set("spark.master", "local[1]") --- End diff -- Could you update `m4.2xlarge` in the PR description and add `spark.master` at line 34, too?
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195305634 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- Thank you for updating, @maropu .
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195304544 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- The result in v2.3.1: https://gist.github.com/maropu/88627246b7143ede5ab73c7183ab2128 It is not a regression; I probably ran the bench on the wrong branch or commit. I re-ran the bench on the current master and updated the pr. How to run: I created a new `m4.2xlarge` instance, fetched this pr, rebased it onto master, and ran the bench.
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r195262751 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- I have time today, so I'll check v2.3.
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191650442 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- Is it a regression?
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191650406 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- How about 2.3?
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191620766 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- That might be it, but the change feels too big... I think I made some mistake in the last benchmark runs (though I haven't found the cause yet).
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191610297 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- I have not tried it yet, but is it related to the recent change we made in the parquet reader?
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191283013 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- Yea, I think so, but I'm not sure. I tried running it multiple times, but I didn't get the old performance values...
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191280132 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -131,211 +132,214 @@ object FilterPushdownBenchmark { } /* +OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.26-46.32.amzn1.x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz Select 0 string row (value IS NULL): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -Parquet Vectorized8452 / 8504 1.9 537.3 1.0X -Parquet Vectorized (Pushdown) 274 / 281 57.3 17.4 30.8X -Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X -Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X +Parquet Vectorized2961 / 3123 5.3 188.3 1.0X +Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X --- End diff -- The difference is huge. What happened?
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191109472 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +.setIfMissing("spark.master", "local[1]") +.setIfMissing("spark.driver.memory", "3g") +.setIfMissing("spark.executor.memory", "3g") +.setIfMissing("orc.compression", "snappy") +.setIfMissing("spark.sql.parquet.compression.codec", "snappy") + + private val spark = SparkSession.builder().config(conf).getOrCreate() + + def withTempPath(f: File => Unit): Unit = { +val path = Utils.createTempDir() +path.delete() +try f(path) finally Utils.deleteRecursively(path) + } + + def withTempTable(tableNames: String*)(f: => Unit): Unit = { +try f finally tableNames.foreach(spark.catalog.dropTempView) + } + + def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = { +val (keys, values) = pairs.unzip +val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption) +(keys, values).zipped.foreach(spark.conf.set) +try f finally { + keys.zip(currentValues).foreach { +case (key, Some(value)) => spark.conf.set(key, value) +case (key, None) => spark.conf.unset(key) + } +} + } + + private def prepareTable( + dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = { +import spark.implicits._ +val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i") +val valueCol = if (useStringForValue) { + monotonically_increasing_id().cast("string") +} else { + monotonically_increasing_id() +} +val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*) + .withColumn("value", valueCol) + .sort("value") + +saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def prepareStringDictTable( + dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = { +val selectExpr = (0 to width).map { + case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value" + case i => s"CAST(rand() AS STRING) c$i" +} +val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value") + 
+saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def saveAsOrcTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").orc(dir) +spark.read.orc(dir).createOrReplaceTempView("orcTable") + } + + private def saveAsParquetTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").parquet(dir) +spark.read.parquet(dir).createOrReplaceTempView("parquetTable") + } + + def filterPushDownBenchmark( + values: Int, + title: String, + whereExpr: String, + selectExpr: String = "*"): Unit = { +val benchmark = new Benchmark(title, values, minNumIters = 5) + +Seq(false, true).foreach { pushDownEnabled => + val name
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r190382044 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +.setIfMissing("spark.master", "local[1]") +.setIfMissing("spark.driver.memory", "3g") +.setIfMissing("spark.executor.memory", "3g") +.setIfMissing("orc.compression", "snappy") +.setIfMissing("spark.sql.parquet.compression.codec", "snappy") + + private val spark = SparkSession.builder().config(conf).getOrCreate() + + def withTempPath(f: File => Unit): Unit = { +val path = Utils.createTempDir() +path.delete() +try f(path) finally Utils.deleteRecursively(path) + } + + def withTempTable(tableNames: String*)(f: => Unit): Unit = { +try f finally tableNames.foreach(spark.catalog.dropTempView) + } + + def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = { +val (keys, values) = pairs.unzip +val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption) +(keys, values).zipped.foreach(spark.conf.set) +try f finally { + keys.zip(currentValues).foreach { +case (key, Some(value)) => spark.conf.set(key, value) +case (key, None) => spark.conf.unset(key) + } +} + } + + private def prepareTable( + dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = { +import spark.implicits._ +val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i") +val valueCol = if (useStringForValue) { + monotonically_increasing_id().cast("string") +} else { + monotonically_increasing_id() +} +val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*) + .withColumn("value", valueCol) + .sort("value") + +saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def prepareStringDictTable( + dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = { +val selectExpr = (0 to width).map { + case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value" + case i => s"CAST(rand() AS STRING) c$i" +} +val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value") + 
+saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def saveAsOrcTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").orc(dir) +spark.read.orc(dir).createOrReplaceTempView("orcTable") + } + + private def saveAsParquetTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").parquet(dir) +spark.read.parquet(dir).createOrReplaceTempView("parquetTable") + } + + def filterPushDownBenchmark( + values: Int, + title: String, + whereExpr: String, + selectExpr: String = "*"): Unit = { +val benchmark = new Benchmark(title, values, minNumIters = 5) + +Seq(false, true).foreach { pushDownEnabled => + val
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189780637 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +.setIfMissing("spark.master", "local[1]") +.setIfMissing("spark.driver.memory", "3g") +.setIfMissing("spark.executor.memory", "3g") +.setIfMissing("orc.compression", "snappy") +.setIfMissing("spark.sql.parquet.compression.codec", "snappy") + + private val spark = SparkSession.builder().config(conf).getOrCreate() + + def withTempPath(f: File => Unit): Unit = { +val path = Utils.createTempDir() +path.delete() +try f(path) finally Utils.deleteRecursively(path) + } + + def withTempTable(tableNames: String*)(f: => Unit): Unit = { +try f finally tableNames.foreach(spark.catalog.dropTempView) + } + + def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = { +val (keys, values) = pairs.unzip +val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption) +(keys, values).zipped.foreach(spark.conf.set) +try f finally { + keys.zip(currentValues).foreach { +case (key, Some(value)) => spark.conf.set(key, value) +case (key, None) => spark.conf.unset(key) + } +} + } + + private def prepareTable( + dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = { +import spark.implicits._ +val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i") +val valueCol = if (useStringForValue) { + monotonically_increasing_id().cast("string") +} else { + monotonically_increasing_id() +} +val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*) + .withColumn("value", valueCol) + .sort("value") + +saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def prepareStringDictTable( + dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = { +val selectExpr = (0 to width).map { + case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value" + case i => s"CAST(rand() AS STRING) c$i" +} +val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value") + 
+saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def saveAsOrcTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").orc(dir) +spark.read.orc(dir).createOrReplaceTempView("orcTable") + } + + private def saveAsParquetTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").parquet(dir) +spark.read.parquet(dir).createOrReplaceTempView("parquetTable") + } + + def filterPushDownBenchmark( + values: Int, + title: String, + whereExpr: String, + selectExpr: String = "*"): Unit = { +val benchmark = new Benchmark(title, values, minNumIters = 5) + +Seq(false, true).foreach { pushDownEnabled => + val name
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189639582 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +.setIfMissing("spark.master", "local[1]") +.setIfMissing("spark.driver.memory", "3g") +.setIfMissing("spark.executor.memory", "3g") +.setIfMissing("orc.compression", "snappy") +.setIfMissing("spark.sql.parquet.compression.codec", "snappy") + + private val spark = SparkSession.builder().config(conf).getOrCreate() + + def withTempPath(f: File => Unit): Unit = { +val path = Utils.createTempDir() +path.delete() +try f(path) finally Utils.deleteRecursively(path) + } + + def withTempTable(tableNames: String*)(f: => Unit): Unit = { +try f finally tableNames.foreach(spark.catalog.dropTempView) + } + + def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = { +val (keys, values) = pairs.unzip +val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption) +(keys, values).zipped.foreach(spark.conf.set) +try f finally { + keys.zip(currentValues).foreach { +case (key, Some(value)) => spark.conf.set(key, value) +case (key, None) => spark.conf.unset(key) + } +} + } + + private def prepareTable( + dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = { +import spark.implicits._ +val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i") +val valueCol = if (useStringForValue) { + monotonically_increasing_id().cast("string") +} else { + monotonically_increasing_id() +} +val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*) + .withColumn("value", valueCol) + .sort("value") + +saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def prepareStringDictTable( + dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = { +val selectExpr = (0 to width).map { + case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value" + case i => s"CAST(rand() AS STRING) c$i" +} +val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value") + 
+saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def saveAsOrcTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").orc(dir) +spark.read.orc(dir).createOrReplaceTempView("orcTable") + } + + private def saveAsParquetTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").parquet(dir) +spark.read.parquet(dir).createOrReplaceTempView("parquetTable") + } + + def filterPushDownBenchmark( + values: Int, + title: String, + whereExpr: String, + selectExpr: String = "*"): Unit = { +val benchmark = new Benchmark(title, values, minNumIters = 5) + +Seq(false, true).foreach { pushDownEnabled => + val
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189638131 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +.setIfMissing("spark.master", "local[1]") +.setIfMissing("spark.driver.memory", "3g") +.setIfMissing("spark.executor.memory", "3g") +.setIfMissing("orc.compression", "snappy") +.setIfMissing("spark.sql.parquet.compression.codec", "snappy") + + private val spark = SparkSession.builder().config(conf).getOrCreate() + + def withTempPath(f: File => Unit): Unit = { +val path = Utils.createTempDir() +path.delete() +try f(path) finally Utils.deleteRecursively(path) + } + + def withTempTable(tableNames: String*)(f: => Unit): Unit = { +try f finally tableNames.foreach(spark.catalog.dropTempView) + } + + def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = { +val (keys, values) = pairs.unzip +val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption) +(keys, values).zipped.foreach(spark.conf.set) +try f finally { + keys.zip(currentValues).foreach { +case (key, Some(value)) => spark.conf.set(key, value) +case (key, None) => spark.conf.unset(key) + } +} + } + + private def prepareTable( + dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = { +import spark.implicits._ +val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i") +val valueCol = if (useStringForValue) { + monotonically_increasing_id().cast("string") +} else { + monotonically_increasing_id() +} +val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*) + .withColumn("value", valueCol) + .sort("value") + +saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def prepareStringDictTable( + dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = { +val selectExpr = (0 to width).map { + case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value" + case i => s"CAST(rand() AS STRING) c$i" +} +val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value") 
+ +saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def saveAsOrcTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").orc(dir) +spark.read.orc(dir).createOrReplaceTempView("orcTable") + } + + private def saveAsParquetTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").parquet(dir) +spark.read.parquet(dir).createOrReplaceTempView("parquetTable") + } + + def filterPushDownBenchmark( + values: Int, + title: String, + whereExpr: String, + selectExpr: String = "*"): Unit = { +val benchmark = new Benchmark(title, values, minNumIters = 5) + +Seq(false, true).foreach { pushDownEnabled => +
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189635143 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.{Random, Try} + +import org.apache.spark.SparkConf +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.functions.monotonically_increasing_id +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.{Benchmark, Utils} + + +/** + * Benchmark to measure read performance with Filter pushdown. + * To run this: + * spark-submit --class + */ +object FilterPushdownBenchmark { + val conf = new SparkConf() +.setAppName("FilterPushdownBenchmark") +.setIfMissing("spark.master", "local[1]") +.setIfMissing("spark.driver.memory", "3g") +.setIfMissing("spark.executor.memory", "3g") +.setIfMissing("orc.compression", "snappy") +.setIfMissing("spark.sql.parquet.compression.codec", "snappy") + + private val spark = SparkSession.builder().config(conf).getOrCreate() + + def withTempPath(f: File => Unit): Unit = { +val path = Utils.createTempDir() +path.delete() +try f(path) finally Utils.deleteRecursively(path) + } + + def withTempTable(tableNames: String*)(f: => Unit): Unit = { +try f finally tableNames.foreach(spark.catalog.dropTempView) + } + + def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = { +val (keys, values) = pairs.unzip +val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption) +(keys, values).zipped.foreach(spark.conf.set) +try f finally { + keys.zip(currentValues).foreach { +case (key, Some(value)) => spark.conf.set(key, value) +case (key, None) => spark.conf.unset(key) + } +} + } + + private def prepareTable( + dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = { +import spark.implicits._ +val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i") +val valueCol = if (useStringForValue) { + monotonically_increasing_id().cast("string") +} else { + monotonically_increasing_id() +} +val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*) + .withColumn("value", valueCol) + .sort("value") + +saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def prepareStringDictTable( + dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = { +val selectExpr = (0 to width).map { + case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value" + case i => s"CAST(rand() AS STRING) c$i" +} +val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value") + 
+saveAsOrcTable(df, dir.getCanonicalPath + "/orc") +saveAsParquetTable(df, dir.getCanonicalPath + "/parquet") + } + + private def saveAsOrcTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").orc(dir) +spark.read.orc(dir).createOrReplaceTempView("orcTable") + } + + private def saveAsParquetTable(df: DataFrame, dir: String): Unit = { +df.write.mode("overwrite").parquet(dir) +spark.read.parquet(dir).createOrReplaceTempView("parquetTable") + } + + def filterPushDownBenchmark( + values: Int, + title: String, + whereExpr: String, + selectExpr: String = "*"): Unit = { +val benchmark = new Benchmark(title, values, minNumIters = 5) + +Seq(false, true).foreach { pushDownEnabled => + val
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189490682 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils} */ object FilterPushdownBenchmark { val conf = new SparkConf() - conf.set("orc.compression", "snappy") - conf.set("spark.sql.parquet.compression.codec", "snappy") +.setMaster("local[1]") --- End diff -- ok
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189175140 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils} */ object FilterPushdownBenchmark { val conf = new SparkConf() - conf.set("orc.compression", "snappy") - conf.set("spark.sql.parquet.compression.codec", "snappy") +.setMaster("local[1]") --- End diff -- I think you can do `.setIfMissing("spark.master", "local[1]")`; that way we could perhaps get this to run on different backends too.
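A sketch of what the suggested `setIfMissing` variant could look like, with a hypothetical submit command in a comment to show how another backend would then be picked; this mirrors the `setIfMissing` calls quoted elsewhere in this thread and is not a specific committed version of the file:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FilterPushdownBenchmark")
  // Used only when the caller did not supply a value, so e.g.
  //   spark-submit --master yarn --conf spark.executor.memory=8g --class <benchmark class> <jar>
  // would run the same benchmark against a different backend.
  .setIfMissing("spark.master", "local[1]")
  .setIfMissing("spark.driver.memory", "3g")
  .setIfMissing("spark.executor.memory", "3g")
```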
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189158065 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -105,138 +128,306 @@ object FilterPushdownBenchmark { } /* -Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2 -Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz - -Select 0 row (id IS NULL): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7882 / 7957 2.0 501.1 1.0X -Parquet Vectorized (Pushdown) 55 / 60285.2 3.5 142.9X -Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X -Native ORC Vectorized (Pushdown)66 / 70237.2 4.2 118.9X - -Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7884 / 7909 2.0 501.2 1.0X -Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X -Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X -Native ORC Vectorized (Pushdown)81 / 83195.2 5.1 97.8X - -Select 1 row (id = 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7905 / 8027 2.0 502.6 1.0X -Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X -Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X -Native ORC Vectorized (Pushdown)78 / 81202.4 4.9 101.7X - -Select 1 row (id <=> 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7928 / 7993 2.0 504.1 1.0X -Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X -Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 104.8X - -Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7939 / 8021 2.0 504.8 1.0X -Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X -Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X -Native ORC Vectorized (Pushdown)76 / 79206.7 4.8 104.3X - -Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7972 / 8019 2.0 506.9 1.0X -Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X -Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 105.4X - -Select 10% rows (id < 1572864): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized8733 / 8808 1.8 555.2 1.0X -Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X -Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X -Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X - -Select 50% rows (id < 7864320):
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189120527 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -105,138 +128,306 @@ object FilterPushdownBenchmark { } /* -Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2 -Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz - -Select 0 row (id IS NULL): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7882 / 7957 2.0 501.1 1.0X -Parquet Vectorized (Pushdown) 55 / 60285.2 3.5 142.9X -Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X -Native ORC Vectorized (Pushdown)66 / 70237.2 4.2 118.9X - -Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7884 / 7909 2.0 501.2 1.0X -Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X -Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X -Native ORC Vectorized (Pushdown)81 / 83195.2 5.1 97.8X - -Select 1 row (id = 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7905 / 8027 2.0 502.6 1.0X -Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X -Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X -Native ORC Vectorized (Pushdown)78 / 81202.4 4.9 101.7X - -Select 1 row (id <=> 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7928 / 7993 2.0 504.1 1.0X -Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X -Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 104.8X - -Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7939 / 8021 2.0 504.8 1.0X -Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X -Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X -Native ORC Vectorized (Pushdown)76 / 79206.7 4.8 104.3X - -Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7972 / 8019 2.0 506.9 1.0X -Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X -Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 105.4X - -Select 10% rows (id < 1572864): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized8733 / 8808 1.8 555.2 1.0X -Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X -Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X -Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X - -Select 50% rows (id < 7864320):
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189019277 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -105,138 +128,306 @@ object FilterPushdownBenchmark { } /* -Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2 -Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz - -Select 0 row (id IS NULL): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7882 / 7957 2.0 501.1 1.0X -Parquet Vectorized (Pushdown) 55 / 60285.2 3.5 142.9X -Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X -Native ORC Vectorized (Pushdown)66 / 70237.2 4.2 118.9X - -Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7884 / 7909 2.0 501.2 1.0X -Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X -Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X -Native ORC Vectorized (Pushdown)81 / 83195.2 5.1 97.8X - -Select 1 row (id = 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7905 / 8027 2.0 502.6 1.0X -Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X -Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X -Native ORC Vectorized (Pushdown)78 / 81202.4 4.9 101.7X - -Select 1 row (id <=> 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7928 / 7993 2.0 504.1 1.0X -Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X -Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 104.8X - -Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7939 / 8021 2.0 504.8 1.0X -Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X -Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X -Native ORC Vectorized (Pushdown)76 / 79206.7 4.8 104.3X - -Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7972 / 8019 2.0 506.9 1.0X -Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X -Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 105.4X - -Select 10% rows (id < 1572864): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized8733 / 8808 1.8 555.2 1.0X -Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X -Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X -Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X - -Select 50% rows (id < 7864320):
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r189018667 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -105,138 +128,306 @@ object FilterPushdownBenchmark { } /* -Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16 on Mac OS X 10.13.2 -Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz - -Select 0 row (id IS NULL): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7882 / 7957 2.0 501.1 1.0X -Parquet Vectorized (Pushdown) 55 / 60285.2 3.5 142.9X -Native ORC Vectorized 5592 / 5627 2.8 355.5 1.4X -Native ORC Vectorized (Pushdown)66 / 70237.2 4.2 118.9X - -Select 0 row (7864320 < id < 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7884 / 7909 2.0 501.2 1.0X -Parquet Vectorized (Pushdown) 739 / 752 21.3 47.0 10.7X -Native ORC Vectorized 5614 / 5646 2.8 356.9 1.4X -Native ORC Vectorized (Pushdown)81 / 83195.2 5.1 97.8X - -Select 1 row (id = 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7905 / 8027 2.0 502.6 1.0X -Parquet Vectorized (Pushdown) 740 / 766 21.2 47.1 10.7X -Native ORC Vectorized 5684 / 5738 2.8 361.4 1.4X -Native ORC Vectorized (Pushdown)78 / 81202.4 4.9 101.7X - -Select 1 row (id <=> 7864320): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7928 / 7993 2.0 504.1 1.0X -Parquet Vectorized (Pushdown) 747 / 772 21.0 47.5 10.6X -Native ORC Vectorized 5728 / 5753 2.7 364.2 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 104.8X - -Select 1 row (7864320 <= id <= 7864320):Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7939 / 8021 2.0 504.8 1.0X -Parquet Vectorized (Pushdown) 746 / 770 21.1 47.4 10.6X -Native ORC Vectorized 5690 / 5734 2.8 361.7 1.4X -Native ORC Vectorized (Pushdown)76 / 79206.7 4.8 104.3X - -Select 1 row (7864319 < id < 7864321): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized7972 / 8019 2.0 506.9 1.0X -Parquet Vectorized (Pushdown) 742 / 764 21.2 47.2 10.7X -Native ORC Vectorized 5704 / 5743 2.8 362.6 1.4X -Native ORC Vectorized (Pushdown)76 / 78207.9 4.8 105.4X - -Select 10% rows (id < 1572864): Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Parquet Vectorized8733 / 8808 1.8 555.2 1.0X -Parquet Vectorized (Pushdown) 2213 / 2267 7.1 140.7 3.9X -Native ORC Vectorized 6420 / 6463 2.4 408.2 1.4X -Native ORC Vectorized (Pushdown) 1313 / 1331 12.0 83.5 6.7X - -Select 50% rows (id < 7864320):
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r187822729 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils} */ object FilterPushdownBenchmark { val conf = new SparkConf() - conf.set("orc.compression", "snappy") - conf.set("spark.sql.parquet.compression.codec", "snappy") +.setMaster("local[1]") +.setAppName("FilterPushdownBenchmark") +.set("spark.driver.memory", "3g") --- End diff -- aha, ok. Looks good to me. I just added this following the other benchmark code, e.g., `TPCDSQueryBenchmark`. If there is no problem, I'll fix the other places in a follow-up.
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r187764083 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala --- @@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils} */ object FilterPushdownBenchmark { val conf = new SparkConf() - conf.set("orc.compression", "snappy") - conf.set("spark.sql.parquet.compression.codec", "snappy") +.setMaster("local[1]") +.setAppName("FilterPushdownBenchmark") +.set("spark.driver.memory", "3g") --- End diff -- For these and the master setting: change to `setIfMissing()`? I think it would be great if these could be set via config.
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/21288 [SPARK-24206][SQL] Improve FilterPushdownBenchmark benchmark code

## What changes were proposed in this pull request?

This pr adds string pushdown cases to the benchmark code in `FilterPushdownBenchmark` and updates the performance results.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark UpdateParquetBenchmark

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21288.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21288

commit 223bf2008abfe5fd41c3b5e741dc525ab3864977
Author: Takeshi Yamamuro
Date: 2018-05-03T00:17:21Z

    Fix
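For readers skimming the thread, a rough sketch of how a string-pushdown case plugs into the `filterPushDownBenchmark(values, title, whereExpr, selectExpr)` and `prepareTable` helpers quoted in the diffs above; the concrete row count and predicate are made up for illustration, not copied from the PR:

```scala
// Hypothetical driver code, assuming the withTempPath/withTempTable/prepareTable/
// filterPushDownBenchmark helpers shown in the quoted diff are in scope.
val numRows = 1024 * 1024 * 15

withTempPath { dir =>
  withTempTable("orcTable", "parquetTable") {
    // Writes ORC and Parquet copies of a table whose `value` column is a string.
    prepareTable(dir, numRows, width = 5, useStringForValue = true)

    // Selects a single row by its string-typed `value` column, so a pushed-down
    // filter can prune row groups / stripes instead of scanning everything.
    filterPushDownBenchmark(
      numRows,
      title = "Select 1 string row (value = '100')",
      whereExpr = "value = '100'")
  }
}
```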