[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308292#comment-17308292 ]

Li Xian commented on SPARK-26345:

[~dongjoon] Sure, I have created a new issue: https://issues.apache.org/jira/browse/SPARK-34859

> Parquet support Column indexes
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Fix For: 3.2.0
>
> Parquet 1.11 supports column indexing. Spark can support this feature for better read performance.
> More details: https://issues.apache.org/jira/browse/PARQUET-1201
>
> Benchmark result: [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting {{parquet.filter.columnindex.enabled}} to false.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308144#comment-17308144 ]

Dongjoon Hyun commented on SPARK-26345:

Thank you for reporting. Could you file a new JIRA, [~lxian2] and [~sha...@uber.com]?
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308000#comment-17308000 ]

Xinli Shang commented on SPARK-26345:

Yes, it needs some synchronization. I have a modified implementation in Presto. You can check it [here|https://github.com/shangxinli/presto/commit/f6327a161eb6cfd5137f679620e095d8257816b8#diff-bb24b92e28343804ebaf540efe6c1cda0b5e2524e6811f8fe2daee5944dad386R203].
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307590#comment-17307590 ]

Li Xian commented on SPARK-26345:

[~yumwang] I think the current implementation has a problem. The pages returned by `readNextFilteredRowGroup` may not be aligned: some columns may have more rows than others. Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned. Currently, `VectorizedParquetRecordReader` has no such synchronization among pages from different columns, so using `readNextFilteredRowGroup` may produce incorrect results.
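The misalignment described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Page`, `RowSyncSketch` and all of their methods are hypothetical names invented for illustration, and the code does not use or reproduce parquet-mr's `SynchronizingColumnReader`. It assumes two columns of the same row group whose pages have different boundaries, and a filter that selects rows 5 to 6.

```scala
// Hypothetical stand-in for a data page: the row-group index of its first row
// plus the values it holds. Not a parquet-mr type.
final case class Page(firstRowIndex: Long, values: Vector[String])

object RowSyncSketch {
  // Build a column of `rows` values split into pages of `pageSize` rows.
  def column(name: String, rows: Int, pageSize: Int): Seq[Page] =
    (0 until rows).map(i => s"$name$i").grouped(pageSize).zipWithIndex
      .map { case (vs, n) => Page(n.toLong * pageSize, vs.toVector) }.toSeq

  // Keep only pages that overlap the selected row range [from, to] (inclusive).
  def overlapping(pages: Seq[Page], from: Long, to: Long): Seq[Page] =
    pages.filter(p => p.firstRowIndex <= to && p.firstRowIndex + p.values.size - 1 >= from)

  // Naive read: take every value of every surviving page.
  def readUnsynchronized(pages: Seq[Page]): Vector[String] =
    pages.flatMap(_.values).toVector

  // Synchronized read: derive each value's row index and drop rows outside the range.
  def readSynchronized(pages: Seq[Page], from: Long, to: Long): Vector[String] =
    pages.flatMap { p =>
      p.values.zipWithIndex.collect {
        case (v, i) if p.firstRowIndex + i >= from && p.firstRowIndex + i <= to => v
      }
    }.toVector

  def main(args: Array[String]): Unit = {
    val a = column("a", rows = 12, pageSize = 4)
    val b = column("b", rows = 12, pageSize = 3)
    val (from, to) = (5L, 6L)
    // The surviving pages of the two columns cover different rows.
    println(readUnsynchronized(overlapping(a, from, to)))
    println(readUnsynchronized(overlapping(b, from, to)))
    // With row indexes, both columns yield exactly the selected rows.
    println(readSynchronized(overlapping(a, from, to), from, to))
    println(readSynchronized(overlapping(b, from, to), from, to))
  }
}
```

In this model the naive read returns a different number of values per column, so zipping the columns would pair values from different rows; tracking the row index of each value restores the alignment.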
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248526#comment-17248526 ]

Yuming Wang commented on SPARK-26345:

[~jamestaylor] Please see [https://github.com/apache/spark/pull/30517] for more details.
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248412#comment-17248412 ]

James R. Taylor commented on SPARK-26345:

Those results are excellent, [~yumwang]. I thought from earlier comments that vectorized reads in Spark weren't compatible with column indexing? Do the child JIRAs here fix that?
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248374#comment-17248374 ]

Yuming Wang commented on SPARK-26345:

Benchmark and benchmark result:
{code:scala}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import java.io.File

import scala.util.Random

import org.apache.spark.SparkConf
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{monotonically_increasing_id, timestamp_seconds}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType
import org.apache.spark.sql.types.{ByteType, Decimal, DecimalType}

/**
 * Benchmark to measure read performance with Parquet column index.
 * To run this benchmark:
 * {{{
 *   1. without sbt: bin/spark-submit --class
 *   2. build/sbt "sql/test:runMain "
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
 *      Results will be written to "benchmarks/ParquetFilterPushdownBenchmark-results.txt".
 * }}}
 */
object ParquetFilterPushdownBenchmark extends SqlBasedBenchmark {

  override def getSparkSession: SparkSession = {
    val conf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      // Since `spark.master` always exists, overrides this value
      .set("spark.master", "local[1]")
      .setIfMissing("spark.driver.memory", "3g")
      .setIfMissing("spark.executor.memory", "3g")
      .setIfMissing("orc.compression", "snappy")
      .setIfMissing("spark.sql.parquet.compression.codec", "snappy")

    SparkSession.builder().config(conf).getOrCreate()
  }

  private val numRows = 1024 * 1024 * 15
  private val width = 5
  private val mid = numRows / 2

  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    try f finally tableNames.foreach(spark.catalog.dropTempView)
  }

  private def prepareTable(
      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
    import spark.implicits._
    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
    val valueCol = if (useStringForValue) {
      monotonically_increasing_id().cast("string")
    } else {
      monotonically_increasing_id()
    }
    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
      .withColumn("value", valueCol)
      .sort("value")

    saveAsTable(df, dir)
  }

  private def prepareStringDictTable(
      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
    val selectExpr = (0 to width).map {
      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
      case i => s"CAST(rand() AS STRING) c$i"
    }
    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")

    saveAsTable(df, dir, true)
  }

  private def saveAsTable(df: DataFrame, dir: File, useDictionary: Boolean = false): Unit = {
    val parquetPath = dir.getCanonicalPath + "/parquet"
    df.write.mode("overwrite").parquet(parquetPath)
    spark.read.parquet(parquetPath).createOrReplaceTempView("parquetTable")
  }

  def filterPushDownBenchmark(
      values: Int,
      title: String,
      whereExpr: String,
      selectExpr: String = "*"): Unit = {
    val benchmark = new Benchmark(title, values, minNumIters = 5, output = output)

    Seq(false, true).foreach { columnIndexEnabled =>
      val name = s"Parquet Vectorized ${if (columnIndexEnabled) s"(columnIndex)" else ""}"
      benchmark.addCase(name) { _ =>
        withSQLConf("parquet.filter.columnindex.enabled" -> s"$columnIndexEnabled") {
          spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").noop()
        }
      }
    }

    benchmark.run()
  }

  private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
    Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
      val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
      filterPushDownBenchmark(numRows, title, whereExpr)
    }

    Seq(
      s"value = $mid",
      s"value <=> $mid",
      s"$mid
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248323#comment-17248323 ]

Yuming Wang commented on SPARK-26345:

We have a PR that tests compatibility with Parquet 1.11.1, Avro 1.10.1 and Hive 2.3.8: https://github.com/apache/spark/pull/30517
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248241#comment-17248241 ]

Xinli Shang commented on SPARK-26345:

The Presto and Iceberg efforts are not tied to each other; they just share some common code that I can reuse. The PR in Iceberg is https://github.com/apache/iceberg/pull/1566 and the issue for Presto is https://github.com/prestodb/presto/issues/15454 (the PR is under development now).
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248236#comment-17248236 ]

James R. Taylor commented on SPARK-26345:

Thanks for the update, [~sha...@uber.com]. I had just read that blog and it does indeed look promising. Is the Presto support you mentioned tied to Iceberg or is it independent of that? Any PRs I could follow along on?
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248231#comment-17248231 ]

Xinli Shang commented on SPARK-26345:

On performance, there is an engineering blog post I found online, written by Zoltán Borók-Nagy & Gábor Szádovszky (https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/). Once Spark is on Parquet 1.11.x, we can work on column index support for the Spark vectorized reader. Currently, I am working on integrating column indexes into Iceberg and Presto. Local testing on Iceberg also looks promising.
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248216#comment-17248216 ]

James R. Taylor commented on SPARK-26345:

Any updates on this issue, [~zi]? Wouldn't column indexes help performance quite a bit, especially if the filtered column is clustered or sorted?
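Why sorting or clustering matters can be seen with a toy model of per-page min/max statistics. This is a rough sketch, not Parquet's actual index format: `PageStats`, `PagePruningSketch`, the page size of 100 and the data are all made up for illustration. It counts how many pages an equality filter would still have to read when the same values are stored sorted versus interleaved.

```scala
// Hypothetical per-page statistics, loosely modelled on what a column index records.
final case class PageStats(min: Long, max: Long)

object PagePruningSketch {
  // Split the values into fixed-size pages and record each page's min and max.
  def buildStats(values: Seq[Long], pageSize: Int): Seq[PageStats] =
    values.grouped(pageSize).map(p => PageStats(p.min, p.max)).toSeq

  // For `value = target`, a page can be skipped unless target lies within [min, max].
  def pagesToRead(stats: Seq[PageStats], target: Long): Int =
    stats.count(s => s.min <= target && target <= s.max)

  def main(args: Array[String]): Unit = {
    val sorted: Seq[Long] = 0L until 1000L
    // Same values, interleaved so that every page spans almost the whole value range.
    val scattered: Seq[Long] = sorted.sortBy(v => (v % 10, v))

    println(s"sorted: ${pagesToRead(buildStats(sorted, 100), 555L)} page(s) to read")
    println(s"scattered: ${pagesToRead(buildStats(scattered, 100), 555L)} page(s) to read")
  }
}
```

With sorted data the min/max ranges of the pages do not overlap, so almost every page can be ruled out; with scattered data nearly every page's range contains the target and nothing is skipped.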
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162313#comment-17162313 ]

Holden Karau commented on SPARK-26345:

We don't normally assign issues until after the merge. Leaving a comment when you start working on it is a best practice to avoid people stepping on each other's toes.
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162311#comment-17162311 ]

Felix Kizhakkel Jose commented on SPARK-26345:

[~sha...@uber.com] I don't have permission to assign it to you. Probably someone on the committers list can assign it to you.
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162063#comment-17162063 ]

Xinli Shang commented on SPARK-26345:

[~yumwang] [~FelixKJose], you can assign this Jira to me. When I have time, I can start working on it.
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070982#comment-17070982 ]

Felix Kizhakkel Jose commented on SPARK-26345:

I have created a Jira in parquet-mr for a vectorized API: https://issues.apache.org/jira/browse/PARQUET-1830. But as per the discussion there, the short-term solution seems to be: "As Spark already use some internal API of parquet-mr we can step forward and implement the page skipping mechanism that is implemented in parquet-mr." ([~gszadovszky]). So I am updating this Jira to track a short-term solution that benefits from the Column and Offset Index implementation in parquet-mr 1.11.
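As a rough idea of what a page skipping mechanism has to compute, the sketch below turns per-page index entries into the row ranges that still need to be read for an equality filter. It is an assumption-laden toy, not parquet-mr's implementation: `PageIndexEntry`, `RowRange` and `RowRangesSketch` are hypothetical names, the min/max fields stand in for column index data, the first-row and row-count fields stand in for offset index data, and pages are assumed to be listed in row order.

```scala
// Hypothetical merged view of one page's index data (not a parquet-mr type).
final case class PageIndexEntry(firstRowIndex: Long, rowCount: Long, min: Long, max: Long)

// Inclusive range of row indexes within a row group.
final case class RowRange(from: Long, to: Long)

object RowRangesSketch {
  // Row ranges that may contain `target`, with adjacent or overlapping ranges merged.
  def matchingRanges(pages: Seq[PageIndexEntry], target: Long): List[RowRange] =
    pages
      .filter(p => p.min <= target && target <= p.max)
      .map(p => RowRange(p.firstRowIndex, p.firstRowIndex + p.rowCount - 1))
      .foldLeft(List.empty[RowRange]) {
        // Extend the most recent range when the next one touches or overlaps it.
        case (last :: rest, r) if r.from <= last.to + 1 =>
          RowRange(last.from, math.max(last.to, r.to)) :: rest
        case (acc, r) => r :: acc
      }
      .reverse

  def main(args: Array[String]): Unit = {
    val pages = Seq(
      PageIndexEntry(0, 100, 0, 9),
      PageIndexEntry(100, 100, 10, 19),
      PageIndexEntry(200, 100, 10, 30),
      PageIndexEntry(300, 100, 31, 40))
    println(matchingRanges(pages, 10L))
    println(matchingRanges(pages, 50L))
  }
}
```

The resulting row ranges are what every other column of the row group would then be read against, which is why the per-page first row index is needed in addition to the min/max values.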
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757209#comment-16757209 ]

Zoltan Ivanfi commented on SPARK-26345:

Please note that column indexes will automatically get utilized if [spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader] = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand (which is the default), then column indexes could only be utilized by duplicating the internal logic of parquet-mr, which would be a disproportionate effort. We, the developers of the column index feature, did not expect Spark to make this huge investment, and we would like to provide a vectorized API instead in a future release.

> Parquet support Column indexes
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for good read performance.
> More details: https://issues.apache.org/jira/browse/PARQUET-1201
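For readers who want to try the two switches named in this thread, here is a minimal sketch. It assumes a Spark build with Parquet 1.11 or later on the classpath and a local session; the object name and application name are made up for illustration. Setting `parquet.filter.columnindex.enabled` through the session configuration mirrors what the benchmark posted in this thread does with `withSQLConf`.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object, used only to show where the settings go.
object ColumnIndexConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("column-index-config-sketch")
      .getOrCreate()

    // Use the non-vectorized parquet-mr read path described in the comment above.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    // Parquet-side switch from the issue description; set it to "false" to disable
    // column index filtering.
    spark.conf.set("parquet.filter.columnindex.enabled", "true")

    println(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))
    println(spark.conf.get("parquet.filter.columnindex.enabled"))

    spark.stop()
  }
}
```

Whether disabling the vectorized reader pays off depends on the workload, since the vectorized path is the default for a reason; the sketch only shows where the settings live.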