[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Li Xian (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308292#comment-17308292
 ] 

Li Xian commented on SPARK-26345:
-

[~dongjoon] Sure, I have created a new issue: 
https://issues.apache.org/jira/browse/SPARK-34859 

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Parquet 1.11 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201
>  
> Benchmark result:
> [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting 
> {{parquet.filter.columnindex.enabled}} to false.
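
As a rough illustration of the setting quoted above, the sketch below toggles `parquet.filter.columnindex.enabled` through the session configuration, in the same spirit as the `withSQLConf` call in the benchmark posted in this thread. The object name `ColumnIndexToggle` and the helper `withColumnIndex` are hypothetical; only the configuration key comes from the issue description, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark: runs a block with Parquet
// column-index filtering switched on or off, then restores the old value.
object ColumnIndexToggle {
  private val Key = "parquet.filter.columnindex.enabled"

  def withColumnIndex[T](spark: SparkSession, enabled: Boolean)(body: => T): T = {
    // Remember the previous value so the session is left unchanged afterwards.
    val previous = spark.conf.getOption(Key)
    spark.conf.set(Key, enabled.toString)
    try body finally {
      previous match {
        case Some(value) => spark.conf.set(Key, value)
        case None => spark.conf.unset(Key)
      }
    }
  }
}
```

A call such as `ColumnIndexToggle.withColumnIndex(spark, enabled = false) { spark.read.parquet(path).where("value = 42").count() }` would then run one query with the feature disabled, where `path` is whatever Parquet location is being tested.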



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Dongjoon Hyun (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308144#comment-17308144
 ] 

Dongjoon Hyun commented on SPARK-26345:
---

Thank you for reporting. Could you file a new JIRA, [~lxian2] and 
[~sha...@uber.com]?


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Xinli Shang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308000#comment-17308000
 ] 

Xinli Shang commented on SPARK-26345:
-

Yes, it needs some synchronization. I have a modified implementation 
in Presto. You can check it 
[here|https://github.com/shangxinli/presto/commit/f6327a161eb6cfd5137f679620e095d8257816b8#diff-bb24b92e28343804ebaf540efe6c1cda0b5e2524e6811f8fe2daee5944dad386R203].
 


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Li Xian (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307590#comment-17307590
 ] 

Li Xian commented on SPARK-26345:
-

[~yumwang] I think the current implementation has a problem: the pages returned 
by `readNextFilteredRowGroup` may not be aligned, so some columns may have more 
rows than others.

Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` 
with `rowIndexes` to make sure that rows are aligned. 

Currently, `VectorizedParquetRecordReader` has no such synchronization among 
pages from different columns, so using `readNextFilteredRowGroup` may produce 
incorrect results.
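
To make the alignment concern concrete, here is a toy model in plain Scala. It is only a sketch under simplifying assumptions and does not use the parquet-mr API: `Page`, `paginate`, `keepPages`, `readNaive` and `readSynchronized` are all hypothetical names. Two columns of the same row group use different page sizes, whole pages are kept whenever they overlap the selected row range, and the two read strategies are compared.

```scala
object RowAlignmentSketch {
  // A page holds the values of one column for a contiguous row range.
  final case class Page(firstRow: Long, values: Vector[String]) {
    def lastRow: Long = firstRow + values.length - 1
    def overlaps(from: Long, to: Long): Boolean = firstRow <= to && lastRow >= from
  }

  // Split a column of numRows values into pages of pageSize rows each.
  def paginate(name: String, numRows: Int, pageSize: Int): Vector[Page] =
    (0 until numRows).map(r => s"$name$r").toVector
      .grouped(pageSize).zipWithIndex
      .map { case (vals, i) => Page(i.toLong * pageSize, vals) }
      .toVector

  // Page-level filtering: keep every page that overlaps the selected rows.
  def keepPages(pages: Vector[Page], from: Long, to: Long): Vector[Page] =
    pages.filter(_.overlaps(from, to))

  // Naive read: emit all values of the surviving pages.
  def readNaive(pages: Vector[Page]): Vector[String] = pages.flatMap(_.values)

  // Synchronized read: track the row index of each value and drop the
  // rows that fall outside the selected range.
  def readSynchronized(pages: Vector[Page], from: Long, to: Long): Vector[String] =
    pages.flatMap { p =>
      p.values.zipWithIndex.collect {
        case (v, i) if p.firstRow + i >= from && p.firstRow + i <= to => v
      }
    }

  def main(args: Array[String]): Unit = {
    val (from, to) = (5L, 6L)
    val a = keepPages(paginate("a", 12, 4), from, to) // pages of 4 rows
    val b = keepPages(paginate("b", 12, 6), from, to) // pages of 6 rows
    println(s"naive: a=${readNaive(a).length} rows, b=${readNaive(b).length} rows")
    println(s"synchronized: a=${readSynchronized(a, from, to)}, b=${readSynchronized(b, from, to)}")
  }
}
```

In this model the naive read yields a different number of values per column because the page boundaries differ, while the row-index-aware read yields the same rows for both columns. A reader that stitches the naive outputs together positionally would pair values from different rows.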

 


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-12 Thread Yuming Wang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248526#comment-17248526
 ] 

Yuming Wang commented on SPARK-26345:
-

[~jamestaylor] Please see [https://github.com/apache/spark/pull/30517] for more 
details.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> good read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201




[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-12 Thread James R. Taylor (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248412#comment-17248412
 ] 

James R. Taylor commented on SPARK-26345:
-

Those results are excellent, [~yumwang]. I thought from earlier comments that 
vectorized reads in Spark weren't compatible with column indexing? Do the child 
JIRAs here fix that?


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-12 Thread Yuming Wang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248374#comment-17248374
 ] 

Yuming Wang commented on SPARK-26345:
-

Benchmark and benchmark result:

{code:scala}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import java.io.File

import scala.util.Random

import org.apache.spark.SparkConf
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{monotonically_increasing_id, timestamp_seconds}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType
import org.apache.spark.sql.types.{ByteType, Decimal, DecimalType}

/**
 * Benchmark to measure read performance with Parquet column index.
 * To run this benchmark:
 * {{{
 *   1. without sbt: bin/spark-submit --class
 *   2. build/sbt "sql/test:runMain "
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
 *      Results will be written to "benchmarks/ParquetFilterPushdownBenchmark-results.txt".
 * }}}
 */
object ParquetFilterPushdownBenchmark extends SqlBasedBenchmark {

  override def getSparkSession: SparkSession = {
    val conf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      // Since `spark.master` always exists, overrides this value
      .set("spark.master", "local[1]")
      .setIfMissing("spark.driver.memory", "3g")
      .setIfMissing("spark.executor.memory", "3g")
      .setIfMissing("orc.compression", "snappy")
      .setIfMissing("spark.sql.parquet.compression.codec", "snappy")

    SparkSession.builder().config(conf).getOrCreate()
  }

  private val numRows = 1024 * 1024 * 15
  private val width = 5
  private val mid = numRows / 2

  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    try f finally tableNames.foreach(spark.catalog.dropTempView)
  }

  private def prepareTable(
      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
    import spark.implicits._
    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
    val valueCol = if (useStringForValue) {
      monotonically_increasing_id().cast("string")
    } else {
      monotonically_increasing_id()
    }
    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
      .withColumn("value", valueCol)
      .sort("value")

    saveAsTable(df, dir)
  }

  private def prepareStringDictTable(
      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
    val selectExpr = (0 to width).map {
      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
      case i => s"CAST(rand() AS STRING) c$i"
    }
    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")

    saveAsTable(df, dir, true)
  }

  private def saveAsTable(df: DataFrame, dir: File, useDictionary: Boolean = false): Unit = {
    val parquetPath = dir.getCanonicalPath + "/parquet"
    df.write.mode("overwrite").parquet(parquetPath)
    spark.read.parquet(parquetPath).createOrReplaceTempView("parquetTable")
  }

  def filterPushDownBenchmark(
      values: Int,
      title: String,
      whereExpr: String,
      selectExpr: String = "*"): Unit = {
    val benchmark = new Benchmark(title, values, minNumIters = 5, output = output)

    // Run each query twice: once without and once with the Parquet column index.
    Seq(false, true).foreach { columnIndexEnabled =>
      val name = s"Parquet Vectorized ${if (columnIndexEnabled) "(columnIndex)" else ""}"
      benchmark.addCase(name) { _ =>
        withSQLConf("parquet.filter.columnindex.enabled" -> s"$columnIndexEnabled") {
          spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").noop()
        }
      }
    }

    benchmark.run()
  }

  private def runIntBenchmark(numRows: Int, width: Int, mid: Int): Unit = {
    Seq("value IS NULL", s"$mid < value AND value < $mid").foreach { whereExpr =>
      val title = s"Select 0 int row ($whereExpr)".replace("value AND value", "value")
      filterPushDownBenchmark(numRows, title, whereExpr)
    }

    Seq(
      s"value = $mid",
      s"value <=> $mid",
      s"$mid

[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-12 Thread Yuming Wang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248323#comment-17248323
 ] 

Yuming Wang commented on SPARK-26345:
-

We have a PR that tests compatibility with Parquet 1.11.1, Avro 1.10.1 and Hive 
2.3.8: https://github.com/apache/spark/pull/30517


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread Xinli Shang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248241#comment-17248241
 ] 

Xinli Shang commented on SPARK-26345:
-

The Presto and Iceberg efforts are not tied to each other. They just share some 
common code that I can reuse. The PR in Iceberg is 
https://github.com/apache/iceberg/pull/1566 and the issue for Presto is 
https://github.com/prestodb/presto/issues/15454 (the PR is under development now). 



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread James R. Taylor (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248236#comment-17248236
 ] 

James R. Taylor commented on SPARK-26345:
-

Thanks for the update, [~sha...@uber.com]. I had just read that blog and it 
does indeed look promising.

Is the Presto support you mentioned tied to Iceberg or is it independent of 
that? Any PRs I could follow along on?


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread Xinli Shang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248231#comment-17248231
 ] 

Xinli Shang commented on SPARK-26345:
-

Regarding performance, there is an engineering blog post written by Zoltán 
Borók-Nagy and Gábor Szádovszky. Here is the link: 
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/.
 

Once Spark is on Parquet 1.11.x, we can work on Column Index support for the Spark 
vectorized reader. Currently, I am working on integrating Column Index into 
Iceberg and Presto. The local testing on Iceberg also seems promising. 


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread James R. Taylor (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248216#comment-17248216
 ] 

James R. Taylor commented on SPARK-26345:
-

Any updates on this issue, [~zi]? Wouldn't column indexes help performance 
quite a bit, especially if the filtered column is clustered or sorted?
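
The intuition behind the question can be sketched with per-page min/max statistics, which is the kind of information a column index stores. This is a toy model in plain Scala, not parquet-mr code; `PageStats`, `pageStats` and `pagesToRead` are hypothetical names, and the page size of 100 values is an arbitrary assumption.

```scala
object PagePruningSketch {
  // Minimum and maximum value of one page.
  final case class PageStats(min: Long, max: Long)

  // Compute min/max statistics for consecutive pages of pageSize values.
  def pageStats(values: Seq[Long], pageSize: Int): Vector[PageStats] =
    values.grouped(pageSize).map(p => PageStats(p.min, p.max)).toVector

  // A page must be read for `value = target` only if target lies within [min, max].
  def pagesToRead(stats: Vector[PageStats], target: Long): Int =
    stats.count(s => s.min <= target && target <= s.max)

  def main(args: Array[String]): Unit = {
    // Sorted column: every page covers a narrow, disjoint value range.
    val sorted = Vector.tabulate(1000)(_.toLong)
    // Same values, interleaved so that every page spans almost the whole domain.
    val interleaved = Vector.tabulate(1000)(i => (i % 10) * 100L + i / 10)

    println(s"sorted:      ${pagesToRead(pageStats(sorted, 100), 500L)} of 10 pages")
    println(s"interleaved: ${pagesToRead(pageStats(interleaved, 100), 500L)} of 10 pages")
  }
}
```

With sorted data a point lookup touches a single page, whereas with the interleaved layout the min/max ranges of all pages contain the target, so nothing can be skipped.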


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Holden Karau (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162313#comment-17162313
 ] 

Holden Karau commented on SPARK-26345:
--

We normally don't assign issues until after the merge. Leaving a comment when 
you start working on an issue is a best practice to avoid people stepping on each 
other's toes.


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Felix Kizhakkel Jose (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162311#comment-17162311
 ] 

Felix Kizhakkel Jose commented on SPARK-26345:
--

[~sha...@uber.com] I don't have permission to assign it to you. Probably 
someone who is on the committers list can assign it to you.


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Xinli Shang (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162063#comment-17162063
 ] 

Xinli Shang commented on SPARK-26345:
-

[~yumwang] [~FelixKJose], you can assign this Jira to me. When I have time, I 
can start working on it. 


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-03-30 Thread Felix Kizhakkel Jose (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070982#comment-17070982
 ] 

Felix Kizhakkel Jose commented on SPARK-26345:
--

I have created a Jira in parquet-mr for a vectorized API: 
https://issues.apache.org/jira/browse/PARQUET-1830. But as per the discussion there, 
a short-term solution seems to be: "As Spark already use some internal API 
of parquet-mr we can step forward and implement the page skipping mechanism 
that is implemented in parquet-mr." [~gszadovszky]. 

So I am updating this Jira to track a short-term solution that benefits from the Column and 
Offset Index implementation in parquet-mr 1.11.


[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2019-01-31 Thread Zoltan Ivanfi (JIRA)



[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757209#comment-16757209
 ] 

Zoltan Ivanfi commented on SPARK-26345:
---

Please note that column indexes will automatically get utilized if 
[spark.sql.parquet.enableVectorizedReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html#spark.sql.parquet.enableVectorizedReader]
 = false. If spark.sql.parquet.enableVectorizedReader = true, on the other hand 
(which is the default), then column indexes could only be utilized by 
duplicating the internal logic of parquet-mr, which would be a disproportionate 
effort. We, the developers of the column index feature, did not expect Spark to 
make this huge investment, and we would like to provide a vectorized API 
instead in a future release.
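
For readers who want to try the non-vectorized path described in this comment, the following is a minimal sketch, assuming Spark is on the classpath and bundles a Parquet version that writes column indexes. The object name, the temporary directory and the column name `value` are placeholders; only `spark.sql.parquet.enableVectorizedReader` comes from the comment.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object NonVectorizedParquetRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("non-vectorized-parquet-read")
      .master("local[1]")
      .getOrCreate()

    // Hypothetical scratch location for the example data.
    val path = Files.createTempDirectory("column-index-demo").resolve("parquet").toString
    spark.range(0, 100000).toDF("value").sort("value").write.parquet(path)

    // Fall back to the parquet-mr record reader for subsequent Parquet scans.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    val matches = spark.read.parquet(path).where("value = 50000").count()
    println(s"matching rows: $matches")

    spark.stop()
  }
}
```

Whether pages are actually skipped in such a run depends on the bundled Parquet version and on filter pushdown being enabled, so the snippet shows only where the switch lives, not a measured benefit.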

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> good read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



