On 8/13/15 6:11 AM, YaoPau wrote:
> I've seen this function referenced in a couple of places: first this forum post
> <https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>
> and this talk by Michael Armbrust
> <https://www.youtube.com/watch?v=6axUqHCu__Y> (around the 42nd minute).

> As I understand it, if you create a Parquet file using Spark, Spark will
> then have access to min/max values for each column. If a query asks for a
> value outside that range (like a timestamp), Spark will know to skip that
> file entirely.
Not all column types can be used in filter push-down. parquet-mr 1.7.0 and earlier only support a limited set of types; see here <https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ValidTypeMap.java#L66-L80>.

parquet-mr 1.8 relaxed this restriction; see PARQUET-201 <https://issues.apache.org/jira/browse/PARQUET-201>.

> Michael says this feature is turned off by default in 1.3. How can I turn
> this on?
You can turn it on by setting spark.sql.parquet.filterPushdown to true. This is already turned on by default in Spark 1.5.
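For example, from a PySpark shell (a minimal sketch; the path `/path/to/events.parquet` and the `ts` column are placeholders, and `sqlContext.read` requires Spark 1.4+):

```python
# Enable Parquet filter push-down (off by default before Spark 1.5).
# Assumes an existing SQLContext `sqlContext`, as provided by pyspark/spark-shell.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

# Or pass it when submitting the job:
#   spark-submit --conf spark.sql.parquet.filterPushdown=true ...

# With push-down enabled, a filter like this can skip whole row groups
# whose min/max statistics exclude the predicate value:
df = sqlContext.read.parquet("/path/to/events.parquet")
df.filter(df.ts > "2015-08-01").count()
```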

This was turned off by default because of a bug in parquet-mr 1.6.0rc3, PARQUET-136 <https://issues.apache.org/jira/browse/PARQUET-136>, which can cause an NPE. Also, PARQUET-173 prevents predicates containing AND from being pushed down (this affects only performance, not correctness).

> I don't see much about this feature online. A couple of other questions:
>
> - Does this only work for Parquet files that were created in Spark? For
>   example, if I create the Parquet file using Hive + MapReduce, or Impala,
>   would Spark still have access to min/max values?
Spark can access the statistics information in Parquet files generated by other systems.

It's a feature of Parquet rather than Spark: the statistics are always written into Parquet files. However, each system needs to implement its own filter push-down logic to leverage this information properly.

> - Does this feature work at the row chunk level, or just at the file level?
It works at the row chunk level (or "row group" in Parquet terminology).
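The row-group skipping described above can be sketched in plain Python (a toy model, not Parquet's actual reader API): each row group carries min/max statistics per column, and a reader evaluating a predicate only scans the groups whose range could satisfy it.

```python
# Toy model of Parquet row-group skipping using min/max statistics.
# Each "row group" records the (min, max) of a column; a reader evaluating
# `col > threshold` can skip any group whose max rules the predicate out.

def groups_to_scan(row_groups, threshold):
    """Return indices of row groups that may contain rows with col > threshold."""
    return [
        i for i, (lo, hi) in enumerate(row_groups)
        if hi > threshold  # group's max cannot satisfy the predicate otherwise
    ]

# Three row groups with (min, max) timestamp statistics:
stats = [(100, 199), (200, 299), (300, 399)]

# A query for col > 250 only needs to scan groups 1 and 2:
print(groups_to_scan(stats, 250))  # [1, 2]
```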



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-Parquet-Skipping-data-using-statistics-tp24233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
