On 8/13/15 6:11 AM, YaoPau wrote:
> I've seen this function referenced in a couple of places: first this forum post
> <https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>
> and this talk by Michael Armbrust
> <https://www.youtube.com/watch?v=6axUqHCu__Y> (around the 42nd minute).

> As I understand it, if you create a Parquet file using Spark, Spark will
> then have access to min/max values for each column. If a query asks for a
> value outside that range (like a timestamp), Spark will know to skip that
> file entirely.
Not all column types can be used in filter push-down. parquet-mr 1.7.0 and earlier only support a limited set of types; see here <https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ValidTypeMap.java#L66-L80>.

parquet-mr 1.8 relaxed this restriction; see PARQUET-201 <https://issues.apache.org/jira/browse/PARQUET-201>.

> Michael says this feature is turned off by default in 1.3. How can I turn
> this on?
You can turn it on by setting spark.sql.parquet.filterPushdown to true. This is already turned on by default in Spark 1.5.
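For example, from a PySpark shell (a minimal sketch; the path `/path/to/events.parquet` and the `ts` column are placeholders, and `sqlContext.read` requires Spark 1.4+):

```python
# Enable Parquet filter push-down (off by default before Spark 1.5).
# Assumes an existing SQLContext `sqlContext`, as provided by pyspark/spark-shell.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

# Or pass it when submitting the job:
#   spark-submit --conf spark.sql.parquet.filterPushdown=true ...

# With push-down enabled, a filter like this can skip whole row groups
# whose min/max statistics exclude the predicate value:
df = sqlContext.read.parquet("/path/to/events.parquet")
df.filter(df.ts > "2015-08-01").count()
```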

This was turned off by default because of a bug in parquet-mr 1.6.0rc3, PARQUET-136 <https://issues.apache.org/jira/browse/PARQUET-136>, which can cause an NPE. Also, PARQUET-173 prevents predicates containing AND from being pushed down (this affects only performance, not correctness).

> I don't see much about this feature online. A couple of other questions:
>
> - Does this only work for Parquet files that were created in Spark? For
>   example, if I create the Parquet file using Hive + MapReduce, or Impala,
>   would Spark still have access to min/max values?
Spark can access the statistics information in Parquet files generated by other systems.

It's a feature of Parquet rather than Spark: the statistics are always written into Parquet files. However, each system needs to implement its own filter push-down logic to leverage this information properly.

> - Does this feature work at the row chunk level, or just at the file level?
It works at the row chunk level (or "row group" in Parquet terminology).
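The row-group skipping described above can be sketched in plain Python (a toy model, not Parquet's actual reader API): each row group carries min/max statistics per column, and a reader evaluating a predicate only scans the groups whose range could satisfy it.

```python
# Toy model of Parquet row-group skipping using min/max statistics.
# Each "row group" records the (min, max) of a column; a reader evaluating
# `col > threshold` can skip any group whose max rules the predicate out.

def groups_to_scan(row_groups, threshold):
    """Return indices of row groups that may contain rows with col > threshold."""
    return [
        i for i, (lo, hi) in enumerate(row_groups)
        if hi > threshold  # group's max cannot satisfy the predicate otherwise
    ]

# Three row groups with (min, max) timestamp statistics:
stats = [(100, 199), (200, 299), (300, 399)]

# A query for col > 250 only needs to scan groups 1 and 2:
print(groups_to_scan(stats, 250))  # [1, 2]
```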



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-Parquet-Skipping-data-using-statistics-tp24233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
