I am running the following (connecting to an external Hive Metastore)

/a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf spark.sql.parquet.filterPushdown=true

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
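(For completeness, I believe the same flag can also be set from inside the shell via setConf; this is my assumption of the equivalent call:)

// my assumption: SQLContext.setConf takes string key/value pairs,
// equivalent to passing --conf on the command line
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")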

and then ran two queries:

sqlContext.sql("select count(*) from table where partition='blah' ")
andsqlContext.sql("select count(*) from table where partition='blah'
and epoch=1415561604")


According to the Input tab in the UI, both queries scan about 140 GB of data, which is
the size of my whole partition. So I have two questions --

1. Is there a way to tell from the plan whether a predicate pushdown is supposed
to happen?
I see this for the second query:

res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Aggregate false, [], [Coalesce(SUM(PartialCount#49L),0) AS _c0#0L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#49L]
   OutputFaker []
    Project []
     ParquetTableScan [epoch#139L], (ParquetRelation <list of hdfs files>
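(In case the metastore path is part of the problem, I could also test pushdown against the raw Parquet files directly, bypassing Hive. The path below is hypothetical -- I'd substitute the real partition directory:)

// read the partition's Parquet files directly, skipping the metastore
// (hypothetical path -- substitute the actual partition directory)
val raw = sqlContext.parquetFile("hdfs:///warehouse/table/partition=blah")
raw.registerTempTable("raw")
sqlContext.sql("select count(*) from raw where epoch = 1415561604")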

2. Am I doing something obviously wrong that this is not working? (I'm
guessing it's not working because the input size for the second query is
unchanged and the execution time is almost 2x as long.)

Thanks in advance for any insights.
