Our analytics infrastructure is built on Spark and Parquet.
We are trying to replace some of our queries that use the Spark RDD API
directly with SQL, via either Spark SQL or HiveQL.
Our motivation is to take advantage of the transparent projection and
predicate pushdown offered by Spark SQL, and to eliminate the need to
persist the RDD in memory. (Cache invalidation turned out to be a big
problem for us.)

The tests below were done with Spark 1.1.0 on CDH 5.1.0.
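For concreteness, here is a minimal sketch of the migration we have in
mind, using the Spark 1.1.0 API (the paths, the Event case class, and the
column names are made up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object PushdownSketch {
    // Hypothetical record type; our real schema is much wider.
    case class Event(userId: String, ts: Long, payload: String)

    def main(args: Array[String]) {
      val sc = new SparkContext(new SparkConf().setAppName("PushdownSketch"))
      val sqlContext = new SQLContext(sc)

      // Old approach: deserialize everything, pin it in memory, and
      // filter/project with hand-written RDD transformations.
      // val events = sc.objectFile[Event]("hdfs:///data/events").cache()
      // val recent = events.filter(_.ts > 1412000000L).map(e => (e.userId, e.ts))

      // New approach: query the Parquet files through Spark SQL, so ideally
      // only the referenced columns (and matching row groups) get read.
      val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
      events.registerTempTable("events")
      val recent = sqlContext.sql(
        "SELECT userId, ts FROM events WHERE ts > 1412000000")
      recent.collect().foreach(println)
    }
  }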


1. Spark SQL's (SQLContext) Parquet support was excellent for our case. The
ability to query in SQL and apply Scala functions as UDFs within the SQL is
extremely convenient (sketched below). Projection pushdown works
flawlessly; we are less sure about predicate pushdown (about 90% of the
fields in our dataset are optional, and I remember Michael Armbrust telling
me that a bug in Parquet prevents predicate pushdown on optional fields).
However, our timestamp-based duplicate removal requires windowing queries,
which do not parse in SQLContext.sql.
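To illustrate the UDF convenience, here is a minimal sketch of what worked
for us, again with made-up names (normalize, the path, and the columns);
the commented-out statement at the end is the kind of windowing query our
dedup needs, which the SQLContext parser rejects:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object UdfSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext(new SparkConf().setAppName("UdfSketch"))
      val sqlContext = new SQLContext(sc)
      val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
      events.registerTempTable("events")

      // A plain Scala closure registered as a SQL UDF (Spark 1.1.0 API).
      sqlContext.registerFunction("normalize", (s: String) => s.trim.toLowerCase)
      val cleaned = sqlContext.sql(
        "SELECT normalize(userId) AS uid, ts FROM events")
      cleaned.collect().foreach(println)

      // The windowing query we would need for dedup, which does not parse:
      //   SELECT userId, ts FROM (
      //     SELECT userId, ts,
      //            ROW_NUMBER() OVER (PARTITION BY userId ORDER BY ts DESC) rn
      //     FROM events) t
      //   WHERE rn = 1
    }
  }

(One workaround we may fall back to is doing the dedup with RDD operations,
e.g. a reduceByKey that keeps the row with the latest timestamp per key,
but that reintroduces exactly the hand-written code we were trying to
retire.)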

2. We then tried HiveQL via HiveContext, creating a Hive external table
backed by the same Parquet data (sketched below). In this mode, however,
projection pushdown does not seem to work: every query ends up reading the
entire Parquet dataset, which slows things down considerably. Please see
the attached screenshot.
Hive itself does not seem to have any issues with projection pushdown, so
this is strange. Could this be a configuration problem on our side?
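For reference, this is roughly how we set things up; the table name, path,
and columns are placeholders, and on older Hive builds the explicit
ParquetHiveSerDe/DeprecatedParquetInputFormat declarations would be needed
in place of STORED AS PARQUET:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object HiveParquetSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext(new SparkConf().setAppName("HiveParquetSketch"))
      val hiveContext = new HiveContext(sc)

      // External table over the same Parquet files used above.
      hiveContext.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS events_hive (userId STRING, ts BIGINT)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/events.parquet'""")

      // Only two columns are selected, yet in our setup this ends up
      // scanning the full dataset: the missing projection pushdown.
      val sample = hiveContext.sql("SELECT userId, ts FROM events_hive LIMIT 10")
      sample.collect().foreach(println)
    }
  }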

Thanks in advance,
Anand Mohan


SparkSQLHiveParquet.png (316K) 
<http://apache-spark-user-list.1001560.n3.nabble.com/attachment/15953/0/SparkSQLHiveParquet.png>



