Our analytics infrastructure is built on Spark and Parquet. We are trying to migrate some of our queries from the direct Spark RDD API to SQL, using either Spark SQL or HiveQL. Our motivation was to take advantage of the transparent projection and predicate pushdown offered by Spark SQL, and to eliminate the need to persist the RDD in memory (cache invalidation turned out to be a big problem for us).
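For concreteness, the Spark SQL path we are moving to looks roughly like the sketch below (a minimal sketch only: the path, table and column names are invented, and the API shown is Spark 1.1.0's SQLContext):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: the path and field names below are made up.
val sc = new SparkContext(new SparkConf().setAppName("events-sql"))
val sqlContext = new SQLContext(sc)

// Load the Parquet data and expose it as a temporary table (Spark 1.1.0 API).
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

// A plain Scala function registered as a SQL UDF.
sqlContext.registerFunction("hourBucket", (ts: Long) => ts / 3600000L)

// Only the referenced columns should be read from the Parquet files
// (projection pushdown); the WHERE clause is a pushdown candidate.
val recent = sqlContext.sql(
  "SELECT user_id, hourBucket(event_ts) AS hr FROM events WHERE country = 'US'")
recent.take(10).foreach(println)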
The tests below were done with Spark 1.1.0 on CDH 5.1.0.

1. Spark SQL's (SQLContext) Parquet support was excellent for our case. The ability to query in SQL and to apply Scala functions as UDFs inside the SQL is extremely convenient. Projection pushdown works flawlessly; I am less sure about predicate pushdown (about 90% of the fields in our dataset are optional, and I remember Michael Armbrust telling me there is a bug in Parquet that prevents predicate pushdown on optional fields). However, our timestamp-based duplicate removal requires windowing queries, which do not work in SQLContext's SQL parsing mode.

2. We then tried HiveQL via HiveContext, creating a Hive external table backed by the same Parquet data (roughly as sketched at the end of this message). However, in this mode projection pushdown does not seem to work, and every query ends up reading the whole Parquet dataset, which slows things down a lot. Please see the attached screenshot. Hive itself has no issues with projection pushdown on this table, so this is weird. Could this be due to a configuration problem?

Thanks in advance,
Anand Mohan

SparkSQLHiveParquet.png (316K) <http://apache-spark-user-list.1001560.n3.nabble.com/attachment/15953/0/SparkSQLHiveParquet.png>
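For reference, the Hive side of point 2 was set up roughly as follows (again a sketch: the table, column names and location are invented, and the SerDe/input-format class names are the ones I believe CDH 5.1's Hive 0.12 Parquet support uses; newer Hive versions have STORED AS PARQUET instead):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// External table over the same Parquet files. On Hive 0.12 (CDH 5.1),
// Parquet access goes through the SerDe and input/output format classes below.
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events_hive
    (user_id STRING, event_ts BIGINT, country STRING)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
  LOCATION 'hdfs:///data/events.parquet'
""")

// Even though only one column is selected, the input size reported in the
// Spark UI still matches the full Parquet dataset, i.e. no projection pushdown.
hiveContext.sql("SELECT user_id FROM events_hive").count()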