Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

Raju Bairishetti Tue, 10 Jan 2017 20:15:53 -0800

Hello,

   Spark sql is generating query plan with all partitions information even
though if we apply filters on partitions in the query.  Due to this, spark
driver/hive metastore is hitting with OOM as each table is with lots of
partitions.


We can confirm from hive audit logs that it tries to *fetch all partitions*
from hive metastore.

 2016-12-28 07:18:33,749 INFO  [pool-4-thread-184]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub    ip=/x.x.x.x
cmd=get_partitions : db=xxxx tbl=xxxxx


Configured the following parameters in the spark conf to fix the above
issue(source: from spark-jira & github pullreq):

*spark.sql.hive.convertMetastoreParquet   false*
*    spark.sql.hive.metastorePartitionPruning   true*


*   plan:  rdf.explain*
*   == Physical Plan ==*
       HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
tablename, None,   [(year#314 = 2016),(month#315 = 12),(day#316 =
28),(hour#317 = 2),(venture#318 = DEFAULT)]

*    get_partitions_by_filter* method is called and fetching only required
partitions.

    But we are seeing parquetDecode errors in our applications frequently
after this. Looks like these decoding errors were because of changing
serde from spark-builtin to hive serde.

I feel like,* fixing query plan generation in the spark-sql* is the right
approach instead of forcing users to use hive serde.

Is there any workaround/way to fix this issue? I would like to hear more
thoughts on this :)

------
Thanks,
Raju Bairishetti,
www.lazada.com

Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

Reply via email to