0 We have 7 tables and each table is partitioned by record_date.There is a query which involves inner join with all these tables and join is based on consumer_id. The join involves multiple partition join. Currently querying 1 week data takes very long time around 20-30 mins. We want to optimize this query. The root cause of this slowness is the data shuffling and map reduce.Each table contains around 8TB data on compression. Table is compressed with ORC-Zlib.
We tried bucketing along with the partitioning. So we bucketed each partition of the table into 100 buckets based on the consumer_id as consumer_id is the joining key. When there is a query involving one partition there is improvement in the query perf as it uses merge join. But whenever there is a query involving multiple partitions the query plan starts shuffling data and no improvements in query perf. We are using hive tez engine and vectorization is enabled.Can anyone suggest what approach we should follow when there is query with multiple tables join and multiple partitions?