Hi All,

We have 10 tables in our data warehouse (HDFS/Hive), written in ORC format. We currently serve a use case on top of them by joining 4-5 tables using Hive, but it is not as fast as we want it to be, so we are thinking of using Spark for this use case.

Any suggestions on this? Is it a good idea to use Spark for this use case?
Can we get better performance by using Spark?

Any pointers would be helpful.

Notes:

  *   Data is partitioned by date (yyyyMMdd) stored as an integer.
  *   The query fetches data for the last 7 days from some tables while joining
with other tables (a rough sketch of this filter is below the notes).
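
For reference, this is roughly how we filter the last 7 days (a minimal sketch; the table name dw.fact_events and the partition column date_key are just placeholders for our real ones, and `spark` is an existing SparkSession):

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.col

val fmt    = DateTimeFormatter.ofPattern("yyyyMMdd")
val cutoff = LocalDate.now().minusDays(7).format(fmt).toInt

// date_key is the integer yyyyMMdd partition column; filtering on it lets
// Spark prune down to the last 7 daily partitions of the ORC table
val lastWeek = spark.table("dw.fact_events")
  .filter(col("date_key") >= cutoff)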

The approach we have in mind as of now:

  *   Create a DataFrame for each table and repartition all of them by the same
column (let's say Country as the partition column)
  *   Register all of them as temporary tables
  *   Run the SQL query with the joins (rough sketch below this list)
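
A minimal sketch of that approach (table, column and view names like dw.orders, customer_id, etc. are placeholders for our real ones; `spark` is an existing SparkSession):

import org.apache.spark.sql.functions.col

// read the Hive/ORC tables and repartition both sides by Country
val orders    = spark.table("dw.orders").repartition(col("Country"))
val customers = spark.table("dw.customers").repartition(col("Country"))

// register them as temporary views
orders.createOrReplaceTempView("orders_tmp")
customers.createOrReplaceTempView("customers_tmp")

// run the join via SQL
val result = spark.sql("""
  SELECT o.Country, o.order_id, c.segment
  FROM   orders_tmp o
  JOIN   customers_tmp c
    ON   o.Country = c.Country
   AND   o.customer_id = c.customer_id   -- extra per-table join column
""")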

But the problem we are seeing with this approach is that, even though we already
repartitioned by Country, Spark still does hash partitioning + a shuffle during
the join. Every join condition contains the `Country` column plus some extra
column(s) depending on the table.
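
This is how we see it, by printing the plan of the joined DataFrame from the sketch above:

// the plan shows an Exchange (hashpartitioning on the full join keys) feeding
// the join on both sides, i.e. the shuffle we mention above, even though we
// already repartitioned by Country
result.explain(true)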

Is there any way to avoid these shuffles and improve performance?


Thanks and regards
Manjunath