Anand, the best place to understand join queries in Hive is the presentation by Namit Jain from Facebook.
Here is the PDF: https://cwiki.apache.org/Hive/presentations.data/Hive%20Summit%202011-join.pdf
You can also search for the video on YouTube. It is very well described.

On Sun, Apr 1, 2012 at 11:59 PM, Ladda, Anand <[email protected]> wrote:

> I am trying to understand what are some of the options/settings
> available to tune the performance of Hive queries. I have seen the benefits
> of map-side joins and partitioning/clustering. However, I have yet to
> realize the impact map-side aggregation has on query performance. I tried
> running this query with and without map-side join turned on and did
> not see much difference in the execution times. The raw data in this
> partition is about 5.5 million. Looking for some pointers to see what type
> of queries benefit from map-side aggregation.
>
> set hive.auto.convert.join=false;
> set hive.map.aggr=false;
>
> Non-partitioned, non-clustered single table with where clause on date and
> no map-side aggregation:
>
> select a11.emp_id, count(1), count(distinct a11.customer_id),
> sum(a11.qty_sold) from orderdetailrcfile a11 where order_date = '01-01-2008'
> group by a11.emp_id;
>
> 400 secs
>
> set hive.map.aggr=true;
>
> Non-partitioned, non-clustered single table with where clause on date and
> map-side aggregation:
>
> select a11.emp_id, count(1), count(distinct a11.customer_id),
> sum(a11.qty_sold) from orderdetailrcfile a11 where order_date = '01-01-2008'
> group by a11.emp_id;
>
> 390 secs
>
> Also, is there any reason not to turn on map-side joins all the time? In my
> tests I have always seen the performance either be the same or improve with
> map-side joins turned on. Are there any other parameters or Hive features
> that can help improve the performance of Hive queries?
>
> Thanks
> Anand

--
Nitin Pawar
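On the question of which queries benefit from map-side aggregation, a rough way to think about it is to count how many rows each mapper has to send to the reducers. The sketch below is a toy Python model, not Hive's implementation. The function names, the row counts, and the cardinalities (50 employees, 20,000 customers, 4 mappers) are all made up for illustration. It also assumes that when a query contains count(distinct col), the distinct column effectively becomes part of the map-side grouping key, so the mapper can only collapse rows that share both emp_id and customer_id. If that assumption holds for the query above, it could be one reason the two timings are so close, but I have not verified that against the actual query plan.

```python
import random


def shuffled_rows(rows, num_mappers, key_fn, map_side_aggr):
    """Count rows sent from mappers to reducers in a toy model."""
    # Split input rows across mappers round-robin (a stand-in for input splits).
    splits = [rows[i::num_mappers] for i in range(num_mappers)]
    total = 0
    for split in splits:
        if map_side_aggr:
            # With map-side aggregation, each mapper emits one partial
            # aggregate per distinct key it has seen.
            total += len({key_fn(r) for r in split})
        else:
            # Without it, every input row is shuffled unchanged.
            total += len(split)
    return total


def make_rows(n, num_emps, num_customers, seed=0):
    """Generate hypothetical (emp_id, customer_id) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(num_emps), rng.randrange(num_customers))
            for _ in range(n)]


rows = make_rows(n=100_000, num_emps=50, num_customers=20_000)


def by_emp(row):
    # Key for: select emp_id, count(1), sum(...) ... group by emp_id
    return row[0]


def by_emp_and_customer(row):
    # Assumed key when count(distinct customer_id) is added to the query.
    return (row[0], row[1])


plain_off = shuffled_rows(rows, 4, by_emp, map_side_aggr=False)
plain_on = shuffled_rows(rows, 4, by_emp, map_side_aggr=True)
distinct_on = shuffled_rows(rows, 4, by_emp_and_customer, map_side_aggr=True)

print("no map-side aggregation:         ", plain_off)
print("map-side aggr, group by emp_id:  ", plain_on)
print("map-side aggr, + count(distinct):", distinct_on)
```

In this model, the plain group by shrinks the shuffle to at most one row per employee per mapper, while the count(distinct) variant shuffles almost as many rows as having no map-side aggregation at all. Running the same query without the count(distinct) column, with hive.map.aggr on and off, would be a simple way to check whether this applies to your data.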
