Hi, In hive different types of joins are there like join, map join , bucket map join and etc.
Please take a look of these it may help you to optimize your query https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919 Hope It Helps, Chinna On Thu, Mar 27, 2014 at 4:24 AM, Srinivasan Ramaswamy <ursva...@gmail.com>wrote: > I have a join query where i am joining huge tables and i am trying to > optimize this hive query. > > INSERT OVERWRITE TABLE result > SELECT /*+ STREAMTABLE(product) */ > i.IMAGE_ID, > p.PRODUCT_NO, > p.STORE_NO, > p.PRODUCT_CAT_NO, > p.CAPTION, > p.PRODUCT_DESC, > p.IMAGE1_ID, > p.IMAGE2_ID, > s.STORE_ID, > s.STORE_NAME, > p.CREATE_DATE, > CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END, > CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END, > CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END > FROM image i > JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID > JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO > JOIN STORE s ON p.STORE_NO = s.STORE_NO > JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID > LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = > custImg.IMAGE_ID > LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = > custImg1.IMAGE_ID > LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = > custImg2.IMAGE_ID; > > Here are some facts about the tables > image table has 60 million rows > product table has 1 billion rows > product_cat has 1000 rows > store has 1m rows > stock_info has 100 rows > customizable_image has 200k rows > > a product can have one or two images (image1 and image2) and product level > information are stored only in product table. i tried moving the join with > product to the bottom but i couldnt as all other following joins require > data from the product table. > > Here is what i tried so far: > 1. I gave the hint to hive to stream product table as its the biggest one > 2. I bucketed the table (during create table of image and product) into > 256 buckets (on image_id) and then did the join - didnt give me any > significant performance gain > 3. changed the input format to sequence file from textfile(gzip files) , > so that it can be splittable and hence more mappers can be run if hive want > to run more mappers > > The query is still taking longer than 5 hours in Hive (running in aws with > 3 large nodes) where as in RDBMS it takes only 5 hrs. I need some help in > optimizing this query, so that it executes much faster. what else can i > try, does partitioning the table help in improving join performance ? > > This brings me to the question, "is Hive even the right choice (compared > to rdbms) for such complex joins" ? > > Thanks > Srini >