Re: optimize hive query for multitable join where one table is huge

Chinna Rao Lalam Fri, 28 Mar 2014 03:01:29 -0700

Hi,

In hive different types of joins are there like join, map join , bucket map
join and etc.


Please take a look of these it may help you to optimize your query

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919


Hope It Helps,
Chinna


On Thu, Mar 27, 2014 at 4:24 AM, Srinivasan Ramaswamy <ursva...@gmail.com>wrote:

> I have a join query where i am joining huge tables and i am trying to
> optimize this hive query.
>
>     INSERT OVERWRITE TABLE result
>     SELECT /*+ STREAMTABLE(product) */
>     i.IMAGE_ID,
>     p.PRODUCT_NO,
>     p.STORE_NO,
>     p.PRODUCT_CAT_NO,
>     p.CAPTION,
>     p.PRODUCT_DESC,
>     p.IMAGE1_ID,
>     p.IMAGE2_ID,
>     s.STORE_ID,
>     s.STORE_NAME,
>     p.CREATE_DATE,
>     CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
>     CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
>     CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
>     FROM image i
>     JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
>     JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
>     JOIN STORE s ON p.STORE_NO = s.STORE_NO
>     JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID =
> custImg.IMAGE_ID
>      LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID =
> custImg1.IMAGE_ID
>     LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID =
> custImg2.IMAGE_ID;
>
> Here are some facts about the tables
> image table has 60 million rows
> product table has 1 billion rows
> product_cat has 1000 rows
> store has 1m rows
> stock_info has 100 rows
> customizable_image has 200k rows
>
> a product can have one or two images (image1 and image2) and product level
> information are stored only in product table. i tried moving the join with
> product to the bottom but i couldnt as all other following joins require
> data from the product table.
>
> Here is what i tried so far:
> 1. I gave the hint to hive to stream product table as its the biggest one
> 2. I bucketed the table (during create table of image and product) into
> 256 buckets (on image_id) and then did the join - didnt give me any
> significant performance gain
> 3. changed the input format to sequence file from textfile(gzip files) ,
> so that it can be splittable and hence more mappers can be run if hive want
> to run more mappers
>
> The query is still taking longer than 5 hours in Hive (running in aws with
> 3 large nodes) where as in RDBMS it takes only 5 hrs. I need some help in
> optimizing this query, so that it executes much faster. what else can i
> try, does partitioning the table help in improving join performance ?
>
> This brings me to the question, "is Hive even the right choice (compared
> to rdbms) for such complex joins" ?
>
> Thanks
> Srini
>

Re: optimize hive query for multitable join where one table is huge

Reply via email to