Hey Yong,
Even without the GROUP BY (a pure cross join), the query uses only one
reducer. Explicitly setting more reducers doesn't help either:
set mapred.reduce.tasks=50;
SELECT id1,
       m.keyword,
       prep_kw.keyword
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN
     (SELECT keyword FROM et_keywords) prep_kw;
...
Hadoop job information for Stage-1: number of mappers: 3; number of
reducers: 1
What could be set up wrong here? Or can this ugly cross join be avoided
altogether? After all, my original problem is actually something else ;-)
Cheers
Wolli
2014-03-05 15:07 GMT+01:00 java8964 <[email protected]>:
> Hi, Wolli:
>
> A cross join doesn't mean Hive has to use one reducer.
>
> From the query's point of view, the following cases will use one reducer:
>
> 1) ORDER BY in your query (instead of SORT BY)
> 2) Only one reducer group, which means all the data has to be sent to
> one reducer.
>
> In your case, the distinct count of id1 will be the number of reducer
> groups. Did you explicitly set the reducer count in your Hive session?
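>
> A rough, untested sketch of case 1, using the import1 table from this
> thread. The queries are only an illustration, not a fix for the cross
> join itself:
>
> ```sql
> -- ORDER BY asks for a total order over the whole result, so the final
> -- stage runs on a single reducer.
> SELECT id1, keyword FROM import1 ORDER BY id1;
>
> -- SORT BY only sorts the rows within each reducer. DISTRIBUTE BY picks
> -- the key that decides which reducer a row goes to, so rows with the
> -- same id1 land together and several reducers can share the work.
> SELECT id1, keyword FROM import1 DISTRIBUTE BY id1 SORT BY id1;
> ```
>
> The second form gives up the global ordering in exchange for
> parallelism, so it only helps when a per-reducer order is enough.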
>
> Yong
>
> ------------------------------
> Date: Wed, 5 Mar 2014 14:17:24 +0100
> Subject: Best way to avoid cross join
> From: [email protected]
> To: [email protected]
>
> Hey everyone,
>
> Before I write a lot of text, I'll just post something that is already
> written:
> http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx
>
> The first post addresses a problem very similar to mine. Currently
> my implementation looks like this:
>
> SELECT id1,
>        MAX(
>          CASE
>            WHEN m.keyword IS NULL
>              THEN 0
>            WHEN instr(m.keyword, prep_kw.keyword) > 0
>              THEN 1
>            ELSE 0
>          END) AS flag
> FROM (SELECT id1, keyword FROM import1) m
> CROSS JOIN
>      (SELECT keyword FROM et_keywords) prep_kw
> GROUP BY id1;
>
> Since there is a cross join involved, the execution gets pinned down to 1
> reducer only and it takes ages to complete.
>
> The thread I linked solves this with some SQL Server-specific tricks.
> But I was wondering whether anybody has already encountered this problem
> in Hive and found a better way to solve it.
>
> I'm using Hive 0.11 on a MapR distribution, in case that matters.
>
> Cheers
> Wolli
>