In this case it's data skew: out of all the keys, a few have many more 
records than the rest. What extra information do you need? 


-----Original Message-----
From: Gopal Vijayaraghavan [mailto:[email protected]] On Behalf Of Gopal 
Vijayaraghavan
Sent: Tuesday, January 5, 2016 5:36 PM
To: [email protected]
Cc: Kiran Kolli <[email protected]>
Subject: Re: TEZ reducer skew


>Looking at the counters, REDUCE_INPUT_GROUPS is approximately the same 
>across reducer tasks, but REDUCE_INPUT_RECORDS for the skewed tasks is 
>about 180 times higher than for the others. How do I avoid skew in the reducers?

That really depends. Is the skew a representation of the input, or is it 
artificially introduced by the query plan?

If your input is skewed (as in, user='' is effectively the same as user=null), 
then you can occasionally write query fragments which remove such skew before 
shuffling.
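
For example (a minimal sketch; clicks(user_id) is a hypothetical table, not 
from this thread), if most of the skew sits under a garbage key such as the 
empty/null user, dropping that key before the group-by keeps it off a single 
reducer; the filtered-out rows can be handled in a separate branch and 
combined with UNION ALL if they still matter:

  -- keep the garbage key out of the shuffle
  select user_id, count(*) as cnt
  from clicks
  where user_id is not null and user_id <> ''
  group by user_id;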

Occasionally, a user will write an incorrect query which produces this as well. 
For instance,

select sum(sales) from txns tx, (select a.type from accounts a where 
a.account_date = '2015-12-25') t where tx.type = t.type;

is a skewed query accidentally written by a user.
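
One plausible reading of the intent here (a guess on my part) is "sum sales 
for transaction types that had account activity on that date". The inner 
query can emit the same type many times, so the join fans out and the 
shuffle lands almost everything on a few reducers. Deduplicating the join 
key removes the accidental row multiplication, and a small distinct key set 
also gives the optimizer a chance to convert the shuffle join into a map join:

  select sum(tx.sales)
  from txns tx
  join (select distinct a.type
        from accounts a
        where a.account_date = '2015-12-25') t
    on tx.type = t.type;

A LEFT SEMI JOIN or an IN subquery expresses the same thing in Hive.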

If the skew is really in the input, Tez (or map-reduce/spark) cannot actually 
redistribute a skewed key arbitrarily without knowing the semantics of 
redistribution in the higher level planner.

This problem has many workarounds, but each of them is specific to one 
scenario and does not transfer to another - so please elaborate.
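
For what it's worth, one common workaround - valid only when the skewed 
operation is an associative aggregate such as sum or count - is to salt the 
key, partially aggregate, and then combine. A minimal sketch, again with a 
hypothetical clicks(user_id) table:

  -- stage 1: spread each hot key over 32 salted groups
  -- stage 2: combine the partial counts back per key
  -- note: rand() is non-deterministic across task retries
  select user_id, sum(partial_cnt) as cnt
  from (
    select user_id, salt, count(*) as partial_cnt
    from (select user_id, cast(rand() * 32 as int) as salt
          from clicks) salted
    group by user_id, salt
  ) partials
  group by user_id;

Hive's hive.groupby.skewindata setting does a similar two-stage group-by 
automatically, and hive.optimize.skewjoin exists for skewed joins; whether 
either applies depends on the actual query.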

Cheers,
Gopal

