> Looking at the counters REDUCE_INPUT_GROUPS are almost approximately same
> across reducer tasks. But REDUCE_INPUT_RECORDS of the skewed tasks are
> like 180 times more than others. How to avoid skew to reducers.

That really depends. Is the skew a reflection of the input, or is it an artificial skew introduced by the query plan?

If your input is skewed (as in, user='' is treated the same as user=null), then you can sometimes write query fragments which remove such skews before shuffling.

Occasionally, a user will write an incorrect query which produces this as well. For instance,

    select sum(sales) from txns tx, (select a.type from accounts a where a.account_date = '2015-12-25') t where tx.type = t.type;

is a skewed query accidentally written by a user, since the subquery can return the same type many times over.

If the skew is really in the input, Tez (or map-reduce/spark) cannot redistribute a skewed key arbitrarily without knowing the semantics of redistribution in the higher level planner.

This problem has many workarounds, but each one is specific to its scenario and none applies generally, so please elaborate on yours.

Cheers,
Gopal
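
PS: a rough sketch of the two cases above, in case it helps. Treat both as assumptions about intent rather than a recommended fix. The first guesses that the example query only wanted sales for the account types seen on that date, so the subquery is deduplicated. The second uses made-up names (`events`, `users`, `user_id`) and assumes that blank and NULL ids both mean "no user" and should never match anything.

```sql
-- Sketch 1 (assumed intent): deduplicate the subquery so each type appears
-- once on the join side instead of once per matching account row.
SELECT SUM(tx.sales)
FROM txns tx
JOIN (
  SELECT DISTINCT a.type
  FROM accounts a
  WHERE a.account_date = '2015-12-25'
) t ON tx.type = t.type;

-- Sketch 2 (hypothetical tables and columns): drop blank/NULL keys before
-- the join, so they are not all shuffled to the same reducer.
SELECT u.user_id, COUNT(*) AS n
FROM (
  SELECT ev.user_id
  FROM events ev
  WHERE ev.user_id IS NOT NULL
    AND ev.user_id <> ''
) e
JOIN users u ON e.user_id = u.user_id
GROUP BY u.user_id;
```

Neither helps if one legitimate key in txns really does carry most of the rows; that is the case where the right workaround depends on the details.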
