We do some processing in Hadoop, and as the last step we write the result to a database. The database is not good at handling hundreds of concurrent connections and fast writes, so we need to throttle the number of tasks that write to it. Since we have no control over the number of mappers, we add an artificial reducer step to achieve that, using either a GROUP or an ORDER, like this:
sorted_data = ORDER data BY f1 PARALLEL 10;
-- then write sorted_data to DB

or

grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE $1;

I feel neither is a good approach. They just add unnecessary computing time, especially the first one, and the GROUP may produce bags that are too large. Any better suggestions?
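For context, here is a minimal sketch of the full script using the GROUP variant, assuming the Piggybank DBStorage store function and a JDBC-backed table; the input path, schema, connection details, and table name are placeholders:

REGISTER piggybank.jar;

data = LOAD 'input/part-*' AS (f1:chararray, f2:int);

-- artificial reduce step: caps the number of concurrent DB writers at 10
grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE FLATTEN($1);

-- each of the 10 reducers opens one JDBC connection
-- (DBStorage ignores the INTO location; it writes via the INSERT statement)
STORE data_to_write INTO 'ignored' USING
    org.apache.pig.piggybank.storage.DBStorage(
        'com.mysql.jdbc.Driver',
        'jdbc:mysql://dbhost/mydb', 'user', 'pass',
        'INSERT INTO results (f1, f2) VALUES (?, ?)');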
