We use a bulk insert technique after the job completes. You can control the
size of each bulk insert by controlling the number of reducers.
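
For example, a minimal sketch of the idea (the aliases, the staging path, and
the final bulk-load command are made up for illustration, not our exact setup):
set PARALLEL on whatever reduce-side step the job already has, store the output
to HDFS, and bulk-load each part file after the job finishes.

aggregated = GROUP raw_data BY f1 PARALLEL 10;
results = FOREACH aggregated GENERATE group, COUNT(raw_data);
-- 10 reducers produce 10 part files under this path
STORE results INTO '/tmp/bulk_load_staging' USING PigStorage('\t');
-- after the job completes, load each part file into the database in one bulk
-- operation (e.g. MySQL's LOAD DATA INFILE or PostgreSQL's COPY) instead of
-- opening DB connections from the tasks themselves

Each reducer's output becomes one bulk insert, so raising or lowering PARALLEL
directly changes the batch size and how many loads run against the database.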

On Mar 17, 2011, at 2:03 PM, Dexin Wang <[email protected]> wrote:

> We do some processing in Hadoop and then, as the last step, write the result
> to a database. The database is not good at handling hundreds of concurrent
> connections and fast writes, so we need to throttle the number of tasks
> that write to the DB. Since we have no control over the number of mappers,
> we add an artificial reducer step to achieve that, either with GROUP or
> ORDER, like this:
> 
> sorted_data = ORDER data BY f1 PARALLEL 10;
> -- then write sorted_data to DB
> 
> or
> 
> grouped_data = GROUP data BY f1 PARALLEL 10;
> data_to_write = FOREACH grouped_data GENERATE FLATTEN($1);
> 
> I feel neither is a good approach. They just add unnecessary computing time,
> especially the first one, and GROUP may run into the too-large-bags issue.
> 
> Any better suggestions?
