We do some processing in Hadoop, and as the last step we write the result to a database. The database is not good at handling hundreds of concurrent connections and fast writes, so we need to throttle the number of tasks that write to it. Since we have no control over the number of mappers, we add an artificial reducer step to achieve that, using either a GROUP or an ORDER, like this:
sorted_data = ORDER data BY f1 PARALLEL 10;
-- then write sorted_data to DB

or

grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE $1;

I feel neither is a good approach. They just add unnecessary computing time, especially the first one, and the GROUP may produce bags that are too large. Any better suggestions?
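For context, here is a minimal sketch of the full script using the GROUP variant, assuming the Piggybank DBStorage store function and a JDBC-backed table; the input path, schema, connection details, and table name are placeholders:

REGISTER piggybank.jar;

data = LOAD 'input/part-*' AS (f1:chararray, f2:int);

-- artificial reduce step: caps the number of concurrent DB writers at 10
grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE FLATTEN($1);

-- each of the 10 reducers opens one JDBC connection
-- (DBStorage ignores the INTO location; it writes via the INSERT statement)
STORE data_to_write INTO 'ignored' USING
    org.apache.pig.piggybank.storage.DBStorage(
        'com.mysql.jdbc.Driver',
        'jdbc:mysql://dbhost/mydb', 'user', 'pass',
        'INSERT INTO results (f1, f2) VALUES (?, ?)');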
