There be dragons, but in years past I solved a similar problem with the 
MultiThreadedMapper [1], and it would be possible to do something similar in a 
DoFn implementation. Basically, you can read multiple inputs and farm them 
off to threads, then synchronize and flush after N items are processed and do a 
final flush to the emitter in the cleanup(…) method.

There are lots of pitfalls to managing your own threads, of course. You’d need 
to detach incoming values passed to the DoFn so they don’t get clobbered by 
other threads, it could fight against Hadoop’s resource management (since 
Hadoop wants to manage how many threads are running), and writing 
multi-threaded code is pretty terrible in general. But it’s an option at least.
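
To make that concrete, here is a rough sketch of the buffering logic in plain Java. It is not Crunch code and I have not run it inside a real job: ThreadedBatcher is a made-up helper, the Consumer stands in for the Crunch Emitter, and the "detach" function stands in for whatever you use to copy an incoming value before another thread sees it. The idea is that your DoFn would call process(...) from its own process method and cleanup(...) from its cleanup method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical helper (not part of Crunch): runs the expensive function on a
// thread pool and emits results only from the calling thread.
class ThreadedBatcher<S, T> {
    private final ExecutorService pool;
    private final int batchSize;            // the "N" after which we synchronize and flush
    private final UnaryOperator<S> detach;  // copies an input so the caller may reuse its object
    private final Function<S, T> work;      // the CPU-intensive computation
    private final List<Future<T>> pending = new ArrayList<>();

    ThreadedBatcher(int threads, int batchSize,
                    UnaryOperator<S> detach, Function<S, T> work) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.batchSize = batchSize;
        this.detach = detach;
        this.work = work;
    }

    // Hand a detached copy of the input to a worker thread.
    void process(S input, Consumer<T> emitter) {
        final S copy = detach.apply(input);
        Callable<T> task = () -> work.apply(copy);
        pending.add(pool.submit(task));
        if (pending.size() >= batchSize) {
            flush(emitter);
        }
    }

    // Wait for all outstanding tasks and emit their results in submission order.
    void flush(Consumer<T> emitter) {
        try {
            for (Future<T> f : pending) {
                emitter.accept(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pending.clear();
        }
    }

    // Final flush, then stop the worker threads.
    void cleanup(Consumer<T> emitter) {
        try {
            flush(emitter);
        } finally {
            pool.shutdown();
        }
    }
}
```

Emitting only from the calling thread avoids having to assume the emitter is thread-safe, at the cost of stalling the input while each batch finishes.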

[1]
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.html

On Sep 25, 2014, at 11:03 PM, Allan Shoup 
<[email protected]<mailto:[email protected]>> wrote:

I failed to mention that I don't have an opportunity to read the source - 
my input is a PTable of Avro keys and values.

On Thu, Sep 25, 2014 at 8:48 PM, Josh Wills 
<[email protected]<mailto:[email protected]>> wrote:
NLineSource, to control how many shards the small input is split up into?

J

On Thu, Sep 25, 2014 at 6:10 PM, Allan Shoup 
<[email protected]<mailto:[email protected]>> wrote:
I have a very cpu-intensive DoFn which is running over a relatively small input. 
Running on a Hadoop cluster, the job that it is run in sometimes executes the 
function in map tasks and sometimes in reduce tasks. What's the best way to 
reliably increase parallelization?

One option may be to force a reduce step and control the number of reducers. 
Are there any better options?


