Re: performance impact of batching emit(...)

Josh Wills Thu, 09 Jan 2014 15:38:20 -0800

Hey Leen,

I don't have a better idea than trial and error at this point, since the
best choice of flushEvery would depend on a combination of how much memory
is available to the tasks, how large the cached objects are, and a rough
estimate of how many unique elements there are in the data set. It's the
sort of thing that our much-discussed-but-not-implemented-yet framework for
tracking stats on runtime metrics for optimizing pipelines should track.
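
To make the trade-off concrete, here is a minimal sketch of the batching pattern the question describes. It is not the Crunch source: the class FlushEverySketch, the helper countEmitted, and the use of java.util.function.Consumer as a stand-in emitter are all made up for illustration. Elements are cached in a set, and once the set holds more than flushEvery entries they are emitted one by one in a loop and the set is cleared.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a map-side pre-distinct step; NOT the Crunch code.
class FlushEverySketch<T> {
  private final Set<T> cache = new HashSet<>();
  private final int flushEvery;      // max unique elements held before a flush
  private final Consumer<T> emitter; // assumed stand-in for an emitter

  FlushEverySketch(int flushEvery, Consumer<T> emitter) {
    this.flushEvery = flushEvery;
    this.emitter = emitter;
  }

  // Cache the element; duplicates of cached elements cost nothing downstream.
  void process(T input) {
    cache.add(input);
    if (cache.size() > flushEvery) {
      flush();
    }
  }

  // Emit every cached element in a loop, then forget them all.
  void flush() {
    for (T value : cache) {
      emitter.accept(value);
    }
    cache.clear();
  }

  // Helper for experiments: how many records reach the emitter in total?
  static <E> int countEmitted(List<E> inputs, int flushEvery) {
    List<E> out = new ArrayList<>();
    FlushEverySketch<E> fn = new FlushEverySketch<>(flushEvery, out::add);
    for (E in : inputs) {
      fn.process(in);
    }
    fn.flush(); // final flush, as a cleanup step would do
    return out.size();
  }
}
```

In this sketch, feeding the six records a, b, a, b, a, b through with flushEvery = 1 emits all six, because the cache is cleared before any duplicate can be caught. With flushEvery = 2 only two records are emitted. A larger value removes more duplicates before the shuffle but holds more objects in memory, which is why the best setting depends on the number of unique elements and on their size.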


J


On Thu, Jan 9, 2014 at 1:30 PM, Leen Toelen <[email protected]> wrote:

> Hi,
>
> when looking at PreDistinct I notice that calls to emitter.emit(...) are
> stored in memory until more than 'flushEvery' records are found. How does
> this batching impact performance, since the calls to emit(...) are not
> batched in the cleanup method but called in a loop?
>
> Is there an easy way to find the best size for 'flushEvery' other than trial
> and error?
>
> Best regards,
> Leen
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
