OK, thanks.
On Fri, Jan 10, 2014 at 12:37 AM, Josh Wills <[email protected]> wrote:

> Hey Leen,
>
> I don't have a better idea than trial and error at this point, since the
> best choice of flushEvery would depend on a combination of how much memory
> is available to the tasks, how large the cached objects are, and a rough
> estimate of how many unique elements there are in the data set. It's the
> sort of thing that our much-discussed-but-not-implemented-yet framework for
> tracking stats on runtime metrics for optimizing pipelines should track.
>
> J
>
> On Thu, Jan 9, 2014 at 1:30 PM, Leen Toelen <[email protected]> wrote:
>
>> Hi,
>>
>> When looking at PreDistinct I notice that calls to emitter.emit(...) are
>> stored in memory until more than 'flushEvery' records are found. How does
>> this batching impact performance, since the calls to emit(...) are not
>> batched in the cleanup method but called in a loop?
>>
>> Is there an easy way to find the best size for 'flushEvery' other than
>> trial and error?
>>
>> Best regards,
>> Leen
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
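
For anyone reading this thread later, here is a rough sketch of the trade-off discussed above. It is a plain Python simulation, not Crunch code: the function names `pre_distinct` and `count_emitted` are made up for illustration, and the flush condition (flush once the cache holds more than `flush_every` unique records) is an assumption based on the description in the question. It only shows why a larger cache removes more duplicates before the shuffle, at the cost of holding more objects in memory.

```python
from typing import Callable, Hashable, Iterable


def pre_distinct(records: Iterable[Hashable],
                 emit: Callable[[Hashable], None],
                 flush_every: int) -> None:
    """Hypothetical map-side pre-distinct: cache unique records, flush in batches."""
    cache = set()
    for record in records:
        cache.add(record)  # duplicates of a cached record are dropped here
        if len(cache) > flush_every:
            # Flushing still emits one record at a time, in a loop.
            for item in cache:
                emit(item)
            cache.clear()
    # Equivalent of the cleanup step: emit whatever is still cached.
    for item in cache:
        emit(item)


def count_emitted(records, flush_every: int) -> int:
    """Return how many records would be passed downstream for a given flush_every."""
    out = []
    pre_distinct(records, out.append, flush_every)
    return len(out)


if __name__ == "__main__":
    data = [1, 1, 2, 2, 3, 3, 1, 1]
    for size in (0, 2, 10):
        print(size, count_emitted(data, size))
```

With a cache that is too small, repeated values slip through because the cache is cleared before the repeats arrive; once the cache is large enough to hold every unique value seen by the task, each value is emitted only once. Running a sweep like this over a sample of real data is one way to make the trial and error a bit more systematic.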
