OK, I missed Aggregate.top() (I guess my research wasn't thorough enough). I'll go with the framework's built-in function; it seems cleaner than going through the Context.
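
For anyone reading the archive, here is a minimal sketch in plain Java (no Crunch or Hadoop dependency) of the per-reducer selection described in the original question quoted below: each reduce task keeps roughly "n / number of reduce tasks" of its highest-scoring URLs. The names PerReducerTopN, ScoredUrl, perReducerLimit and topN are hypothetical, and the integer division is an assumption about how the quota gets rounded. Inside a Crunch DoFn the reducer count would presumably come from getContext().getNumReduceTasks(), as Josh describes; with Aggregate.top() none of this should be needed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class PerReducerTopN {

    /** Hypothetical value type: a URL together with its score. */
    static final class ScoredUrl {
        final String url;
        final double score;

        ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Orders entries by ascending score, so the heap root is the weakest entry kept. */
    private static final Comparator<ScoredUrl> BY_SCORE =
            Comparator.comparingDouble((ScoredUrl s) -> s.score);

    /** Quota for one reduce task: n / numReduceTasks (integer division is an assumption). */
    static int perReducerLimit(int n, int numReduceTasks) {
        if (numReduceTasks <= 0) {
            throw new IllegalArgumentException("numReduceTasks must be positive");
        }
        return n / numReduceTasks;
    }

    /** Returns the `limit` highest-scoring entries of one partition, best first. */
    static List<ScoredUrl> topN(List<ScoredUrl> partition, int limit) {
        if (limit <= 0) {
            return new ArrayList<>();
        }
        // Min-heap bounded to `limit` entries: evict the lowest score on overflow.
        PriorityQueue<ScoredUrl> heap = new PriorityQueue<>(limit, BY_SCORE);
        for (ScoredUrl candidate : partition) {
            heap.offer(candidate);
            if (heap.size() > limit) {
                heap.poll();
            }
        }
        List<ScoredUrl> result = new ArrayList<>(heap);
        result.sort(BY_SCORE.reversed());
        return result;
    }
}
```

The union of the per-reducer results only approximates the global top n, and only when the shuffle spreads high-scoring URLs evenly across reducers. That is one more reason to prefer the built-in function.
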
Thanks a lot for your answers!

Vincent

On Sun, May 3, 2015 at 8:11 AM, Josh Wills <[email protected]> wrote:

> Hey Vincent,
>
> Yeah, you can get at it. Each DoFn inherits a protected getContext()
> method that has the getNumReduceTasks() method defined on it, just like it
> does in the Nutch code you cited. We try (with varying degrees of success)
> to make the underlying MR framework as accessible as possible.
>
> J
>
> On Sun, May 3, 2015 at 2:16 AM, David Ortiz <[email protected]> wrote:
>
>> Do you actually care about the number of reducers, or just get top n from
>> a table? The latter is built into the framework.
>>
>> On Sat, May 2, 2015, 6:12 PM Vincent Fabro <[email protected]>
>> wrote:
>>
>>> Dear all
>>>
>>> Is it possible to access the number of reducer tasks from Crunch
>>> (something equivalent to context.getNumReduceTasks() in Hadoop)?
>>>
>>> Context: I'm porting Nutch to Crunch. One operation (in
>>> GeneratorJob.java, GeneratorMapper.java and GeneratorReducer.java -
>>> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java)
>>> takes the n top urls according to a score. If I understand well, "n/num of
>>> reduce tasks" urls are selected for each reduce task (GeneratorReducer,
>>> line 102). If there's a good shuffle, the result is good enough.
>>>
>>> Thanks in advance!
>>>
>>> Vincent
>>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
