On Wed, Sep 25, 2013 at 2:36 PM, Josh Wills <[email protected]> wrote:
> FWIW, what I usually do in these situations (and they seem to come up a
> lot for machine learning projects) is use a combiner with a post-processing
> reducer that has a different signature. Chao's case is a little different
> because the DoFn needs to know whether it's in the combiner or the reducer
> context, but the Crunch framework knows this via the NodeContext, so there
> must be a way to communicate this to the CombineFn. If there isn't, we
> should make a change to expose it.
>

That sounds like it would be pretty handy -- I remember someone else on the
list asking about a similar thing a few months ago as well.

> For this example, the output of both my Combiner and my Reducer would be a
> Collection<Integer>, and if I was in the combiner case, I would emit just a
> single Integer to that collection (the max from that combiner), and if I
> was in the reducer context, I would emit the entire Iterable<Integer> as a
> Collection<Integer>. Then I would have a post-processing MapFn that would
> take the values from the Collection<Integer> and join them to a string.
>

I think that's along the same kind of line that I was going with, but if
I'm understanding the issue correctly then there shouldn't even be a need
to know if you're in the reducer or combiner if you're working with
Collection<Integer>.

I think that the combiner would be outputting the top-k entries, and not
just the top-1 entry, so both the combiner and the reducer have the same
logic, and can be the same class (although this necessitates converting the
PTable<K, V> to PTable<K, Collection<V>> at the start).

- Gabriel

>
>
> On Wed, Sep 25, 2013 at 2:58 AM, Chao Shi <[email protected]> wrote:
>
>> Yes. It was a typo. I meant PTable#combineValues.
>>
>>
>> 2013/9/25 Gabriel Reid <[email protected]>
>>
>>> Hi Chao,
>>>
>>>
>>>> Your approach is tricky. I agree that this kind of MR logic is pretty
>>>> common. So it would be nice to add such a feature to crunch. At first
>>>> glance, I think the problem in PTable#collectValues is that it returns a
>>>> PTable rather than a PGroupedTable (I haven't checked the internal logic
>>>> yet).
>>>>
>>>>
>>> I think that PTable#collectValues is for a different kind of use case --
>>> internally it just does a groupByKey and then puts all the values in a
>>> single collection for each key, so I'm not sure how it would apply here. Or
>>> did you mean the combineValues method?
>>>
>>> - Gabriel
>>>
>>
>>
>
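To make the "same logic in combiner and reducer" point from the reply above concrete, here is a rough sketch in plain Java. It deliberately does not use any Crunch classes: TopKMerge, merge, joinToString and the sample values are all made up for illustration, and the combiner and reducer calls are only simulated in main. The assumption is that every value has already been wrapped in a single-element Collection<Integer>, so that the merge step takes collections in and gives a collection out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Crunch: the same merge is used for the
// simulated combiner step and the simulated reducer step.
class TopKMerge {

    // Merge any number of partial collections into the k largest values,
    // largest first. Input and output have the same shape, so the result
    // of one merge can be fed into another merge.
    static List<Integer> merge(Iterable<? extends Collection<Integer>> partials, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (Collection<Integer> partial : partials) {
            for (Integer value : partial) {
                heap.add(value);
                if (heap.size() > k) {
                    heap.poll(); // drop the current smallest
                }
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    // Stand-in for the post-processing step that joins the values to a string.
    static String joinToString(Collection<Integer> values) {
        return values.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        int k = 2;
        // Values for one key, split across two simulated map tasks, each
        // value already wrapped in a single-element collection.
        List<List<Integer>> splitA = Arrays.asList(
                Arrays.asList(3), Arrays.asList(9), Arrays.asList(1));
        List<List<Integer>> splitB = Arrays.asList(
                Arrays.asList(7), Arrays.asList(2), Arrays.asList(8));

        // "Combiner": runs once per split and emits a partial top-k.
        List<Integer> partialA = merge(splitA, k);
        List<Integer> partialB = merge(splitB, k);

        // "Reducer": the same merge, applied to the partial results.
        List<Integer> top = merge(Arrays.asList(partialA, partialB), k);

        System.out.println(joinToString(top)); // prints "9,8"
    }
}
```

Because the merge never needs to know which phase it is running in, it also gives the same answer if the combiner step is skipped or applied more than once.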
