Hey Chao/Gabriel,

You two seem to be agreeing, which makes me think I misread Chao's initial problem specification. :) In any case, it seems like the PTable<K, Collection<V>> approach will do what you want here, which makes me happy.
J

On Wed, Sep 25, 2013 at 6:32 AM, Chao Shi <[email protected]> wrote:

> Hi Josh,
>
> I don't quite understand your second paragraph. Did you mean Gabriel's
> approach? As a reducer reads output from a combiner, this requires that it
> read PType<String, Collection<Integer>>. In fact, with this approach, I
> don't think the CombineFn needs to tell whether it is run in combiner or
> reducer context: it simply emits the top K values. If there is not much
> overhead in using the singleton collection, I think this approach would
> perfectly fit Crunch's model.
>
>
> 2013/9/25 Josh Wills <[email protected]>
>
>> FWIW, what I usually do in these situations (and they seem to come up a
>> lot for machine learning projects) is use a combiner with a post-processing
>> reducer that has a different signature. Chao's case is a little different
>> because the DoFn needs to know whether it's in the combiner or the reducer
>> context, but the Crunch framework knows this via the NodeContext, so there
>> must be a way to communicate this to the CombineFn. If there isn't, we
>> should make a change to expose it.
>>
>> For this example, the output of both my Combiner and my Reducer would be
>> a Collection<Integer>, and if I was in the combiner case, I would emit just
>> a single Integer to that collection (the max from that combiner), and if I
>> was in the reducer context, I would emit the entire Iterable<Integer> as a
>> Collection<Integer>. Then I would have a post-processing MapFn that would
>> take the values from the Collection<Integer> and join them to a string.
>>
>>
>> On Wed, Sep 25, 2013 at 2:58 AM, Chao Shi <[email protected]> wrote:
>>
>>> Yes. It was a typo. I meant PTable#combineValues.
>>>
>>>
>>> 2013/9/25 Gabriel Reid <[email protected]>
>>>
>>>> Hi Chao,
>>>>
>>>>
>>>>> Your approach is tricky. I agree that this kind of MR logic is pretty
>>>>> common. So it would be nice to add such a feature to Crunch. At first
>>>>> glance, I think the problem in PTable#collectValues is that it returns a
>>>>> PTable rather than a PGroupedTable (I haven't checked the internal logic
>>>>> yet).
>>>>>
>>>>>
>>>> I think that PTable#collectValues is for a different kind of use case
>>>> -- internally it just does a groupByKey and then puts all the values in a
>>>> single collection for each key, so I'm not sure how it would apply here. Or
>>>> did you mean the combineValues method?
>>>>
>>>> - Gabriel
>>>>
>>>
>>>
>>
>
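For reference, here is a minimal sketch of the idea discussed in this thread: the same merge step runs in both the combiner and the reducer, and both consume and emit a Collection<Integer>, so the step never needs to know which context it is in. It is plain Java and deliberately does not use any Crunch classes; the names TopKSketch, topK and join are hypothetical, and the singleton-collection wrapping on the map side is an assumption about how the values would be prepared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of the Crunch API.
class TopKSketch {

    // Merges partial collections for one key and keeps the k largest values.
    // Input and output share the same type, so this step can be applied
    // once per combiner and again in the reducer.
    static Collection<Integer> topK(Iterable<? extends Collection<Integer>> partials, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (Collection<Integer> partial : partials) {
            for (Integer value : partial) {
                heap.add(value);
                if (heap.size() > k) {
                    heap.poll(); // drop the smallest so at most k values remain
                }
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // largest first
        return result;
    }

    // Post-processing step: joins the surviving values into one string.
    static String join(Collection<Integer> values) {
        return values.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        int k = 2;
        // Assumed map-side output: every value wrapped in a singleton collection.
        List<List<Integer>> split1 = Arrays.asList(
                Collections.singletonList(5),
                Collections.singletonList(1),
                Collections.singletonList(9));
        List<List<Integer>> split2 = Arrays.asList(
                Collections.singletonList(7),
                Collections.singletonList(3));

        // "Combiner" pass, once per split.
        Collection<Integer> combined1 = topK(split1, k);
        Collection<Integer> combined2 = topK(split2, k);

        // "Reducer" pass over the combiner outputs, using the same function.
        Collection<Integer> reduced = topK(Arrays.asList(combined1, combined2), k);

        System.out.println(join(reduced)); // prints 9,7
    }
}
```

Because the merge is applied to its own output, it gives the same result whether the combiner runs zero, one, or several times, which is the property a combiner needs.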
