FWIW, what I usually do in these situations (and they seem to come up a lot for machine learning projects) is use a combiner with a post-processing reducer that has a different signature. Chao's case is a little different because the DoFn needs to know whether it's running in the combiner or the reducer context. The Crunch framework already knows this via the NodeContext, so there must be a way to communicate it to the CombineFn. If there isn't, we should make a change to expose it.
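Here is a rough sketch of that shape in plain Java, with no Crunch types at all. The `inCombiner` flag is only a stand-in for whatever the framework would expose from the NodeContext, and the class name, method names and the comma separator are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class CombineThenJoinSketch {

    // One function used for both the combine and the reduce step. Input and
    // output are both Collection<Integer>, so the signature fits either slot.
    // `inCombiner` is a hypothetical flag, not a real Crunch parameter.
    static Collection<Integer> combineOrReduce(
            Iterable<? extends Collection<Integer>> values, boolean inCombiner) {
        List<Integer> all = new ArrayList<>();
        for (Collection<Integer> c : values) {
            all.addAll(c);
        }
        if (inCombiner && !all.isEmpty()) {
            // Combiner context: keep only the max seen in this partition.
            return Collections.singletonList(Collections.max(all));
        }
        // Reducer context: pass everything through untouched.
        return all;
    }

    // Post-processing step with a different output type (the MapFn's job).
    // The "," separator is an assumption.
    static String joinValues(Collection<Integer> values) {
        return values.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Two map-side partitions for the same key, each value wrapped in a
        // single-element collection so the types line up.
        Collection<Integer> p1 = combineOrReduce(Arrays.asList(
                Collections.singletonList(3),
                Collections.singletonList(7),
                Collections.singletonList(5)), true);
        Collection<Integer> p2 = combineOrReduce(Arrays.asList(
                Collections.singletonList(4),
                Collections.singletonList(9)), true);

        // The reducer sees one collection per combiner output.
        Collection<Integer> reduced = combineOrReduce(Arrays.asList(p1, p2), false);

        System.out.println(joinValues(reduced)); // prints "7,9"
    }
}
```

Because both steps share the same Collection<Integer> signature, the only thing that differs between them is the flag; the change of type to String is pushed into the separate post-processing step.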
For this example, the output of both my Combiner and my Reducer would be a Collection<Integer>. If I was in the combiner context, I would emit just a single Integer to that collection (the max from that combiner), and if I was in the reducer context, I would emit the entire Iterable<Integer> as a Collection<Integer>. Then I would have a post-processing MapFn that would take the values from the Collection<Integer> and join them into a string.

On Wed, Sep 25, 2013 at 2:58 AM, Chao Shi <[email protected]> wrote:

> Yes. It was a typo. I mean PTable#combineValues.
>
>
> 2013/9/25 Gabriel Reid <[email protected]>
>
>> Hi Chao,
>>
>>
>>> Your approach is tricky. I agree that this kind of MR logic is pretty
>>> common. So it would be nice to add such feature to crunch. At the first
>>> glance, I think the problem in PTable#collectValues is that it return a
>>> PTable rather than a PGroupedTable (I haven't check the internal logic yet).
>>>
>>>
>> I think that PTable#collectValues is for a different kind of use case --
>> internally it just does a groupByKey and then puts all the values in a
>> single collection for each key, so I'm not sure how it would apply here. Or
>> did you mean the combineValues method?
>>
>> - Gabriel
>>
>
>
