On Sat, Apr 6, 2013 at 11:41 AM, Gabriel Reid <[email protected]> wrote:
>
> On Sat, Apr 6, 2013 at 8:28 PM, Josh Wills <[email protected]> wrote:
>
>> We could also try caching/spilling the contents of the Iterable so that
>> it could actually be used more than once. I'm wondering if we could detect
>> that multiple clients were calling the same groupByKey() output and
>> automatically swap out the Iterable for one that cached the results.
>>
> Yeah, that's definitely an option -- but are we talking about two
> different issues here? The issue that Chad brought up is the ability for a
> single DoFn to iterate over an iterable of values multiple times, while (as
> far as I understand) you're talking about having multiple DoFns running on
> reducer input, right?
>
> For the first case, it seems acceptable to me to just enforce that the
> iterable can only be iterated over once. For the second case, I think it
> could definitely be interesting to try to do what you're talking about (if
> I'm correctly understanding what you were suggesting :-))
>

A very good point -- I misread Chad's email. Will open up a separate JIRA for
the caching idea.

> - Gabriel
>
>> On Fri, Apr 5, 2013 at 12:40 PM, Gabriel Reid <[email protected]> wrote:
>>
>>> Hi Chad,
>>>
>>> Good point -- I know that this has tripped people up in the past. I
>>> think that definitely documenting this and possibly enforcing it sounds
>>> like a good idea -- I've logged a ticket in JIRA (with the content of your
>>> mail), see https://issues.apache.org/jira/browse/CRUNCH-192
>>>
>>> - Gabriel
>>>
>>> On 05 Apr 2013, at 21:30, Chad Urso McDaniel <[email protected]> wrote:
>>>
>>> > BLUF: The Iterable parameter to CombineFn.process implies you can
>>> > iterate multiple times when you cannot and this leads to surprising
>>> > behavior.
>>> >
>>> > As many of you probably know, the signature of CombineFn.process is
>>> > ---
>>> > process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
>>> > ---
>>> >
>>> > The corresponding Hadoop Reducer signature is
>>> > ---
>>> > reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output,
>>> > Reporter reporter)
>>> > ---
>>> >
>>> > I assume the Crunch use of Iterable is for convenient use in "for"
>>> > loops.
>>> >
>>> > Unfortunately, the behavior of this Iterable seems to return the same
>>> > Iterator object each time Iterable.iterator() is called.
>>> >
>>> > This makes sense to me based on the underlying hadoop mapreduce, but
>>> > violates what I think most expect from the Iterable interface.
>>> >
>>> > I understand that it's too late to change the interface, but could we
>>> > at least have a javadoc or an exception thrown if the Iterable is used
>>> > more than once?
>>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
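
For reference, here is a minimal sketch of the two behaviours discussed in this thread, written in plain Java rather than against Crunch itself. It first mimics an Iterable that hands back the same Iterator on every call (so a second pass silently sees nothing), then shows one possible way to enforce single use by throwing. The names SingleUseIterable and SingleUseIterableDemo are hypothetical and are not Crunch classes; how Crunch actually addresses this is tracked in CRUNCH-192.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SingleUseIterableDemo {
    // Sums the values with an ordinary for-each loop, as a DoFn body might.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3);

        // Mimics the reported behaviour: every iterator() call returns the
        // same Iterator object, so the second pass finds it exhausted.
        Iterator<Integer> shared = data.iterator();
        Iterable<Integer> reused = () -> shared;
        System.out.println(sum(reused)); // 6
        System.out.println(sum(reused)); // 0

        // Guarded variant: the second pass fails loudly instead.
        Iterable<Integer> guarded = new SingleUseIterable<>(data.iterator());
        System.out.println(sum(guarded)); // 6
        try {
            sum(guarded);
        } catch (IllegalStateException e) {
            System.out.println("second pass rejected: " + e.getMessage());
        }
    }
}

// Hypothetical wrapper (an assumption for illustration, not part of Crunch):
// hands out its underlying Iterator once, then throws on any further call.
final class SingleUseIterable<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean consumed = false;

    SingleUseIterable(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            throw new IllegalStateException("values can only be iterated once");
        }
        consumed = true;
        return source;
    }
}
```

Throwing on the second iterator() call trades convenience for an early, visible failure; the caching/spilling idea mentioned above would instead make repeated iteration work, at the cost of memory or disk.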
