Thanks Josh. I logged https://issues.apache.org/jira/browse/CRUNCH-569 and will try submitting a patch for this.
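
For anyone following along, here is a minimal sketch of the suspected failure mode, written without Crunch, Spark, or Avro so it runs on its own. It assumes the reader hands back the same mutable instance for every record (as readers that reuse objects do) and that the cache stores references rather than copies. The names `Rec`, `cache`, and `values` are hypothetical and exist only for illustration; this is not the Crunch code path or the eventual patch, just the shape of the "deep copy before caching" idea.

```java
import java.util.ArrayList;
import java.util.List;

public class CacheReuseSketch {

    // Hypothetical mutable record standing in for a deserialized Avro value.
    static final class Rec {
        String value;

        Rec(String value) {
            this.value = value;
        }

        // Deep copy: a fresh instance that shares no mutable state.
        Rec copy() {
            return new Rec(value);
        }
    }

    // Simulates reading inputs into a single reused instance and caching
    // either the shared reference or a deep copy of it.
    static List<Rec> cache(String[] inputs, boolean deepCopy) {
        Rec reused = new Rec(null);
        List<Rec> cached = new ArrayList<>();
        for (String in : inputs) {
            reused.value = in; // the "reader" overwrites the same object
            cached.add(deepCopy ? reused.copy() : reused);
        }
        return cached;
    }

    // Extracts the values seen by a downstream consumer of the cache.
    static List<String> values(List<Rec> recs) {
        List<String> out = new ArrayList<>();
        for (Rec r : recs) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(values(cache(inputs, false))); // prints [c, c, c]
        System.out.println(values(cache(inputs, true)));  // prints [a, b, c]
    }
}
```

Without the copy, every cached entry aliases the last record read; with the copy, each entry keeps its own value.
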

On Thu, Oct 8, 2015 at 1:51 PM Josh Wills <[email protected]> wrote:

> Yeah, I could see how that would happen. I think the move would be to
> inject a deep copy inside of the RDD that is underneath a cached
> PCollection. I can probably take a crack at a patch later this weekend, I
> have a busy couple of days w/the baby and new job and what not. :)
>
> J
>
> On Thu, Oct 8, 2015 at 9:32 AM, Nithin Asokan <[email protected]> wrote:
>
>> First I would like to thank everyone on the quick response and fixes on
>> most issues. Great job everyone!
>>
>> I noticed that using cache() on PTable built using SparkPipeline seems to
>> reuse object for downstream DoFn's. Here is an example that exhibits this
>> behavior
>>
>> https://gist.github.com/nasokan/531b4ff9bf827d0835ab
>>
>> I would expect the output of this program to create a pair with same key,
>> value. However, this produces Pair with different key value. I have tested
>> this with text file input source and it works as expected. Removing cache()
>> also produces expected result. So I'm suspecting this issue to be specific
>> to avro and cache().
>>
>> Any thoughts on this behavior?
>>
>> Thank you!
>> Nithin
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
