Yeah, I could see how that would happen. I think the move would be to inject a deep copy inside of the RDD that is underneath a cached PCollection. I can probably take a crack at a patch later this weekend, I have a busy couple of days w/the baby and new job and what not. :)
J On Thu, Oct 8, 2015 at 9:32 AM, Nithin Asokan <[email protected]> wrote: > First I would like to thank everyone on the quick response and fixes on > most issues. Great job everyone! > > I noticed that using cache() on PTable built using SparkPipeline seems to > reuse object for downstream DoFn's. Here is an example that exhibits this > behavior > > https://gist.github.com/nasokan/531b4ff9bf827d0835ab > > I would expect the output of this program to create a pair with same key, > value. However, this produces Pair with different key value. I have tested > this with text file input source and it works as expected. Removing cache() > also produces expected result. So I'm suspecting this issue to be specific > to avro and cache(). > > Any thoughts on this behavior? > > Thank you! > Nithin > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
