Arrays are mutable and do not have the equals semantics you want when using them as a key. Use a Scala immutable List instead. On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" <y...@ford.com> wrote:
> Yes. I was using String array as arguments in the reduceByKey. I think
> String array is actually immutable and simply returning the first argument
> without cloning one should work. I will look into mapPartitions as we can
> have up to 40% duplicates. Will follow up on this if necessary. Thanks very
> much Sean!
>
> -Yao
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, October 09, 2014 3:04 AM
> To: Ge, Yao (Y.)
> Cc: user@spark.apache.org
> Subject: Re: Dedup
>
> I think the question is about copying the argument. If it's an immutable
> value like String, yes, just return the first argument and ignore the
> second. If you're dealing with a notoriously mutable value like a Hadoop
> Writable, you need to copy the value you return.
>
> This works fine, although you will spend a fair bit of time marshaling all
> of those duplicates together just to discard all but one.
>
> If there are lots of duplicates, it would take a bit more work, but would
> be faster, to do something like this: mapPartitions and retain one input
> value per unique dedup criterion, then output those pairs, and then
> reduceByKey the result.
>
> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
> > I need to do deduplication processing in Spark. The current plan is to
> > generate a tuple where the key is the dedup criteria and the value is
> > the original input. I am thinking of using reduceByKey to discard
> > duplicate values. If I do that, can I simply return the first argument,
> > or should I return a copy of the first argument? Is there a better way
> > to do dedup in Spark?
> >
> > -Yao
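To see why Array keys break deduplication, here is a plain-Scala sketch (no Spark required) of the equality semantics involved; `reduceByKey` relies on the same `equals`/`hashCode` contract that `groupBy` does here:

```scala
object ArrayKeyDemo {
  def main(args: Array[String]): Unit = {
    // Arrays compare by reference, so two arrays with identical
    // contents are NOT equal and hash differently.
    val a1 = Array("dup", "key")
    val a2 = Array("dup", "key")
    println(a1 == a2)               // false: reference equality
    println(a1.toList == a2.toList) // true: structural equality

    // Keying by Array fails to collapse duplicates; keying by an
    // immutable List works as intended.
    val pairs = Seq(a1 -> "first", a2 -> "second")
    println(pairs.groupBy(_._1).size)        // 2 -- duplicates kept apart
    println(pairs.groupBy(_._1.toList).size) // 1 -- duplicates collapsed
  }
}
```

The same conversion (`.toList` on the key before `reduceByKey`) is the usual fix on the Spark side.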
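Sean's mapPartitions-then-reduceByKey suggestion can be sketched as a per-partition pre-dedup function. This is only an illustration; `dedupPartition` is a hypothetical helper name, written over a plain `Iterator` so it is exactly the shape `mapPartitions` expects:

```scala
import scala.collection.mutable

object PartitionDedup {
  // Keep the first value seen per key within one partition. Passing this
  // to mapPartitions discards most duplicates locally, so far fewer
  // records are shuffled before the final reduceByKey.
  def dedupPartition[K, V](it: Iterator[(K, V)]): Iterator[(K, V)] = {
    val seen = mutable.LinkedHashMap.empty[K, V]
    it.foreach { case (k, v) => if (!seen.contains(k)) seen += (k -> v) }
    seen.iterator
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator(
      (List("a"), "rec1"), (List("a"), "rec2"), (List("b"), "rec3"))
    println(dedupPartition(partition).toList)
    // List((List(a),rec1), (List(b),rec3))
  }
}
```

On an RDD of (dedup-criteria, record) pairs this would be used roughly as `rdd.mapPartitions(PartitionDedup.dedupPartition).reduceByKey((v, _) => v)`; the final reduceByKey is still needed because the same key can appear in more than one partition.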