Arrays are mutable and do not have the equals semantics you want when using them as a key. Use a Scala immutable List instead. On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" <y...@ford.com> wrote:
> Yes. I was using String array as arguments in the reduceByKey. I think
> String array is actually immutable and simply returning the first argument
> without cloning one should work. I will look into mapPartitions as we can
> have up to 40% duplicates. Will follow up on this if necessary. Thanks very
> much Sean!
>
> -Yao
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, October 09, 2014 3:04 AM
> To: Ge, Yao (Y.)
> Cc: user@spark.apache.org
> Subject: Re: Dedup
>
> I think the question is about copying the argument. If it's an immutable
> value like String, yes, just return the first argument and ignore the
> second. If you're dealing with a notoriously mutable value like a Hadoop
> Writable, you need to copy the value you return.
>
> This works fine, although you will spend a fair bit of time marshaling all
> of those duplicates together just to discard all but one.
>
> If there are lots of duplicates, it would take a bit more work, but would
> be faster, to do something like this: mapPartitions and retain one input
> value per unique dedup criterion, then output those pairs, and then
> reduceByKey the result.
>
> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
> > I need to do deduplication processing in Spark. The current plan is to
> > generate a tuple where the key is the dedup criteria and the value is
> > the original input. I am thinking of using reduceByKey to discard
> > duplicate values. If I do that, can I simply return the first argument,
> > or should I return a copy of the first argument? Is there a better way
> > to do dedup in Spark?
> >
> > -Yao
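To see why Array keys break deduplication, here is a plain-Scala sketch (no Spark required) of the equality semantics involved; `reduceByKey` relies on the same `equals`/`hashCode` contract that `groupBy` does here:

```scala
object ArrayKeyDemo {
  def main(args: Array[String]): Unit = {
    // Arrays compare by reference, so two arrays with identical
    // contents are NOT equal and hash differently.
    val a1 = Array("dup", "key")
    val a2 = Array("dup", "key")
    println(a1 == a2)               // false: reference equality
    println(a1.toList == a2.toList) // true: structural equality

    // Keying by Array fails to collapse duplicates; keying by an
    // immutable List works as intended.
    val pairs = Seq(a1 -> "first", a2 -> "second")
    println(pairs.groupBy(_._1).size)        // 2 -- duplicates kept apart
    println(pairs.groupBy(_._1.toList).size) // 1 -- duplicates collapsed
  }
}
```

The same conversion (`.toList` on the key before `reduceByKey`) is the usual fix on the Spark side.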
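Sean's mapPartitions-then-reduceByKey suggestion can be sketched as a per-partition pre-dedup function. This is only an illustration; `dedupPartition` is a hypothetical helper name, written over a plain `Iterator` so it is exactly the shape `mapPartitions` expects:

```scala
import scala.collection.mutable

object PartitionDedup {
  // Keep the first value seen per key within one partition. Passing this
  // to mapPartitions discards most duplicates locally, so far fewer
  // records are shuffled before the final reduceByKey.
  def dedupPartition[K, V](it: Iterator[(K, V)]): Iterator[(K, V)] = {
    val seen = mutable.LinkedHashMap.empty[K, V]
    it.foreach { case (k, v) => if (!seen.contains(k)) seen += (k -> v) }
    seen.iterator
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator(
      (List("a"), "rec1"), (List("a"), "rec2"), (List("b"), "rec3"))
    println(dedupPartition(partition).toList)
    // List((List(a),rec1), (List(b),rec3))
  }
}
```

On an RDD of (dedup-criteria, record) pairs this would be used roughly as `rdd.mapPartitions(PartitionDedup.dedupPartition).reduceByKey((v, _) => v)`; the final reduceByKey is still needed because the same key can appear in more than one partition.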