RE: Dedup

2016-01-12 Thread gpmacalalad
sowen wrote
> Arrays are not immutable and do not have the equals semantics you want to
> use them as a key.  Use a Scala immutable List.
> On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" <yge@...> wrote:
>
>> Yes. I was using String array as arguments in the reduceByKey. I think
>> String array is actually immutable and simply returning the first
>> argument without cloning one should work. I will look into mapPartitions
>> as we can have up to 40% duplicates. Will follow up on this if
>> necessary. Thanks very much Sean!
>>
>> -Yao
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:sowen@...]
>> Sent: Thursday, October 09, 2014 3:04 AM
>> To: Ge, Yao (Y.)
>> Cc: user@spark.apache.org
>> Subject: Re: Dedup
>>
>> I think the question is about copying the argument. If it's an immutable
>> value like String, yes just return the first argument and ignore the
>> second. If you're dealing with a notoriously mutable value like a Hadoop
>> Writable, you need to copy the value you return.
>>
>> This works fine although you will spend a fair bit of time marshaling all
>> of those duplicates together just to discard all but one.
>>
>> If there are lots of duplicates, it would take a bit more work, but would
>> be faster, to do something like this: mapPartitions and retain one input
>> value for each unique dedup key, then output those pairs, and then
>> reduceByKey the result.
>>
>> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <yge@...> wrote:
>> > I need to do deduplication processing in Spark. The current plan is to
>> > generate a tuple where key is the dedup criteria and value is the
>> > original input. I am thinking to use reduceByKey to discard duplicate
>> > values. If I do that, can I simply return the first argument or should
>> > I return a copy of the first argument. Is there a better way to do
>> > dedup in Spark?
>> >
>> > -Yao
>>

Hi, I'm a bit new to Scala/Spark. We are doing data deduplication; so far I
can handle exact matching for 3M lines of data, but I'm at a dilemma on fuzzy
matching using cosine similarity and Jaro-Winkler. My biggest problem is how
to optimize my method so that it finds matches with 90% similarity or above.
I am planning to group records first before matching, but this may result in
missing out some important matches. Can someone help me? Much appreciated.
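One common sketch for this kind of fuzzy match is cosine similarity over character n-grams, combined with a blocking key to cut down the number of pairwise comparisons. The code below is a minimal, hypothetical illustration only: the bigram choice, the first-letter blocking key, and the 0.9 default threshold are assumptions, not a recommendation for this particular data. It also makes the grouping trade-off concrete: records whose blocking keys differ are never compared.

```scala
// Cosine similarity over character bigrams: a cheap fuzzy-match score in [0, 1].
def bigrams(s: String): Map[String, Int] =
  s.toLowerCase.sliding(2).toVector.groupBy(identity).map { case (g, xs) => (g, xs.size) }

def cosine(a: String, b: String): Double = {
  val (va, vb) = (bigrams(a), bigrams(b))
  val dot = va.keySet.intersect(vb.keySet).iterator.map(g => va(g).toDouble * vb(g)).sum
  def norm(v: Map[String, Int]) = math.sqrt(v.values.map(x => x.toDouble * x).sum)
  val n = norm(va) * norm(vb)
  if (n == 0) 0.0 else dot / n
}

// Blocking: only compare records that share a coarse key (here: first letter).
// This is exactly the grouping trade-off above: a typo in the first character
// puts a record in a different block, so that match is missed.
def blockKey(s: String): String = s.toLowerCase.take(1)

// Compare all pairs within each block; keep pairs at or above the threshold.
def fuzzyPairs(names: Seq[String], threshold: Double = 0.9): Seq[(String, String)] =
  names.groupBy(blockKey).values.toSeq.flatMap { block =>
    for {
      i <- block.indices
      j <- (i + 1) until block.size
      if cosine(block(i), block(j)) >= threshold
    } yield (block(i), block(j))
  }
```

A looser blocking key (e.g. a phonetic code or a sorted-token prefix) trades more comparisons for fewer missed matches.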



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dedup-tp15967p25951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Dedup

2014-10-09 Thread Sean Owen
Arrays are not immutable and do not have the equals semantics you want to
use them as a key.  Use a Scala immutable List.
On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)"  wrote:

> Yes. I was using String array as arguments in the reduceByKey. I think
> String array is actually immutable and simply returning the first argument
> without cloning one should work. I will look into mapPartitions as we can
> have up to 40% duplicates. Will follow up on this if necessary. Thanks very
> much Sean!
>
> -Yao
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, October 09, 2014 3:04 AM
> To: Ge, Yao (Y.)
> Cc: user@spark.apache.org
> Subject: Re: Dedup
>
> I think the question is about copying the argument. If it's an immutable
> value like String, yes just return the first argument and ignore the
> second. If you're dealing with a notoriously mutable value like a Hadoop
> Writable, you need to copy the value you return.
>
> This works fine although you will spend a fair bit of time marshaling all
> of those duplicates together just to discard all but one.
>
> If there are lots of duplicates, it would take a bit more work, but would
> be faster, to do something like this: mapPartitions and retain one input
> value for each unique dedup key, then output those pairs, and then
> reduceByKey the result.
>
> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.)  wrote:
> > I need to do deduplication processing in Spark. The current plan is to
> > generate a tuple where key is the dedup criteria and value is the
> > original input. I am thinking to use reduceByKey to discard duplicate
> > values. If I do that, can I simply return the first argument or should
> > I return a copy of the first argument. Is there a better way to do
> dedup in Spark?
> >
> >
> >
> > -Yao
>
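Sean's point about arrays is easy to verify in plain Scala: `Array` uses reference equality and identity hash codes, so two equal-content arrays act as distinct keys, while an immutable `List` compares structurally. A small demonstration:

```scala
val a1 = Array("x", "y")
val a2 = Array("x", "y")

// Arrays compare by reference: same contents, but not "equal".
assert(!(a1 == a2))
assert(a1.sameElements(a2))  // element-wise comparison must be requested explicitly

// Lists compare structurally, so they behave correctly as keys.
assert(List("x", "y") == List("x", "y"))

// Grouping by an Array key fails to merge duplicates; List keys merge.
assert(Seq(a1 -> 1, a2 -> 2).groupBy(_._1).size == 2)
assert(Seq(a1.toList -> 1, a2.toList -> 2).groupBy(_._1).size == 1)
```

The same applies to `reduceByKey`: with `Array` keys, duplicate rows hash to different buckets and are never combined.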


RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
Yes. I was using String array as arguments in the reduceByKey. I think String 
array is actually immutable and simply returning the first argument without 
cloning one should work. I will look into mapPartitions as we can have up to 
40% duplicates. Will follow up on this if necessary. Thanks very much Sean!

-Yao  

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, October 09, 2014 3:04 AM
To: Ge, Yao (Y.)
Cc: user@spark.apache.org
Subject: Re: Dedup

I think the question is about copying the argument. If it's an immutable value 
like String, yes just return the first argument and ignore the second. If 
you're dealing with a notoriously mutable value like a Hadoop Writable, you 
need to copy the value you return.

This works fine although you will spend a fair bit of time marshaling all of 
those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but would be 
faster, to do something like this: mapPartitions and retain one input value 
for each unique dedup key, then output those pairs, and then reduceByKey 
the result.

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.)  wrote:
> I need to do deduplication processing in Spark. The current plan is to 
> generate a tuple where key is the dedup criteria and value is the 
> original input. I am thinking to use reduceByKey to discard duplicate 
> values. If I do that, can I simply return the first argument or should 
> I return a copy of the first argument. Is there a better way to do dedup in 
> Spark?
>
>
>
> -Yao


Re: Dedup

2014-10-09 Thread Sean Owen
I think the question is about copying the argument. If it's an
immutable value like String, yes just return the first argument and
ignore the second. If you're dealing with a notoriously mutable value
like a Hadoop Writable, you need to copy the value you return.
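For the immutable case, the reduce function really is just `(a, _) => a`. A hedged sketch follows; the `Record` type and key extractor are hypothetical, the Spark call is shown only in a comment, and the plain-Scala analogue below it runs without a cluster:

```scala
case class Record(id: String, payload: String)  // hypothetical record type
def dedupKey(r: Record): String = r.id          // hypothetical dedup criteria

// In Spark this would be roughly:
//   rdd.map(r => (dedupKey(r), r)).reduceByKey((a, _) => a).values
// Record is immutable, so returning the first argument without copying is safe.
// For a mutable value (e.g. a reused Hadoop Writable) you would copy it first.

// Local analogue: keep the first record seen for each key.
def dedupLocal(records: Seq[Record]): Seq[Record] = {
  val seen = scala.collection.mutable.HashSet.empty[String]
  records.filter(r => seen.add(dedupKey(r)))  // add() is false for repeats
}
```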

This works fine although you will spend a fair bit of time marshaling
all of those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but
would be faster, to do something like this: mapPartitions and retain
one input value for each unique dedup key, then output those
pairs, and then reduceByKey the result.
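That two-stage idea can be sketched as follows (hedged: the key type and the surrounding Spark calls are illustrative). The per-partition step drops duplicates locally before the shuffle, so with something like 40% duplicates far less data gets marshaled:

```scala
// Per-partition pre-dedup: keep the first value per key within one partition.
def dedupPartition[K, V](it: Iterator[(K, V)]): Iterator[(K, V)] = {
  val seen = scala.collection.mutable.LinkedHashMap.empty[K, V]
  for ((k, v) <- it) if (!seen.contains(k)) seen += (k -> v)
  seen.iterator
}

// In Spark (illustrative):
//   rdd.map(r => (key(r), r))
//      .mapPartitions(dedupPartition)   // local dedup, before the shuffle
//      .reduceByKey((a, _) => a)        // resolve duplicates across partitions
```

The final `reduceByKey` is still needed because the same key can survive in several partitions; the partition step only shrinks what gets shuffled.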

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.)  wrote:
> I need to do deduplication processing in Spark. The current plan is to
> generate a tuple where key is the dedup criteria and value is the original
> input. I am thinking to use reduceByKey to discard duplicate values. If I do
> that, can I simply return the first argument or should I return a copy of
> the first argument. Is there a better way to do dedup in Spark?
>
>
>
> -Yao




Re: Dedup

2014-10-08 Thread Akhil Das
If you are looking to eliminate duplicate rows (or similar), you can
define a key from the data and then reduceByKey on that key.
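For example, a dedup key built from normalized fields (the `Contact` type and its fields are hypothetical, and the Spark call is shown only as a comment):

```scala
case class Contact(first: String, last: String, email: String)  // hypothetical
def dedupKey(c: Contact): (String, String) =
  (c.first.trim.toLowerCase, c.last.trim.toLowerCase)

// In Spark (illustrative): rdd.keyBy(dedupKey).reduceByKey((a, _) => a).values
// Local analogue using groupBy, keeping the first record per key:
def dedup(contacts: Seq[Contact]): Seq[Contact] =
  contacts.groupBy(dedupKey).values.map(_.head).toSeq
```

Normalizing inside the key (trim, lowercase) decides which near-identical rows count as duplicates, which is exactly the "dedup criteria" the thread discusses.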

Thanks
Best Regards

On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal  wrote:

> What is your data like? Are you looking at exact matching or are you
> interested in nearly same records? Do you need to merge similar records to
> get a canonical value?
>
> Best Regards,
> Sonal
> Nube Technologies 
>
> 
>
>
>
> On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier 
> wrote:
>
>> Maybe you could implement something like this (I don't know if something
>> similar already exists in Spark):
>>
>> http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
>>
>> Best,
>> Flavio
>> On Oct 8, 2014 9:58 PM, "Nicholas Chammas" 
>> wrote:
>>
>>> Multiple values may be different, yet still be considered duplicates
>>> depending on how the dedup criteria is selected. Is that correct? Do you
>>> care in that case what value you select for a given key?
>>>
>>> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.)  wrote:
>>>
  I need to do deduplication processing in Spark. The current plan is
 to generate a tuple where key is the dedup criteria and value is the
 original input. I am thinking to use reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should I
 return a copy of the first argument. Is there a better way to do dedup in
 Spark?



 -Yao

>>>
>>>
>


Re: Dedup

2014-10-08 Thread Sonal Goyal
What is your data like? Are you looking at exact matching or are you
interested in nearly same records? Do you need to merge similar records to
get a canonical value?

Best Regards,
Sonal
Nube Technologies 





On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier 
wrote:

> Maybe you could implement something like this (I don't know if something
> similar already exists in Spark):
>
> http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
>
> Best,
> Flavio
> On Oct 8, 2014 9:58 PM, "Nicholas Chammas" 
> wrote:
>
>> Multiple values may be different, yet still be considered duplicates
>> depending on how the dedup criteria is selected. Is that correct? Do you
>> care in that case what value you select for a given key?
>>
>> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.)  wrote:
>>
>>>  I need to do deduplication processing in Spark. The current plan is to
>>> generate a tuple where key is the dedup criteria and value is the original
>>> input. I am thinking to use reduceByKey to discard duplicate values. If I
>>> do that, can I simply return the first argument or should I return a copy
>>> of the first argument. Is there a better way to do dedup in Spark?
>>>
>>>
>>>
>>> -Yao
>>>
>>
>>


Re: Dedup

2014-10-08 Thread Flavio Pompermaier
Maybe you could implement something like this (I don't know if something
similar already exists in Spark):

http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

Best,
Flavio
On Oct 8, 2014 9:58 PM, "Nicholas Chammas" 
wrote:

> Multiple values may be different, yet still be considered duplicates
> depending on how the dedup criteria is selected. Is that correct? Do you
> care in that case what value you select for a given key?
>
> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.)  wrote:
>
>>  I need to do deduplication processing in Spark. The current plan is to
>> generate a tuple where key is the dedup criteria and value is the original
>> input. I am thinking to use reduceByKey to discard duplicate values. If I
>> do that, can I simply return the first argument or should I return a copy
>> of the first argument. Is there a better way to do dedup in Spark?
>>
>>
>>
>> -Yao
>>
>
>


Re: Dedup

2014-10-08 Thread Nicholas Chammas
Multiple values may be different, yet still be considered duplicates
depending on how the dedup criteria is selected. Is that correct? Do you
care in that case what value you select for a given key?

On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.)  wrote:

>  I need to do deduplication processing in Spark. The current plan is to
> generate a tuple where key is the dedup criteria and value is the original
> input. I am thinking to use reduceByKey to discard duplicate values. If I
> do that, can I simply return the first argument or should I return a copy
> of the first argument. Is there a better way to do dedup in Spark?
>
>
>
> -Yao
>