Re: skewed data in join

2017-02-17 Thread Jon Gregg
It depends on how you salt it.  See slide 40 onwards from a Spark Summit
talk here:
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
The speakers use a mod 8 integer salt appended to the end of the key; the
salt that works best for you might be different.
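
As a rough illustration of the salting idea, here is a minimal plain-Python
sketch (not Spark code). It assumes the skewed side gets a random salt in
[0, 8) appended to its key and the other side is replicated once per salt
value, so every salted key still finds its match. The names `plain_join`,
`salted_join`, `NUM_SALTS` and the sample data are hypothetical.

```python
import random
from collections import defaultdict

NUM_SALTS = 8  # assumption: mirrors the mod 8 salt mentioned above


def plain_join(left, right):
    """Reference inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return sorted((k, lv, rv) for k, lv in left for rv in index[k])


def salted_join(skewed, other, num_salts=NUM_SALTS, seed=0):
    """Inner join in which the skewed side's keys carry a random salt."""
    rng = random.Random(seed)
    # Skewed side: append a random salt so one hot key becomes num_salts keys.
    salted = [((k, rng.randrange(num_salts)), v) for k, v in skewed]
    # Other side: replicate each row once per possible salt value.
    index = defaultdict(list)
    for k, v in other:
        for salt in range(num_salts):
            index[(k, salt)].append(v)
    # Join on the salted key, then drop the salt from the output.
    return sorted(
        (k, lv, rv) for (k, salt), lv in salted for rv in index[(k, salt)]
    )


# Hypothetical data: the key "hot" dominates the skewed side.
skewed_side = [("hot", i) for i in range(1000)] + [("cold", 1), ("cold", 2)]
other_side = [("hot", "a"), ("cold", "b"), ("unused", "c")]

result = salted_join(skewed_side, other_side)
```

The salted join returns the same rows as the plain join. Replicating the
other side is extra work, so a salted join is not automatically faster; the
salt range is something to tune.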



Re: skewed data in join

2017-02-16 Thread Gourav Sengupta
Hi,

Thanks for your kind response. Using a hash key built from random numbers
increases the time for processing the data. My join for the entire month
finishes 471 million records within 150 seconds and then takes another
6 minutes for 55 million records.

Using hash keys increases the processing time to 11 minutes, so I am not
quite clear why I should do that. The overall idea was to ensure that the
entire processing of around 520 million records takes maybe another 10
seconds more.



Regards,
Gourav Sengupta



Re: skewed data in join

2017-02-16 Thread Anis Nasir
You can also do something similar to what is mentioned in [1].

The basic idea is to use two hash functions for each key and assign it
to the less loaded of the two hashed workers.

Cheers,
Anis


[1].
https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
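
A minimal plain-Python sketch of the two-hash idea, under stated assumptions:
the worker count, the functions `h1`, `h2`, `assign` and the sample stream are
hypothetical, and the second hash is constructed so that it never picks the
same worker as the first. This is an illustration only, not the implementation
from [1].

```python
import zlib
from collections import Counter

NUM_WORKERS = 4  # hypothetical number of workers


def h1(key: str) -> int:
    """First candidate worker for a key."""
    return zlib.crc32(key.encode()) % NUM_WORKERS


def h2(key: str) -> int:
    """Second candidate worker, offset so it always differs from h1(key)."""
    offset = 1 + zlib.crc32(b"second:" + key.encode()) % (NUM_WORKERS - 1)
    return (h1(key) + offset) % NUM_WORKERS


def assign(keys):
    """Send each record to the less loaded of its two candidate workers."""
    load = Counter()
    for key in keys:
        a, b = h1(key), h2(key)
        worker = a if load[a] <= load[b] else b
        load[worker] += 1
    return load


# Hypothetical stream consisting only of one hot key.
stream = ["hot"] * 1000

single_hash_load = Counter(h1(k) for k in stream)  # everything on one worker
two_choice_load = assign(stream)                   # split across two workers
```

With a single hash all 1000 records land on one worker; with two choices the
hot key is split evenly between its two candidates. For a join, the matching
rows from the other side would presumably have to be made available to both
candidate workers.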




Re: skewed data in join

2017-02-16 Thread Yong Zhang
Yes. You have to change your key, or, in Big Data terms, "add salt".
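
To see why changing the key lets more than one reducer work on a hot key,
here is a small plain-Python sketch. The partitioner `reducer_for`, the
reducer count, the salt range and the data are all hypothetical; the salt is
assigned round-robin here only to keep the example deterministic.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 8  # hypothetical reducer count
NUM_SALTS = 8     # hypothetical salt range


def reducer_for(key: str, salt: int = 0) -> int:
    """Toy partitioner: hash of the key, shifted by the salt."""
    return (zlib.crc32(key.encode()) + salt) % NUM_REDUCERS


# Hypothetical skewed input: 800 of 1000 records share the key "hot".
records = ["hot"] * 800 + [f"k{i}" for i in range(200)]

# Without salt, every "hot" record goes to the same reducer.
unsalted = Counter(reducer_for(k) for k in records)

# With salt, the "hot" records are spread over NUM_SALTS reducers.
salted = Counter(
    reducer_for(k, i % NUM_SALTS) for i, k in enumerate(records)
)

print("max load without salt:", max(unsalted.values()))
print("max load with salt:   ", max(salted.values()))
```

The busiest reducer handles at least 800 records without salt, while with
salt each reducer receives only 100 of the "hot" records.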


Yong




skewed data in join

2017-02-16 Thread Gourav Sengupta
Hi,

Is there a way to do multiple reducers for joining on skewed data?

Regards,
Gourav