subject:"skewed data in join"

Re: skewed data in join

2017-02-17 Thread Jon Gregg

It depends on how you salt it. See slide 40 onwards of this Spark Summit
talk:
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
The speakers use a mod-8 integer salt appended to the end of the key; the
salt that works best for you might be different.
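
A minimal sketch of the salting idea, in plain Python rather than Spark so it runs anywhere. The function names, the sample rows, and the in-memory join are illustrative assumptions, not code from the talk; only the mod-8 salt comes from the slides. The skewed side gets a random salt in [0, 8) attached to each key, and the other side is copied once per salt so every salted key still finds its match.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 8  # mod-8 salt as in the talk; the best value depends on your data


def salt_skewed_side(rows, num_salts=NUM_SALTS, rng=random):
    """Attach a random salt in [0, num_salts) to each key of the skewed side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def replicate_other_side(rows, num_salts=NUM_SALTS):
    """Copy each row of the other side once per salt value."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """In-memory equi-join on (key, salt); a stand-in for the distributed join."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Drop the salt again when emitting the joined row.
    return [(key[0], lv, rv) for key, lv in left for rv in index.get(key, [])]


# Hypothetical data: one hot key carries almost all rows.
facts = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
dims = [("hot", "H"), ("cold", "C")]

salted_facts = salt_skewed_side(facts, rng=random.Random(0))
joined = hash_join(salted_facts, replicate_other_side(dims))

# Rows per salted key: the hot key is now spread over up to NUM_SALTS groups.
load_per_salted_key = Counter(key for key, _ in salted_facts)
```

The price is that the non-skewed side grows by a factor of NUM_SALTS, so a salted join can be slower than the plain one when the skew is mild or the replicated side is large.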

On Thu, Feb 16, 2017 at 12:40 PM, Gourav Sengupta wrote:

> Hi,
>
> Thanks for your kind response. Using a hash key built from random numbers
> increases the time for processing the data. My join for the entire month
> finishes within 150 seconds for 471 million records and then runs for
> another 6 minutes on 55 million records.
>
> Using hash keys increases the processing time to 11 minutes, so I am not
> clear why I should do that. The overall idea was for the entire processing
> of around 520 million records to take maybe another 10 seconds more.
>
>
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 16, 2017 at 4:54 PM, Anis Nasir wrote:
>
>> You can also do something similar to what is described in [1].
>>
>> The basic idea is to use two hash functions for each key and assign the
>> key to the less loaded of the two hashed workers.
>>
>> Cheers,
>> Anis
>>
>>
>> [1]. https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>>
>>
>> On Fri, 17 Feb 2017 at 01:34, Yong Zhang wrote:
>>
>>> Yes. You have to change your key, or in big data terms, "add salt".
>>>
>>>
>>> Yong
>>>
>>> --
>>> *From:* Gourav Sengupta 
>>> *Sent:* Thursday, February 16, 2017 11:11 AM
>>> *To:* user
>>> *Subject:* skewed data in join
>>>
>>> Hi,
>>>
>>> Is there a way to do multiple reducers for joining on skewed data?
>>>
>>> Regards,
>>> Gourav
>>>
>>
>

Re: skewed data in join

2017-02-16 Thread Gourav Sengupta

Hi,

Thanks for your kind response. Using a hash key built from random numbers
increases the time for processing the data. My join for the entire month
finishes within 150 seconds for 471 million records and then runs for
another 6 minutes on 55 million records.

Using hash keys increases the processing time to 11 minutes, so I am not
clear why I should do that. The overall idea was for the entire processing
of around 520 million records to take maybe another 10 seconds more.

Regards,
Gourav Sengupta

On Thu, Feb 16, 2017 at 4:54 PM, Anis Nasir wrote:

> You can also do something similar to what is described in [1].
>
> The basic idea is to use two hash functions for each key and assign the
> key to the less loaded of the two hashed workers.
>
> Cheers,
> Anis
>
>
> [1]. https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>
>
> On Fri, 17 Feb 2017 at 01:34, Yong Zhang wrote:
>
>> Yes. You have to change your key, or in big data terms, "add salt".
>>
>>
>> Yong
>>
>> --
>> *From:* Gourav Sengupta 
>> *Sent:* Thursday, February 16, 2017 11:11 AM
>> *To:* user
>> *Subject:* skewed data in join
>>
>> Hi,
>>
>> Is there a way to do multiple reducers for joining on skewed data?
>>
>> Regards,
>> Gourav
>>
>

Re: skewed data in join

2017-02-16 Thread Anis Nasir

You can also do something similar to what is described in [1].

The basic idea is to use two hash functions for each key and assign the
key to the less loaded of the two hashed workers.

Cheers,
Anis

[1].
https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
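
A rough sketch of the two-hash idea in plain Python. The function names, the MD5-based hashes, and the single-process load counter are assumptions made for illustration; the paper targets stream processing engines and estimates load differently. Each record is sent to whichever of its key's two candidate workers currently holds fewer records.

```python
import hashlib


def candidate_worker(key, seed, num_workers):
    """Deterministic hash of (seed, key) mapped onto a worker index."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def route(keys, num_workers):
    """Send each record to the less loaded of its key's two candidate workers."""
    load = [0] * num_workers
    placement = []
    for key in keys:
        first = candidate_worker(key, 0, num_workers)
        second = candidate_worker(key, 1, num_workers)
        chosen = first if load[first] <= load[second] else second
        load[chosen] += 1
        placement.append((key, chosen))
    return load, placement


# Hypothetical stream dominated by a single hot key.
stream = ["hot"] * 1000
load, placement = route(stream, num_workers=8)
```

A hot key is split over at most two workers, so this at best halves the load of the heaviest key. For a join, the matching rows from the other side would have to be available on both candidate workers.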

On Fri, 17 Feb 2017 at 01:34, Yong Zhang wrote:

> Yes. You have to change your key, or in big data terms, "add salt".
>
>
> Yong
>
> --
> *From:* Gourav Sengupta 
> *Sent:* Thursday, February 16, 2017 11:11 AM
> *To:* user
> *Subject:* skewed data in join
>
> Hi,
>
> Is there a way to do multiple reducers for joining on skewed data?
>
> Regards,
> Gourav
>

Re: skewed data in join

2017-02-16 Thread Yong Zhang

Yes. You have to change your key, or in big data terms, "add salt".


Yong


From: Gourav Sengupta 
Sent: Thursday, February 16, 2017 11:11 AM
To: user
Subject: skewed data in join

Hi,

Is there a way to do multiple reducers for joining on skewed data?

Regards,
Gourav

skewed data in join

2017-02-16 Thread Gourav Sengupta

Hi,

Is there a way to do multiple reducers for joining on skewed data?

Regards,
Gourav
