Re: skewed data in join
It depends on how you salt it. See slide 40 onwards from this Spark Summit talk: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications

The speakers append an integer salt (mod 8) to the end of the key; the salt that works best for you might be different.

On Thu, Feb 16, 2017 at 12:40 PM, Gourav Sengupta wrote:
> Hi,
>
> Thanks for your kind response. The hash key using random numbers increases
> the time for processing the data. My entire join for the entire month
> finishes within 150 seconds for 471 million records and then stays for
> another 6 mins for 55 million records.
>
> Using hash keys increases the processing time to 11 mins. Therefore I am
> not quite clear why should I do that. The overall idea was to ensure that
> the entire processing of around 520 million records in may be another 10
> seconds more.
>
> Regards,
> Gourav Sengupta
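For illustration, here is a minimal sketch of the salting idea in plain Python rather than Spark. It assumes a round-robin mod-8 salt on the skewed side and a copy of the other side for every salt value, so that each salted key still finds its match. The data and all helper names are hypothetical, not taken from the talk.

```python
from collections import Counter, defaultdict
from itertools import count

NUM_SALTS = 8  # mod-8 salt as mentioned above; an assumption to tune per dataset


def salt_large_side(rows, num_salts=NUM_SALTS):
    """Attach a round-robin salt (0..num_salts-1) to each key of the skewed side."""
    counter = count()
    return [((key, next(counter) % num_salts), value) for key, value in rows]


def replicate_small_side(rows, num_salts=NUM_SALTS):
    """Emit each row of the other side once per salt value."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Plain in-memory inner join on whatever key the rows carry."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


# Hypothetical data: the key "hot" dominates the large side.
large = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
small = [("hot", "H"), ("cold", "C")]

salted = hash_join(salt_large_side(large), replicate_small_side(small))
# Drop the salt again to recover the original key.
result = sorted((key, lv, rv) for (key, _salt), lv, rv in salted)
plain = sorted(hash_join(large, small))

# How many large-side rows share each salted join key.
rows_per_join_key = Counter(key for key, _ in salt_large_side(large))
```

The salted join returns the same rows as the plain join, but the hot key is spread over eight join keys instead of one. The price is that the other side grows by a factor of `NUM_SALTS`, which is one cost to weigh against the better balance.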
Re: skewed data in join
Hi,

Thanks for your kind response. Salting the hash key with random numbers increases the processing time. My join for the entire month finishes within 150 seconds for 471 million records and then sits for another 6 minutes on 55 million records.

With salted hash keys, the processing time goes up to 11 minutes, so I am not clear why I should do that. The overall idea was to have the entire processing of around 520 million records finish in maybe another 10 seconds more.

Regards,
Gourav Sengupta

On Thu, Feb 16, 2017 at 4:54 PM, Anis Nasir wrote:
> You can also so something similar to what is mentioned in [1].
>
> The basic idea is to use two hash functions for each key and assigning it
> to the least loaded of the two hashed worker.
>
> Cheers,
> Anis
>
> [1]. https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
Re: skewed data in join
You can also do something similar to what is described in [1].

The basic idea is to use two hash functions for each key and assign it to the less loaded of the two hashed workers.

Cheers,
Anis

[1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf

On Fri, 17 Feb 2017 at 01:34, Yong Zhang wrote:
> Yes. You have to change your key, or as BigData term, "adding salt".
>
> Yong
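A rough sketch of the two-hash idea in plain Python, assuming four workers and two seeded MD5 hashes standing in for the two hash functions. The names and data are hypothetical, and this is not the implementation from [1]. For a join, the other side's rows for a split key would also have to reach both candidate workers; that part is not shown.

```python
import hashlib

NUM_WORKERS = 4  # assumed worker count for the illustration


def stable_hash(key, seed):
    """Deterministic hash of key under a seed (built-in hash() of str varies per process)."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16)


def two_choice_assign(keys, num_workers=NUM_WORKERS):
    """Route each record to the less loaded of its key's two candidate workers."""
    load = [0] * num_workers
    for key in keys:
        a = stable_hash(key, 1) % num_workers
        b = stable_hash(key, 2) % num_workers
        target = a if load[a] <= load[b] else b
        load[target] += 1
    return load


def single_hash_assign(keys, num_workers=NUM_WORKERS):
    """Baseline: every record of a key goes to the one worker its hash selects."""
    load = [0] * num_workers
    for key in keys:
        load[stable_hash(key, 1) % num_workers] += 1
    return load


# Hypothetical skewed stream: one hot key plus 100 distinct cold keys.
keys = ["hot"] * 1000 + [f"k{i}" for i in range(100)]

two_choice = two_choice_assign(keys)
single = single_hash_assign(keys)
```

With a single hash, all 1000 hot records land on one worker. With two choices, the hot key can be split between its two candidates whenever the two hashes pick different workers, so the busiest worker in `two_choice` is never more loaded than in `single` for this input.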
Re: skewed data in join
Yes. You have to change your key or, in big data terms, "add salt".

Yong

From: Gourav Sengupta
Sent: Thursday, February 16, 2017 11:11 AM
To: user
Subject: skewed data in join

Hi,

Is there a way to do multiple reducers for joining on skewed data?

Regards,
Gourav
skewed data in join
Hi,

Is there a way to use multiple reducers when joining on skewed data?

Regards,
Gourav