Re: Re: Re: spark sql data skew

2018-07-23 Thread Gourav Sengupta
https://docs.databricks.com/spark/latest/spark-sql/skew-join.html

The above might help, in case you are using a join.

On Mon, Jul 23, 2018 at 4:49 AM, 崔苗 wrote:
> but how to get count(distinct userId) group by company from count(distinct
> userId) group by company+x? count(userId) is

Re: Re: Re: spark sql data skew

2018-07-22 Thread 崔苗
but how to get count(distinct userId) group by company from count(distinct userId) group by company+x? count(userId) is different from count(distinct userId)

On 2018-07-21 00:49:58, Xiaomeng Wan wrote:
> try divide and conquer, create a column x for the first character of userid, and group by
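The per-bucket counts can simply be summed: each userId falls into exactly one first-character bucket, so the (company, x) groups partition the distinct userIds and their sizes add up to count(distinct userId) per company. A minimal pure-Python sketch of the two phases, using made-up toy data (not the poster's 30G dataset):

```python
from collections import defaultdict

# Toy records of (company, userId); the data is invented for illustration.
records = [
    ("acme", "u1"), ("acme", "u2"), ("acme", "u2"), ("acme", "v9"),
    ("beta", "u1"), ("beta", "w3"), ("beta", "w3"),
]

# Phase 1: distinct userIds per (company, first character of userId).
buckets = defaultdict(set)
for company, user in records:
    buckets[(company, user[0])].add(user)

# Phase 2: sum the bucket sizes per company. The buckets are disjoint
# (each userId has exactly one first character), so the sums equal
# count(distinct userId) per company.
totals = defaultdict(int)
for (company, _), users in buckets.items():
    totals[company] += len(users)

print(dict(totals))  # {'acme': 3, 'beta': 2}
```

Note this only works for a partitioning salt like a userId prefix; a random salt would require a second distinct pass rather than a plain sum.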

Re: Re: spark sql data skew

2018-07-20 Thread Xiaomeng Wan
Try divide and conquer: create a column x for the first character of userid, and group by company+x. If the groups are still too large, try the first two characters.

On 17 July 2018 at 02:25, 崔苗 wrote:
> 30G user data, how to get distinct users count after creating a composite
> key based on company and userid?
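The point of the extra column is that the one hot company group splits into many smaller (company, x) groups. A toy pure-Python illustration of the group-size effect; the company names, userIds, and sizes are all invented:

```python
from collections import Counter

# Hypothetical skewed data: one company dominates; its userIds are
# digit strings so their first characters vary.
records = [("megacorp", str(i)) for i in range(1000)] + [("tiny", "a1")]

# Grouping by company alone leaves one huge group (the skewed key).
plain = Counter(company for company, _ in records)
print(plain.most_common(1))  # [('megacorp', 1000)]

# Grouping by (company, first character of userId) splits the hot group.
salted = Counter((company, user[0]) for company, user in records)
print(max(salted.values()))  # largest salted group is far below 1000
```

If the largest salted group is still too big, lengthening the prefix (e.g. `user[:2]`) splits it further, exactly as suggested above.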

Re: spark sql data skew

2018-07-13 Thread Jean Georges Perrin
Just thinking out loud… repartition by key? Create a composite key based on company and userid? How big is your dataset?

> On Jul 13, 2018, at 06:20, 崔苗 wrote:
>
> Hi,
> when I want to count(distinct userId) by company, I hit data skew and the
> task takes too long; how to count
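Repartitioning by a composite key spreads a hot company's rows across many partitions, whereas partitioning on company alone pins them all to one. A rough pure-Python sketch of why; the partition count and records are made up, and the `part` helper only mimics hash partitioning, it is not Spark's partitioner:

```python
import hashlib

def part(key: str, n: int = 8) -> int:
    # Deterministic partition id from a stable hash of the key
    # (a hypothetical stand-in for a hash partitioner; n is arbitrary).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

records = [("megacorp", f"u{i}") for i in range(100)]

# Partitioning by company alone: the hot company's rows all land in a
# single partition, so one task does all the work.
parts_by_company = {part(c) for c, _ in records}

# Partitioning by the composite key company|userId spreads those same
# rows across many partitions.
parts_by_composite = {part(c + "|" + u) for c, u in records}

print(len(parts_by_company), len(parts_by_composite))
```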