Re: Re: Re: spark sql data skew

2018-07-23 Thread Gourav Sengupta
https://docs.databricks.com/spark/latest/spark-sql/skew-join.html

The above might help if you are using a join.
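
A minimal sketch of what that page describes, assuming Databricks Runtime
(the skew hint is not part of open-source Spark); the table and column
names here are made up for illustration:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("skew-join-hint").getOrCreate()

  // Hypothetical tables: users(companyId, userId), companies(companyId, name).
  // The "skew" hint marks users.companyId as skewed so the Databricks Runtime
  // optimizer can deal with the hot keys when planning the join.
  val users = spark.table("users").hint("skew", "companyId")
  val companies = spark.table("companies")

  val joined = users.join(companies, Seq("companyId"))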

On Mon, Jul 23, 2018 at 4:49 AM, 崔苗  wrote:

> But how do we get count(distinct userId) grouped by company from
> count(distinct userId) grouped by company+x?
> count(userId) is different from count(distinct userId).
>
>
> On 2018-07-21 00:49:58, Xiaomeng Wan wrote:
>
> Try divide and conquer: create a column x from the first character of
> userId, and group by company+x. If a group is still too large, try the
> first two characters.
>
> On 17 July 2018 at 02:25, 崔苗  wrote:
>
>> 30 GB of user data. How do we get the distinct user count after creating a
>> composite key based on company and userId?
>>
>>
>> On 2018-07-13 18:24:52, Jean Georges Perrin wrote:
>>
>> Just thinking out loud… repartition by key? Create a composite key based
>> on company and userId?
>>
>> How big is your dataset?
>>
>> On Jul 13, 2018, at 06:20, 崔苗  wrote:
>>
>> Hi,
>> When I run count(distinct userId) by company, I hit data skew and the task
>> takes too long. How do I count distinct by key on skewed data in Spark SQL?
>>
>> thanks for any reply


Re: Re: spark sql data skew

2018-07-20 Thread Xiaomeng Wan
Try divide and conquer: create a column x from the first character of userId,
and group by company+x. If a group is still too large, try the first two
characters.
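
A minimal sketch of this approach in the Scala DataFrame API (untested;
table and column names are illustrative). Because each userId falls into
exactly one bucket, the per-bucket distinct sets are disjoint, so summing
them gives the exact per-company distinct count:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("bucketed-distinct").getOrCreate()

  // Hypothetical table users(company, userId), both strings.
  val users = spark.table("users")

  // Bucket userIds by first character so one hot company is split
  // into many smaller groups.
  val bucketed = users.withColumn("x", substring(col("userId"), 1, 1))

  // Exact distinct count per (company, bucket). Each userId lands in
  // exactly one bucket, so the per-bucket distinct sets are disjoint...
  val perBucket = bucketed
    .groupBy("company", "x")
    .agg(countDistinct("userId").as("cnt"))

  // ...which means summing them recovers the exact count(distinct userId)
  // per company.
  val perCompany = perBucket
    .groupBy("company")
    .agg(sum("cnt").as("distinct_users"))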

On 17 July 2018 at 02:25, 崔苗  wrote:

> 30 GB of user data. How do we get the distinct user count after creating a
> composite key based on company and userId?
>
>
> On 2018-07-13 18:24:52, Jean Georges Perrin wrote:
>
> Just thinking out loud… repartition by key? Create a composite key based
> on company and userId?
>
> How big is your dataset?
>
> On Jul 13, 2018, at 06:20, 崔苗  wrote:
>
> Hi,
> When I run count(distinct userId) by company, I hit data skew and the task
> takes too long. How do I count distinct by key on skewed data in Spark SQL?
>
> thanks for any reply
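
For reference, a minimal sketch of the repartition/composite-key idea quoted
above (untested; names are illustrative). Deduplicating on the composite
(company, userId) key shuffles by both columns, so a hot company's rows
spread across partitions instead of piling onto one:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("composite-key-distinct").getOrCreate()

  // Hypothetical table users(company, userId).
  val users = spark.table("users")

  // Repartition on the composite key, as suggested above, so the work
  // is spread by userId as well as company.
  val spread = users.repartition(col("company"), col("userId"))

  // Deduplicate the pairs; a plain count per company is then an exact
  // count(distinct userId), with no single giant per-company aggregation.
  val result = spread
    .dropDuplicates("company", "userId")
    .groupBy("company")
    .count()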