try divide and conquer: create a column x holding the first character of userId, then group by (company, x) and count distinct within each group. Since userIds starting with different characters can never be equal, the per-group distinct counts are disjoint and you can just sum them per company. If the groups are still too large, use the first two characters instead.
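A minimal sketch of the idea in plain Python (not Spark, just to show why summing the bucket counts is correct — the data and names are made up):

```python
# Prefix-split distinct count: userIds that start with different
# characters can never collide, so count(distinct userId) per
# (company, first-char) bucket sums to the true per-company count.
from collections import defaultdict

rows = [
    ("acme", "alice"), ("acme", "alice"), ("acme", "bob"),
    ("acme", "carol"), ("danale", "alice"), ("danale", "dave"),
]

# stage 1: collect distinct users per (company, prefix) bucket
buckets = defaultdict(set)
for company, user in rows:
    buckets[(company, user[0])].add(user)

# stage 2: sum the disjoint bucket counts per company
totals = defaultdict(int)
for (company, _prefix), users in buckets.items():
    totals[company] += len(users)

print(dict(totals))  # {'acme': 3, 'danale': 2}
```

In Spark SQL the same two stages would be an inner GROUP BY company, substr(userId, 1, 1) computing count(distinct userId), wrapped in an outer GROUP BY company that sums the partial counts — the inner aggregation spreads one hot company key across many smaller keys, which is what breaks the skew.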
On 17 July 2018 at 02:25, 崔苗 <cuim...@danale.com> wrote:
> 30G user data, how to get distinct users count after creating a composite
> key based on company and userid?
>
> On 2018-07-13 18:24:52, Jean Georges Perrin <j...@jgp.net> wrote:
>
> Just thinking out loud… repartition by key? create a composite key based
> on company and userid?
>
> How big is your dataset?
>
> On Jul 13, 2018, at 06:20, 崔苗 <cuim...@danale.com> wrote:
>
> Hi,
> when I want to count(distinct userId) by company, I hit data skew and
> the task takes too long. How to count distinct by keys on skewed data in
> Spark SQL?
>
> thanks for any reply