count distinct in spark sql aggregation

fightf...@163.com Wed, 09 Dec 2015 17:36:37 -0800

Hi, 
I have a use case where I need daily, weekly, or monthly active user counts 
computed from raw hourly data, which is a large dataset.
The raw data is updated continuously, and I want the distinct active 
user count for each time dimension. Can anyone suggest an 
efficient way to do this?
If I want the daily distinct active user count, would I take each hourly 
dataset for that day and run some calculation over them? My initial thought 
is to use a key-value store with a hash set holding the user IDs for each hour. 
Then I could merge and deduplicate the hourly user ID sets to get the 
daily distinct count. However, I am not sure whether this implementation 
would be efficient.
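
To make the idea concrete, here is a minimal Python sketch of the hash-set approach described above. It is only an illustration under assumptions: the names `hourly_sets` and `distinct_users_per_day` are hypothetical, the sets live in a plain in-memory dictionary rather than a real key-value store, and the sample user IDs are made up.

```python
from collections import defaultdict
from datetime import datetime


def distinct_users_per_day(hourly_sets):
    """Union the hourly user ID sets that fall on the same calendar day.

    hourly_sets maps the start of each hour (a datetime) to the set of
    user IDs seen during that hour. Returns {date: distinct user count}.
    """
    daily = defaultdict(set)
    for hour_start, user_ids in hourly_sets.items():
        # Set union removes users who were active in more than one hour.
        daily[hour_start.date()] |= user_ids
    return {day: len(users) for day, users in daily.items()}


# Made-up sample data standing in for the key-value store contents.
hourly_sets = {
    datetime(2015, 12, 9, 10): {"u1", "u2"},
    datetime(2015, 12, 9, 11): {"u2", "u3"},
    datetime(2015, 12, 10, 0): {"u1"},
}

daily_counts = distinct_users_per_day(hourly_sets)
for day, count in sorted(daily_counts.items()):
    print(day, count)
```

A weekly or monthly count would need the same union over the longer window. Summing the daily counts would not work, because a user active on several days would be counted more than once.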
I hope someone can shed some light on this.


Best,
Sun.



fightf...@163.com
