Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-11 Thread Georg Heiler
For grouping against each column: look into grouping sets
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html
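What GROUPING SETS computes (several GROUP BYs over a single scan) can be sketched in plain Python with hypothetical event rows; in Spark SQL itself this would be GROUPING SETS, ROLLUP, or CUBE as described in the linked chapter:

```python
from collections import defaultdict

# Hypothetical events: (date, week, month, user_id)
events = [
    ("2019-06-03", "2019-W23", "2019-06", 1),
    ("2019-06-03", "2019-W23", "2019-06", 2),
    ("2019-06-04", "2019-W23", "2019-06", 1),
    ("2019-06-10", "2019-W24", "2019-06", 3),
]

# GROUPING SETS ((date), (week), (month)) runs several GROUP BYs over
# the same input; here each "set" is a tuple of column positions.
grouping_sets = [(0,), (1,), (2,)]

def active_users(rows, sets):
    out = defaultdict(set)
    for row in rows:
        for gset in sets:
            key = tuple(row[i] for i in gset)
            out[(gset, key)].add(row[3])  # collect distinct users per key
    return {k: len(v) for k, v in out.items()}

counts = active_users(events, grouping_sets)
print(counts[((0,), ("2019-06-03",))])  # daily actives for 2019-06-03 → 2
```

The point of grouping sets is that date, week, month, quarter, and year actives come out of one aggregation instead of five separate group-by jobs.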

On Tue, 11 Jun 2019 at 06:09, Rishi Shah <rishishah.s...@gmail.com> wrote:



Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-10 Thread Rishi Shah
Thank you both for your input!

To calculate a moving average of active users, could you comment on whether
to go for an RDD-based implementation or a DataFrame? If a DataFrame, will a
window function work here?

In general, how would Spark behave when working with a DataFrame that has
date, week, month, quarter, and year columns, grouping against each one,
one by one?
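For what it's worth, a DataFrame window function is the usual tool for a moving average (in PySpark, an avg over a Window ordered by date with rowsBetween). The frame logic it applies can be sketched in plain Python over hypothetical daily active-user counts:

```python
# Hypothetical daily active-user counts, ordered by date.
daily_active = [100, 120, 90, 110, 130, 95, 105]

def moving_average(values, window):
    """Trailing moving average over the previous `window` rows,
    the equivalent of a SQL window frame of
    rowsBetween(-(window - 1), 0)."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

ma3 = moving_average(daily_active, 3)
print(ma3[4])  # average of days 3..5: (90 + 110 + 130) / 3 → 110.0
```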




-- 
Regards,

Rishi Shah


Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread Jörn Franke
Depending on what accuracy is needed, hyperloglogs can be an interesting 
alternative 
https://en.m.wikipedia.org/wiki/HyperLogLog
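To illustrate the accuracy trade-off, here is a toy HyperLogLog estimator in plain Python. This is a teaching sketch only; in Spark you would use the built-in approx_count_distinct, which implements the production-grade HLL++ algorithm:

```python
import hashlib
import math

class TinyHLL:
    """Toy HyperLogLog estimator (illustration only)."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)           # low p bits pick a register
        w = h >> self.p                  # remaining 64 - p bits
        rho = (64 - self.p) - w.bit_length() + 1  # rank of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rho)

    def estimate(self):
        e = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:  # small-range correction
            e = self.m * math.log(self.m / zeros)
        return e

hll = TinyHLL()
for user_id in range(5000):
    hll.add(user_id)
est = hll.estimate()  # close to 5000, within a few percent
```

The memory cost is fixed (m registers) regardless of cardinality, which is what makes the approach attractive versus an exact count(distinct).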



Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread big data
In my opinion, a bitmap is the best solution for active-user calculation. Most
other solutions are based on a count(distinct) computation, which is slower.

If you've implemented a bitmap solution, including how to build and how to load
the bitmaps, then a bitmap is the best choice.
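The bitmap idea can be sketched with plain Python ints as bitsets over hypothetical data; a production pipeline would typically use compressed bitmaps such as Roaring:

```python
# Each user gets a dense integer index; each day's activity is one
# bitmap with that user's bit set.
daily_activity = {
    "2019-06-03": [0, 1, 2],
    "2019-06-04": [1, 2],
    "2019-06-05": [2, 3],
}

def to_bitmap(user_indexes):
    bm = 0
    for i in user_indexes:
        bm |= 1 << i
    return bm

bitmaps = {day: to_bitmap(users) for day, users in daily_activity.items()}

# Daily active users = popcount of a single day's bitmap.
dau = bin(bitmaps["2019-06-03"]).count("1")  # → 3

# Period active users = popcount of the OR of the period's bitmaps;
# the OR deduplicates users without any count(distinct) pass.
period = 0
for bm in bitmaps.values():
    period |= bm
pau = bin(period).count("1")  # users {0, 1, 2, 3} → 4
```

Weekly, monthly, quarterly, and yearly actives then all reduce to OR-ing the relevant daily bitmaps and taking a popcount.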



[Pyspark 2.4] Best way to define activity within different time window

2019-06-05 Thread Rishi Shah
Hi All,

Is there a best practice around calculating daily, weekly, monthly,
quarterly, yearly active users?

One approach is to create a daily bitmap and aggregate it by period later.
However, I was wondering if anyone has a better approach to tackling this
problem.

-- 
Regards,

Rishi Shah