Re: Hive HLL for appx count distinct

Gopal Vijayaraghavan Wed, 30 Dec 2015 20:58:39 -0800

> In the hive-hll-udf, you seem to mention about RRD. Is that something
>supported by Hive?


No. RRDTool is what most people are replacing with Hive for storing
time-series data.

Raw RRDTool files on a local disk have no availability model (i.e., lose a
disk and you lose the data).

The rollup concept, however, is very powerful: you maintain distinct
aggregates of a time series (and throw out the expired ones), which is what
my example was:

last 30 days HLL + last 23 hours HLL + generate HLL over current_hour.

to count billions of distincts across them with a few megabytes of storage.
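
To make the rollup concrete, here is a minimal Python sketch of the idea. It
is not the hive-hll-udf code: it assumes a textbook HyperLogLog with 2**12
registers and SHA-1 as the hash, and every name in it (new_sketch, add,
merge, estimate, sketch_of, the synthetic user ids) is made up for
illustration. The point it shows is that HLLs merge by taking the
per-register maximum, so 30 daily + 23 hourly + 1 current-hour sketches
collapse into one estimate without rescanning the raw rows.

```python
import hashlib
import math

P = 12          # precision (assumed): 2**12 = 4096 registers per sketch
M = 1 << P


def _hash64(value: str) -> int:
    """64-bit hash of a value (first 8 bytes of SHA-1, an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def new_sketch() -> list:
    return [0] * M


def add(sketch: list, value: str) -> None:
    h = _hash64(value)
    idx = h >> (64 - P)                        # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)           # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1    # position of the leftmost 1-bit
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(*sketches: list) -> list:
    """Union of sketches: element-wise maximum of the registers."""
    return [max(regs) for regs in zip(*sketches)]


def estimate(sketch: list) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)         # small-range (linear counting) correction
    return raw


def sketch_of(ids) -> list:
    s = new_sketch()
    for i in ids:
        add(s, f"user-{i}")
    return s


# Synthetic, overlapping id ranges standing in for real traffic.
day_ids = [range(d * 200, d * 200 + 300) for d in range(30)]
hour_ids = [range(6000 + h * 50, 6000 + h * 50 + 100) for h in range(23)]
current_ids = range(7150, 7250)

daily = [sketch_of(r) for r in day_ids]        # stored: last 30 days
hourly = [sketch_of(r) for r in hour_ids]      # stored: last 23 hours
current = sketch_of(current_ids)               # generated over current_hour

total = merge(*daily, *hourly, current)
truth = set().union(*day_ids, *hour_ids, current_ids)
print(f"estimated {estimate(total):.0f} distinct users, exact {len(truth)}")
```

Expiring old data is then just dropping the oldest stored sketch before the
merge. With this register count, a sketch packed at 6 bits per register would
be about 3 KB, so the 54 sketches above stay well under a megabyte.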

This can then be extended further to build hundreds of bitsets per hour,
one for each tracked A/B experiment you want to collect stats on.

Cheers,
Gopal
