Even with Redis you'll need to maintain the sliding window yourself.

Does it need to be exact? If you only need to estimate the number of
distinct users seen in a sliding window, use the HyperLogLog (HLL) data
structure together with a ring buffer. It's fast, memory efficient, and
accurate enough for most purposes. For example, allocate 60 HyperLogLog
structures for 60 minutes (one per minute) and use a ring buffer to
maintain the sliding window. When you want the total count, merge all the
HyperLogLog structures and extract the cardinality. The result isn't
exact, but it's close, and the error can be tuned to your precision
requirements.
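
For illustration, here is a minimal sketch of that ring buffer of HLLs.
It assumes the stream-lib HyperLogLog class
(com.clearspring.analytics.stream.cardinality); that library choice, the
log2m value, and the once-per-minute tick() scheduling are my assumptions
for the example, not requirements:

import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

// One HLL per minute; a ring buffer index points at the current minute.
public class SlidingWindowUniques {
    private static final int SLOTS = 60;   // 60 one-minute buckets
    private static final int LOG2M = 12;   // 2^12 registers, roughly 1.6% error
    private final HyperLogLog[] buckets = new HyperLogLog[SLOTS];
    private int head = 0;

    public SlidingWindowUniques() {
        for (int i = 0; i < SLOTS; i++) {
            buckets[i] = new HyperLogLog(LOG2M);
        }
    }

    // Record a user id in the current minute's bucket.
    public void add(String userId) {
        buckets[head].offer(userId);
    }

    // Call once per minute: advance the head, dropping the oldest bucket.
    public void tick() {
        head = (head + 1) % SLOTS;
        buckets[head] = new HyperLogLog(LOG2M);
    }

    // Approximate number of distinct users over the last hour.
    public long count() throws CardinalityMergeException {
        ICardinality merged = new HyperLogLog(LOG2M).merge(buckets);
        return merged.cardinality();
    }
}

Memory is fixed at 60 small registers regardless of how many users flow
through, which is exactly what makes this approach scale where the
HashSet below does not.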

Twitter's Algebird package has a monoid implementation of the HLL
algorithm
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala#L353
which means you can merge many HLLs into one, and that should be all you
need. The Redis HLL also lets you merge HLLs. Please note that using the
Redis HLL will make it harder to implement a transactional topology. If
you ever want that, I suggest you implement the algorithm above and
serialize/deserialize the HLLs in your TridentState.
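
On the Redis side, PFADD/PFCOUNT/PFMERGE give you the same merge
behaviour server-side. A quick sketch using the Jedis client (the client
choice and the key naming scheme are illustrative assumptions):

import redis.clients.jedis.Jedis;

public class RedisHllExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // The topology PFADDs each user id into the current minute's key.
            jedis.pfadd("uniques:minute:42", "user-1", "user-2", "user-3");

            // Collect the 60 per-minute keys covering the window.
            String[] window = new String[60];
            for (int i = 0; i < 60; i++) {
                window[i] = "uniques:minute:" + i;
            }

            // PFCOUNT over multiple keys merges them on the fly;
            // PFMERGE would instead materialize the union in a destination key.
            long distinct = jedis.pfcount(window);
            System.out.println("distinct users in the last hour: " + distinct);
        }
    }
}

Expiring each per-minute key after an hour (EXPIRE) keeps the window
bounded without any explicit ring buffer on the client.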

If you want a more precise window, just increase the number of buckets
(for example, 3600 one-second buckets instead of 60 one-minute buckets).
You may also be able to adapt this exponential histogram sliding-window
algorithm to your needs:
http://www-cs-students.stanford.edu/~datar/papers/sicomp_streams.pdf


On Wed, Jul 16, 2014 at 1:21 PM, Danijel Schiavuzzi <[email protected]>
wrote:

> Take a look at a distributed data structure server, for example Redis.
> There are various Storm integrations available.
>
> On Monday, July 14, 2014, 唐思成 <[email protected]> wrote:
>
>> The use case is simple: count unique users in a sliding window. The
>> common solution I found on the Internet is to use a HashSet to filter
>> out duplicate users, like this:
>>
>> import java.util.Collections;
>> import java.util.HashSet;
>> import java.util.Set;
>>
>> import storm.trident.operation.BaseFilter;
>> import storm.trident.tuple.TridentTuple;
>>
>> public class Distinct extends BaseFilter {
>>     private static final long serialVersionUID = 1L;
>>
>>     // Every id ever seen is kept in memory -- the problem described below.
>>     private Set<String> distincter =
>>             Collections.synchronizedSet(new HashSet<String>());
>>
>>     @Override
>>     public boolean isKeep(TridentTuple tuple) {
>>         String id = this.getId(tuple);
>>         return distincter.add(id);  // keep the tuple only the first time
>>     }
>>
>>     public String getId(TridentTuple t) {
>>         StringBuilder sb = new StringBuilder();
>>         for (int i = 0; i < t.size(); i++) {
>>             sb.append(t.getString(i));
>>         }
>>         return sb.toString();
>>     }
>> }
>>
>> However, the HashSet is stored in memory; when the data grows very
>> large, it will cause an OOM. So is there a scalable solution?
>>
>> 2014-07-14
>> ------------------------------
>> 唐思成
>>
>
>
> --
> Danijel Schiavuzzi
>
> E: [email protected]
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7
>
