Re: Compute the top 100 million in the total 10 billion data efficiently.

Ted Dunning Tue, 21 Jan 2014 04:55:34 -0800

Top what?

Most frequent?  Or the top 1% based on some score attached to the tuples?

The latter is trivial. The former less so. 

If you have the score problem, you just need to use an approximate quantile 
algorithm like t-digest to get a continuous estimate of the 99th percentile 
(100 million out of 10 billion is the top 1%), and then keep the tuples that 
score above it. 
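
A rough sketch of the score case in Python, in case it helps. Note the 
assumptions: this does NOT use t-digest. It swaps in a plain uniform 
reservoir sample (standard library only) as a cruder way to estimate the 
99th percentile, and the names estimate_threshold and top_by_score, the 
sample size, and the (id, score) record layout are all made up for 
illustration. 

```python
import random


def estimate_threshold(scores, top_fraction=0.01, sample_size=10_000, seed=0):
    """Estimate the score cutoff for the top `top_fraction` of a stream.

    Keeps a fixed-size uniform reservoir sample of the scores seen so far
    and reads the cutoff off the sorted sample. This is a stand-in for a
    real quantile sketch such as t-digest, not an implementation of one.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, score in enumerate(scores):
        if i < sample_size:
            reservoir.append(score)
        else:
            # Replace a random slot with probability sample_size / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < sample_size:
                reservoir[j] = score
    if not reservoir:
        raise ValueError("cannot estimate a threshold from an empty stream")
    reservoir.sort()
    idx = min(len(reservoir) - 1, int(len(reservoir) * (1 - top_fraction)))
    return reservoir[idx]


def top_by_score(records, threshold):
    """Yield the (id, score) records whose score is at or above the threshold."""
    return (rec for rec in records if rec[1] >= threshold)
```

Because the cutoff is only an estimate, the number of records that pass the 
filter will be close to 1% of the input rather than exactly 100 million. 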

For the most-frequent problem there are other approximation algorithms, but you 
may have some issues with the number of hits that you are looking for: 100 
million is far more items than such algorithms are usually asked to track.  
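
To make the most-frequent case concrete, here is a minimal sketch of one 
such approximation algorithm, the Misra-Gries frequent-items summary. The 
choice of Misra-Gries is mine for illustration, and the function name and 
parameter k are hypothetical. With k - 1 counters it is guaranteed to 
retain every item that occurs more than n/k times in a stream of n items, 
and each reported count is low by at most n/k. 

```python
def misra_gries(stream, k):
    """Return approximate counts for the frequent items in `stream`.

    Uses at most k - 1 counters. Any item occurring more than n/k times
    (n = stream length) is guaranteed to be among the returned keys, and
    each returned count underestimates the true count by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The catch for this thread is the counter budget: the summary can never 
return more items than it has counters, so asking for 100 million hits 
means holding at least 100 million counters somewhere. 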

Sent from my iPhone

> On Jan 20, 2014, at 21:58, churly lin <[email protected]> wrote:
> 
> Hi all:
> Recently this question occurred to me: how can I compute the top 100 million 
> of 10 billion records efficiently using Storm?
> 
> The 10 billion records are the input to the topology, and the top 100 
> million records are the output.
