Re[2]: Compute the top 100 million in the total 10 billion data efficiently.

churylin Tue, 21 Jan 2014 05:33:16 -0800

Hi Ted,
Sorry for my ambiguous question. I did mean the top 1% based on the score 
attached to the 10 billion tuples. 
You mentioned an approximate algorithm. That's great! I will check it out 
later. But is there a way to calculate it precisely?
Thanks.

Sent from  myMail for iOS

Tuesday, 21 January 2014, 20:54 +0800 from Ted Dunning <[email protected]>:
Top what?

Most frequent? Or the top 1% based on some score attached to the tuples?

The latter is trivial. The former less so.

If you have the score problem, you just need to use an approximate quantile 
algorithm like t-digest to get a continuous estimate of the 99th percentile.
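
A rough Python illustration of the score-based approach described above. Note that this sketch does not use t-digest: it swaps in a simpler stand-in, a uniform reservoir sample, to estimate the 99th-percentile threshold, and then makes a second pass that keeps every tuple scoring at or above that threshold. The function names, `sample_size`, and the re-readable `make_stream` callable are assumptions made for illustration and are not part of Storm or of any library.

```python
import random


def reservoir_sample(scores, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, score in enumerate(scores):
        if i < k:
            sample.append(score)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = score
    return sample


def estimate_quantile(sample, q):
    """Return the empirical q-quantile of a sample (nearest rank)."""
    ordered = sorted(sample)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def approx_top_fraction(make_stream, fraction=0.01, sample_size=10_000, seed=0):
    """Two passes: estimate the cutoff score, then filter by it.

    make_stream is a hypothetical callable returning a fresh iterator
    over the scores each time it is called.
    """
    rng = random.Random(seed)
    sample = reservoir_sample(make_stream(), sample_size, rng)
    threshold = estimate_quantile(sample, 1.0 - fraction)
    top = [s for s in make_stream() if s >= threshold]
    return threshold, top
```

Because the threshold is only an estimate, the output holds roughly 1% of the input rather than exactly 1%. In a streaming setting the threshold would be updated continuously instead of being computed in a separate first pass.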

For the most frequent problem there are other approximation algorithms but you 
may have some issues with the number of hits that you are looking for.
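
As one example of such an approximation algorithm, here is a minimal Python sketch of the Misra-Gries frequent-items summary. The choice of Misra-Gries is an assumption; the message above does not name a specific algorithm. With at most `k - 1` counters, every item that occurs more than `n / k` times in a stream of `n` items is guaranteed to survive in the summary, and the stored counts are lower bounds on the true counts.

```python
def misra_gries(stream, k):
    """Return candidate frequent items using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Reporting 100 million frequent items this way would need at least that many counters, which may be the kind of issue with the number of hits mentioned above.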

Sent from my iPhone

> On Jan 20, 2014, at 21:58, churly lin < [email protected] > wrote:
>
> Hi all:
> Recently this question occurred to me: how can I compute the top 100 million 
> of 10 billion total records efficiently using Storm?
>
> The 10 billion records are the input to the topology, and the top 100 
> million records are the output.
