Hi Ted,

Sorry for my ambiguous question. I did mean the top 1% based on the score attached to the 10 billion tuples. You mentioned an approximate algorithm. That's great! I will check it out later. But is there a way to calculate it precisely?

Thanks.
Tuesday, January 21, 2014, 20:54 +0800 from Ted Dunning <[email protected]>:

> Top what? Most frequent? Or the top 1% based on some score attached to the tuples?
>
> The latter is trivial. The former less so.
>
> If you have the score problem, you just need to use an approximate quantile algorithm like t-digest to get a continuous estimate of the 99th percentile.
>
> For the most frequent problem there are other approximation algorithms, but you may have some issues with the number of hits that you are looking for.
>
>> On Jan 20, 2014, at 21:58, churly lin <[email protected]> wrote:
>>
>> Hi all:
>> Recently this question occurred to me: how to compute the top 100 million
>> of the total 10 billion records efficiently using Storm?
>>
>> The total 10 billion records are the input of the topology, with the top 100
>> million records as output.
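
On the question of a precise answer: one possible approach, sketched here only as an illustration and not as anything proposed in this thread, is to have each worker keep its own top k by score in a bounded min-heap and then merge the partial results. This is exact because every record in the global top k must also be in the top k of whichever partition it landed in. The record shape `(score, payload)` and the function names `local_top_k` and `merge_top_k` are hypothetical, and the code is plain Python rather than the Storm API; in a topology the first step would presumably live in parallel bolts and the second in a single final bolt.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical record shape: (score, payload). Tuples compare by score first.
Record = Tuple[float, str]


def local_top_k(records: Iterable[Record], k: int) -> List[Record]:
    """Return the k highest-scoring records of one partition (unordered).

    Uses a min-heap that never grows beyond k entries, so the smallest
    retained record is always at heap[0].
    """
    heap: List[Record] = []
    for rec in records:
        if len(heap) < k:
            heapq.heappush(heap, rec)
        elif rec > heap[0]:
            # rec beats the current smallest survivor, so swap it in.
            heapq.heapreplace(heap, rec)
    return heap


def merge_top_k(partials: Iterable[List[Record]], k: int) -> List[Record]:
    """Merge per-partition results into the exact global top k, best first."""
    merged = (rec for part in partials for rec in part)
    return heapq.nlargest(k, merged)


if __name__ == "__main__":
    # Toy data standing in for two partitions of the input stream.
    partitions = [
        [(0.9, "a"), (0.1, "b"), (0.5, "c")],
        [(0.8, "d"), (0.95, "e"), (0.2, "f")],
    ]
    partials = [local_top_k(p, 2) for p in partitions]
    print(merge_top_k(partials, 2))  # [(0.95, 'e'), (0.9, 'a')]
```

Note that with k = 100 million each worker may have to hold up to 100 million records, and the merge step sees up to k records per worker, so memory is the main cost of being exact.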
