Top what? The most frequent records? Or the top 1% based on some score attached to the tuples?
The latter is trivial. The former less so.

If you have the score problem, you just need to use an approximate quantile algorithm like t-digest to keep a running estimate of the 99th percentile (100 million out of 10 billion is the top 1%). Records that score above that estimate are your output.

For the most-frequent problem there are other approximation algorithms, but you may have some issues with the number of hits that you are looking for.

> On Jan 20, 2014, at 21:58, churly lin <[email protected]> wrote:
>
> Hi all:
> Recently, This question occurs to me, how to compute the top 100 million in
> the total 10 billion records efficiently using Storm?
>
> The total 10 billion records is the input of topology with the top 100
> million records as output.
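
To make the score-based case concrete, here is a rough Python sketch of the idea: estimate the 99th percentile of the scores, then keep every record at or above it. This is only an illustration under assumptions. It uses plain reservoir sampling as a stand-in for t-digest, it is not Storm code, and the names estimate_threshold and top_fraction are made up for this example.

```python
import random


def estimate_threshold(scores, quantile=0.99, sample_size=10_000, seed=0):
    """Estimate a quantile of a score stream from a uniform random sample.

    Reservoir sampling is used here as a simple stand-in for a real
    quantile sketch such as t-digest (an assumption of this example).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, score in enumerate(scores):
        if i < sample_size:
            # Fill the reservoir with the first sample_size scores.
            reservoir.append(score)
        else:
            # Replace a random slot with probability sample_size / (i + 1).
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = score
    if not reservoir:
        raise ValueError("empty stream")
    reservoir.sort()
    idx = min(len(reservoir) - 1, int(quantile * len(reservoir)))
    return reservoir[idx]


def top_fraction(records, threshold, key):
    """Yield the records whose score is at or above the threshold."""
    return (r for r in records if key(r) >= threshold)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical records: (id, score) pairs with uniform random scores.
    records = [(i, rng.random()) for i in range(200_000)]
    cutoff = estimate_threshold(score for _, score in records)
    kept = list(top_fraction(records, cutoff, key=lambda r: r[1]))
    print(f"cutoff={cutoff:.4f}, kept {len(kept)} of {len(records)}")
```

Because the cutoff is only an estimate, the output will contain roughly 1% of the records rather than exactly 100 million. In a topology you would presumably keep one sketch per worker and merge them, which t-digest is designed to support.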
