You don't need the full set in RAM. You only need to keep the largest 100M in RAM, but you would need to keep them in a sorted data structure. If RAM is tight, you can keep keys only and then extract the data in a second pass. On Jan 22, 2014 4:36 AM, "Ted Dunning" <[email protected]> wrote:
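A minimal sketch of the "keep only the largest k in a sorted structure" idea, using a bounded min-heap so memory stays O(k) regardless of stream length. The function name `top_k` is my own; the thread does not prescribe an implementation:

```python
import heapq

def top_k(stream, k):
    """Keep only the k largest values seen so far in a min-heap.

    heap[0] is always the smallest of the current top-k, so each new
    value is compared against it in O(1) and, when larger, replaces
    it in O(log k). Memory stays O(k) however long the stream is.
    """
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k(range(1000), 5))  # -> [999, 998, 997, 996, 995]
```

For the "keys only" variant from the reply above, you would push (key, record-id) pairs instead of full records and fetch the winning records in a second pass.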
> On Tue, Jan 21, 2014 at 7:31 AM, <[email protected]> wrote:
>
>> You mentioned an approximate algorithm. That's great! I will check it out
>> later. But is there a way to calculate it in a precise way?
>
> If you want to select the 1% largest numbers, then you have a few choices.
>
> If you have memory for the full set, you can sort.
>
> If you have room to keep 1% of the samples in memory, you need to do 100
> passes.
>
> If you are willing to accept small errors, then you can do it in a single
> pass.
>
> These trade-offs are not optional, but are theorems.
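One way to realize the "small errors, single pass" option from the quoted message is to estimate the top-1% cutoff from a fixed-size reservoir sample. This is only an illustrative sketch (the function name and parameters are mine); a production approach would more likely use a quantile sketch such as t-digest:

```python
import random

def estimate_cutoff(stream, fraction=0.01, sample_size=10_000, seed=1):
    """Approximate the value cutting off the top `fraction` of a stream.

    One pass: maintain a uniform reservoir sample of sample_size items,
    then take the (1 - fraction) quantile of the sample as the cutoff.
    Memory is O(sample_size); the answer carries sampling error of
    roughly sqrt(fraction / sample_size) in quantile terms.
    """
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if len(sample) < sample_size:
            sample.append(x)
        else:
            # Classic reservoir sampling: keep x with prob sample_size/(i+1)
            j = rng.randrange(i + 1)
            if j < sample_size:
                sample[j] = x
    sample.sort()
    idx = min(len(sample) - 1, int(len(sample) * (1 - fraction)))
    return sample[idx]

# On 0..999,999 the true top-1% cutoff is 990,000; the estimate lands close.
print(estimate_cutoff(range(1_000_000)))
```

Values above the estimated cutoff can then be collected, either in the same pass (keeping a slightly generous margin) or in a cheap second pass, trading exactness for a single bounded-memory scan.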
