Thanks a lot, everyone! This is really helpful. I only expect to get the top k items out of an n-sized vector. The heap should be small enough to fit in memory and (for my case) it might be an overkill to try to parallelize it. I will implement this using the algorithms that were suggested here. I also think it would be great if there was a more general class that could be instantiated with a vector, but for my specific case it's probably not needed.
Again, thanks everyone. I'm surprised of the level of engagement of this community. This is definitely a huge plus to our decision to go with Mahout. Thanks, Julian 2011/4/25 Ted Dunning <[email protected]> > The simplest implementation is O(n log(k)) (inserts every element, then > drops at most one element). > > Caching the k-th best element and only inserting new items has the same > worst case behavior in ordered cases. > > There is a true O(n + k) algorithm based on quicksort (I think). > > On Mon, Apr 25, 2011 at 2:26 PM, Jake Mannix <[email protected]> > wrote: > > > On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <[email protected]> > > wrote: > > > > > > > > > This is O(n + k*log(k)) for a randomly sorted list. > > > > > > > This is a bit sloppy: there's a multiplicative factor of log(n) also in > the > > second > > additive factor, but that's still linear overall. > > > > -jake > > >
