oh never mind. that's trivial. As Ted mentioned, i perhaps by mistake assumed the problem is to find most frequently used queries.
if he just needs top N with the highest similarity score...well... that's kind of a problem i am solving for LSI over hbase right now... I don't want to disclose exactly how, or Ted will say that's not the way :) But, there are definitely ways to organize the vector space model to find N closest without scanning the entire vector set. -d On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <[email protected]> wrote: > On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <[email protected]> wrote: > >> if you "forget" some of the items you saw along with the counters, >> how'd you add them back to the heap? >> > > What do you mean? This is the common search problem: you're iterating over > a list of int, double pairs (int is "docId", double is the "score"), and you > want to > keep the topK out of all N of them. Keep a size-K heap of docId/score > pairs, > and as you iterate, check to see if your current one beats your current K'th > best > result - if not, you know it will *never* be in the overall topK, and it can > be discarded > (O(1) operation, this happens most of the time if K << n, and the list is > randomly ordered). If it beats the bottom of the heap, insert it (O(log(K) > operation), > dropping the old bottom of the heap. > > This is O(n + k*log(k)) for a randomly sorted list. > > -jake >
