But then again i think i misunderstood the problem the second time...
On Mon, Apr 25, 2011 at 2:59 PM, Dmitriy Lyubimov <[email protected]> wrote: > oh never mind. > > that's trivial. As Ted mentioned, i perhaps by mistake assumed the > problem is to find most frequently used queries. > > if he just needs top N with the highest similarity score...well... > that's kind of a problem i am solving for LSI over hbase right now... > I don't want to disclose exactly how, or Ted will say that's not the > way :) But, there are definitely ways to organize the vector space > model to find N closest without scanning the entire vector set. > > -d > > > > On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <[email protected]> wrote: >> On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <[email protected]> wrote: >> >>> if you "forget" some of the items you saw along with the counters, >>> how'd you add them back to the heap? >>> >> >> What do you mean? This is the common search problem: you're iterating over >> a list of int, double pairs (int is "docId", double is the "score"), and you >> want to >> keep the topK out of all N of them. Keep a size-K heap of docId/score >> pairs, >> and as you iterate, check to see if your current one beats your current K'th >> best >> result - if not, you know it will *never* be in the overall topK, and it can >> be discarded >> (O(1) operation, this happens most of the time if K << n, and the list is >> randomly ordered). If it beats the bottom of the heap, insert it (O(log(K) >> operation), >> dropping the old bottom of the heap. >> >> This is O(n + k*log(k)) for a randomly sorted list. >> >> -jake >> >
