oh never mind.

that's trivial. As Ted mentioned, i perhaps by mistake assumed the
problem is to find most frequently used queries.

if he just needs top N with the highest similarity score...well...
that's kind of a problem i am solving for LSI over hbase right now...
I don't want to disclose exactly how, or Ted will say that's not the
way :) But, there are definitely ways to organize the vector space
model to find N closest without scanning the entire vector set.

-d



On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <[email protected]> wrote:
> On Mon, Apr 25, 2011 at 2:03 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> if you "forget" some of the items you saw along with the counters,
>> how'd you add them back to the heap?
>>
>
> What do you mean?  This is the common search problem: you're iterating over
> a list of int, double pairs (int is "docId", double is the "score"), and you
> want to
> keep the topK out of all N of them.  Keep a size-K heap of docId/score
> pairs,
> and as you iterate, check to see if your current one beats your current K'th
> best
> result - if not, you know it will *never* be in the overall topK, and it can
> be discarded
> (O(1) operation, this happens most of the time if K << n, and the list is
> randomly ordered).  If it beats the bottom of the heap, insert it (O(log(K)
> operation),
> dropping the old bottom of the heap.
>
> This is O(n + k*log(k)) for a randomly sorted list.
>
>  -jake
>

Reply via email to