That is really hard to answer.  I would guess that it is hard to get much
gain for uniformly distributed points in reasonably high-dimensional
spaces (> 100 dimensions or so).  LSH, kd-trees and ball-trees all have
some difficulty at that dimensionality, and the real kicker is that the
branching logic and one-at-a-time operations they require are massively
slower than plain bulk dot products, especially if you can batch multiple
queries together.
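For a sense of what the trees are competing against, here is a minimal
brute-force sketch in Java (hypothetical names, not Mahout's API): the
corpus is a flat row-major array, so the scan is pure sequential memory
access and multiply-adds with no per-point branching.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Brute-force top-k by dot product over a flat row-major corpus.
    // The hot loop is sequential memory access plus multiply-adds;
    // nothing branches per point.
    public class BruteScan {
      static int[] topK(float[] corpus, int dim, float[] query, int k) {
        int n = corpus.length / dim;
        float[] score = new float[n];
        for (int i = 0, base = 0; i < n; i++, base += dim) {
          float dot = 0f;
          for (int j = 0; j < dim; j++) {
            dot += corpus[base + j] * query[j];
          }
          score[i] = dot;
        }
        // Keep the k highest-scoring indices with a min-heap.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(k, Comparator.comparingDouble(i -> score[i]));
        for (int i = 0; i < n; i++) {
          if (heap.size() < k) {
            heap.add(i);
          } else if (score[i] > score[heap.peek()]) {
            heap.poll();
            heap.add(i);
          }
        }
        int[] best = new int[heap.size()];
        for (int i = best.length - 1; i >= 0; i--) {
          best[i] = heap.poll();   // heap yields worst-first; fill backwards
        }
        return best;
      }
    }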

On the other hand, real data are almost never uniformly distributed, so you
might wind up with very good speed gains from any of these techniques.
Theory is difficult to apply in this case.

On Mon, Nov 14, 2011 at 6:48 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Btw are there any known limitations on using in-memory kd-trees to do kNN
> scans?
>
> Their operational characteristics don't seem that bad.  Probably not as
> good as some LSH-based schemes, but they should still be quite good for
> many practical uses where N is in the millions and k is perhaps a
> thousand or so.
>
> For cosine and Tanimoto stuff, it may need a preliminary conversion to
> the hypersphere (see the sketch below).
>
> Also, with some improvements (max-distance scan, best-bin-first search),
> perhaps it can get to the point of being somewhat acceptable?
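For reference, the conversion mentioned above is just scaling every vector
to unit length: for unit vectors, ||a - b||^2 = 2 - 2 * dot(a, b), so
Euclidean nearest neighbors in a kd-tree are exactly the cosine nearest
neighbors.  Tanimoto on unit vectors is likewise a monotone function of
the dot product (dot / (2 - dot)), so the same trick covers it.  A minimal
sketch (hypothetical name, assuming the flat row-major layout from above):

    // Scale each row of a row-major corpus to unit L2 norm in place.
    // For unit vectors ||a - b||^2 = 2 - 2 * dot(a, b), so Euclidean
    // distance then ranks points exactly like cosine similarity.
    static void normalizeRows(float[] corpus, int dim) {
      for (int base = 0; base < corpus.length; base += dim) {
        double normSq = 0;
        for (int j = 0; j < dim; j++) {
          normSq += (double) corpus[base + j] * corpus[base + j];
        }
        double scale = normSq > 0 ? 1.0 / Math.sqrt(normSq) : 0.0;
        for (int j = 0; j < dim; j++) {
          corpus[base + j] *= scale;   // implicit narrowing back to float
        }
      }
    }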
>  On Nov 13, 2011 10:10 PM, "Ted Dunning" <[email protected]> wrote:
>
> > That handles coherent.
> >
> > It doesn't handle usable.
> >
> > Storing the vectors as binary payloads handles the situation for
> > projection-like applications, but that doesn't help retrieval.
> >
> > The problem with retrieval is that you need to do a dot product against
> > all vectors.  That used to require keeping a low-resolution version of
> > all of the vectors in memory and implementing a very efficient dot
> > product for speed.  If you can batch up several queries (10 or more at a
> > time), you can get almost an order of magnitude improvement in retrieval
> > speed, since the queries are almost always memory-bound on the corpus
> > vectors.
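Concretely, the batching works because each corpus vector is pulled through
the memory hierarchy once per pass and reused against every query in the
batch, so B queries cost roughly one query's worth of memory traffic.  A
rough sketch (hypothetical names, same flat layout as above):

    // Score a batch of queries in one pass over the corpus.  Each corpus
    // row is loaded from RAM once and reused against every query while it
    // is still hot, so B queries cost about one query's memory traffic.
    static float[][] scoreBatch(float[] corpus, int dim, float[][] queries) {
      int n = corpus.length / dim;
      float[][] scores = new float[queries.length][n];
      for (int i = 0, base = 0; i < n; i++, base += dim) {
        for (int q = 0; q < queries.length; q++) {
          float[] query = queries[q];
          float dot = 0f;
          for (int j = 0; j < dim; j++) {
            dot += corpus[base + j] * query[j];
          }
          scores[q][i] = dot;
        }
      }
      return scores;
    }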
> >
> > With modern memory sizes, you can definitely hold some pretty big
> > corpora.  With 100 dimensions of 4-byte floats, each document takes 400
> > bytes, so you can store roughly two documents per kB, or about 2 million
> > documents per GB.  The problem really has more to do with retrieval
> > speed.  While the ALU can do multiplications and additions at a pretty
> > high rate, main memory can typically only be streamed at a rate of a GB
> > per second or so (or a few times that), which means that retrieval time
> > is going to be a significant fraction of a second or more for many
> > corpora.  That might be good enough, but high-volume search engines
> > typically need search times of 10 ms or better for the raw search
> > engine, which doesn't allow for many vectors even if I am an order of
> > magnitude off.
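To put numbers on that, using the figures above: at 1 GB/s, a 10 ms budget
lets you stream about 10 MB, and at 400 bytes per vector that is only about
25,000 vectors per query, several orders of magnitude short of millions of
documents even allowing for batching and somewhat faster memory.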
> >
> > The issue of memory bandwidth isn't improving very quickly, so there
> > isn't much hope of making this better.  A good inverted index running
> > out of memory can handle a LOT of documents and still meet the 10 ms
> > goal.  This means that LSI as a raw search engine on large corpora is
> > going to be less cost-effective by a couple of orders of magnitude.
> > That leaves lots of room to use synthetic tokens, query expansion and
> > other magic tricks in the original inverted-index search engine to
> > achieve similar effects at much less cost.
> >
> >
> > On Sun, Nov 13, 2011 at 3:56 PM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > Store the vectors as binary payloads and keep the projection matrix
> > > in memory with the queryBuilder, to add an "lsi cosine" scoring
> > > feature between query and doc?
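A rough sketch of the scoring half of that idea (generic Java; the Lucene
payload storage and decoding are deliberately omitted, and all names here
are hypothetical): project the raw query through the in-memory matrix, then
score cosine against the document vector that would be decoded from its
binary payload.

    // Project the raw query through the in-memory projection matrix, then
    // score cosine against the per-document LSI vector that would come
    // out of the stored binary payload.
    static float lsiCosine(float[][] projection, float[] rawQuery,
                           float[] docVec) {
      int dims = projection.length;            // latent dimensions
      float[] q = new float[dims];
      for (int i = 0; i < dims; i++) {
        float dot = 0f;
        for (int j = 0; j < rawQuery.length; j++) {
          dot += projection[i][j] * rawQuery[j];
        }
        q[i] = dot;
      }
      double dot = 0, qq = 0, dd = 0;
      for (int i = 0; i < dims; i++) {
        dot += q[i] * docVec[i];
        qq  += q[i] * q[i];
        dd  += docVec[i] * docVec[i];
      }
      return (qq == 0 || dd == 0) ? 0f : (float) (dot / Math.sqrt(qq * dd));
    }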
> > >
> > > On Nov 13, 2011 12:59 PM, "Ted Dunning" <[email protected]> wrote:
> > >
> > > Essentially not.
> > >
> > > And I would worry about how to push the LSI vectors back into lucene
> > > in a coherent and usable way.
> > >
> > > On Sun, Nov 13, 2011 at 10:47 AM, Sebastian Schelter <[email protected]>
> > > wrote:
> > > > Is there some docum...
> > >
> >
>
