We did something related for a recent project. Basically:

- Build a Lucene index using transformed data
- Build the search query using similar transformations
- Then take the top N, and do a more expensive scoring calculation
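A minimal sketch of that two-stage approach (a cheap first-pass score to pick the top N, then an expensive rescoring of just those N); everything here is illustrative stand-in code, not the actual project's Lucene setup:

```java
import java.util.*;

public class TwoStageRanker {

    // Cheap first-pass score (stand-in for the Lucene retrieval score):
    // a plain dot product over the transformed vectors.
    static double cheapScore(double[] doc, double[] query) {
        double dot = 0;
        for (int i = 0; i < doc.length; i++) dot += doc[i] * query[i];
        return dot;
    }

    // Expensive second-pass score (stand-in for the costly calculation):
    // cosine similarity, which also needs the vector norms.
    static double expensiveScore(double[] doc, double[] query) {
        double dot = 0, dn = 0, qn = 0;
        for (int i = 0; i < doc.length; i++) {
            dot += doc[i] * query[i];
            dn  += doc[i] * doc[i];
            qn  += query[i] * query[i];
        }
        return dot / (Math.sqrt(dn) * Math.sqrt(qn) + 1e-12);
    }

    // Rank all docs cheaply, keep the top n, rescore only those expensively.
    static int[] rank(double[][] docs, double[] query, int n) {
        Integer[] ids = new Integer[docs.length];
        for (int i = 0; i < docs.length; i++) ids[i] = i;
        Arrays.sort(ids, (a, b) ->
            Double.compare(cheapScore(docs[b], query), cheapScore(docs[a], query)));
        Integer[] top = Arrays.copyOf(ids, Math.min(n, ids.length));
        Arrays.sort(top, (a, b) ->
            Double.compare(expensiveScore(docs[b], query), expensiveScore(docs[a], query)));
        int[] out = new int[top.length];
        for (int i = 0; i < out.length; i++) out[i] = top[i];
        return out;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {1, 0, 0},     // doc 0: unrelated to the query
            {10, 10, 10},  // doc 1: big dot product, mediocre cosine
            {0, 2, 2},     // doc 2: smaller dot product, perfect cosine
        };
        double[] query = {0, 1, 1};
        System.out.println(Arrays.toString(rank(docs, query, 2)));  // prints [2, 1]
    }
}
```

Note how the expensive pass reorders the candidates: doc 1 wins the cheap dot-product pass, but doc 2 comes out on top after the cosine rescoring.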
In the end, after much tweaking, it worked well - able to handle 1000 queries/sec on a biggish AWS box, by keeping everything in memory.

-- Ken

On Nov 13, 2011, at 10:18pm, Jake Mannix wrote:

> On Sun, Nov 13, 2011 at 10:09 PM, Ted Dunning <[email protected]> wrote:
>
>> That handles coherent.
>>
>> It doesn't handle usable.
>>
>> Storing the vectors as binary payloads handles the situation for
>> projection-like applications, but that doesn't help retrieval.
>
> It's not just projection, it's for added relevance: if you are already doing
> Lucene for your scoring needs, you are already getting some good precision
> and recall.
>
> The idea is this: you take results you are *already* scoring, and add to that
> scoring function an LSI cosine as one feature among many. Hopefully it
> will improve precision, even if it will do nothing for recall (as it's only
> being applied to results already retrieved by the text query).
>
> Alternatively, to improve recall, at index time, supplement each document
> with terms in a new field "lsi_expanded" which are the terms closest in the
> SVD-projected space to the document but aren't already in it. Then at
> query time, add an "... OR lsi_expanded:<query>" clause onto your query.
> Instant query expansion for recall enhancement.
>
> Or do both, and play with both your precision and recall.
>
> -jake

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
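A rough sketch of the LSI-cosine-as-a-feature idea from the quoted discussion: project the doc and query term vectors into a k-dimensional SVD space, take the cosine there, and blend it into an existing score. The projection matrix Vk stands in for the top-k right singular vectors (in practice produced by something like Mahout's SVD); the matrix values, vectors, and 0.3 blend weight are all made-up placeholders:

```java
import java.util.*;

public class LsiRescorer {

    // Project a term-space vector into the k-dimensional LSI space.
    static double[] project(double[] termVec, double[][] Vk) {
        double[] out = new double[Vk.length];  // one entry per latent dimension
        for (int k = 0; k < Vk.length; k++)
            for (int t = 0; t < termVec.length; t++)
                out[k] += Vk[k][t] * termVec[t];
        return out;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Existing (e.g. Lucene) score plus the LSI cosine as one more feature;
    // the 0.3 weight is an arbitrary placeholder you would tune.
    static double blended(double baseScore, double[] docTerms,
                          double[] queryTerms, double[][] Vk) {
        double lsi = cosine(project(docTerms, Vk), project(queryTerms, Vk));
        return baseScore + 0.3 * lsi;
    }

    public static void main(String[] args) {
        // Toy k=2 projection: terms {0,1} load on dim 0, terms {2,3} on dim 1.
        double[][] Vk = {{0.7, 0.7, 0, 0}, {0, 0, 0.7, 0.7}};
        double[] doc   = {1, 0, 1, 0};  // terms 0 and 2
        double[] query = {0, 1, 0, 1};  // terms 1 and 3 - no raw overlap
        System.out.printf("%.3f%n", blended(2.0, doc, query, Vk));  // prints 2.300
    }
}
```

The toy example makes the point: doc and query share no raw terms, yet their LSI cosine is 1.0, so the blended score gets the full boost. That same nearest-in-projected-space notion is what would populate the "lsi_expanded" field in the recall-oriented variant.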
