On Jul 18, 2012, at 9:07am, Pat Ferrel wrote:

> Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature looks
> like a good fit and seems to do about what you describe below. It seems a
> good idea to reorder the returned docs by some distance or similarity measure.
>
> The major problem you mention is extracting good terms; are you talking about
> creating the query or creating the Solr index?
Both, since they (must) use the same approach for the query to do a good job of matching against docs in the index. Often two-word phrases are great terms, but just as often they wind up being junk - "otherwise resolving", say - where the TF*IDF (or LLR) score is high enough to make it one of the top terms for a document, but it doesn't really capture anything about the meaning of the document, and thus is just noise as far as similarity is concerned.

-- Ken

> BTW RowSimilarity works so well for doc similarity I'm resisting taking it
> out, and will concentrate on reducing the size of the matrix it deals with to
> mitigate the scaling problems. For the realtime queries I think I'll look
> deeper into MoreLikeThis. In our use case we'll be taking the TF*IDF term
> weights from a doc and reweighting some terms based on a user gesture.
>
> On 7/17/12 8:22 PM, Ken Krugler wrote:
>> Hi Pat,
>>
>> On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
>>
>>> Interesting.
>>>
>>> I have another requirement, which is to do something like real-time
>>> vector-based queries. Imagine taking a doc vector, reweighting some terms,
>>> then doing a query with it, perhaps in a truncated form. There are several
>>> ways to do this, but only Solr would offer real-time results afaik. It
>>> looks like I could use your approach below to do this. A quick look at
>>> eDisMax, however, suggests some problems. The use of pf2 and pf3 would jam
>>> the query vector into synthesized bi- and trigrams, for instance.
>>
>> The simplistic approach I used was to extract the top 50 terms (with TF*IDF
>> weights) from the target document, then use those terms (with weights as
>> boosts) to do a regular Lucene OR query and request the top 20 hits.
>>
>> The index I'm searching against has Solr documents with a multi-valued field
>> that contains the top 50 terms, generated using the same approach as with
>> the target document. It also contains stored weights for each of those terms.
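The quoted approach (extract the top-N TF*IDF terms from the target document, then use them with their weights as boosts in an OR query) can be sketched roughly as below. This is a minimal illustration, not the actual code discussed in the thread: the tiny corpus, the `top_terms`/`boosted_or_query` helper names, and the smoothed-IDF formula are all assumptions for the sake of the example.

```python
import math
from collections import Counter

def top_terms(doc_tokens, corpus, n=50):
    """Score a document's terms by TF*IDF against a small corpus of
    term sets and return the top-n (term, weight) pairs."""
    tf = Counter(doc_tokens)
    num_docs = len(corpus)
    scored = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log((1 + num_docs) / (1 + df)) + 1  # smoothed IDF
        scored[term] = (count / len(doc_tokens)) * idf
    return sorted(scored.items(), key=lambda kv: -kv[1])[:n]

def boosted_or_query(terms):
    """Turn (term, weight) pairs into a boosted Lucene-style OR query,
    e.g. 'query^0.677 OR solr^0.339'."""
    return " OR ".join(f"{t}^{w:.3f}" for t, w in terms)

# Illustrative corpus and target document (assumed data).
corpus = [
    {"mahout", "vector", "similarity", "matrix"},
    {"solr", "query", "index", "boost"},
    {"hadoop", "cascading", "cluster"},
]
doc = ["solr", "query", "vector", "query", "boost"]
print(boosted_or_query(top_terms(doc, corpus, n=3)))
```

In practice the IDF values would come from the live index rather than a toy corpus, and the query would be sent to Solr as a regular OR query with per-term boosts, as described above.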
>>
>> I didn't use payload boosting, but I could have, to improve the quality of
>> this search - it seemed to be working well enough, and speed was pretty
>> important.
>>
>> Solr returns a sorted list of hits, and then I do a regular vector
>> similarity calculation between the target and each of these top 20 hits,
>> and select the best one (assuming it passes a similarity threshold).
>>
>>> I'd be interested in hearing more about how you use it. Is there a better
>>> venue than the Mahout list?
>>
>> If you'd like more details, that's probably better for an off-list
>> discussion…it doesn't feel very Mahout-ish in nature :)
>>
>> Though a discussion of the major problem (how to extract "good" terms from
>> the text) would be very interesting, as I wound up crafting what felt like
>> a kludgy pseudo-NLP solution.
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
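The re-ranking step described above (a vector similarity calculation between the target and each of the top 20 Solr hits, keeping the best one if it passes a threshold) might look something like this minimal sketch. The sparse term-weight dictionaries, the `best_hit` helper, and the 0.2 threshold are assumed values for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_hit(target, hits, threshold=0.2):
    """Re-rank hits by cosine similarity to the target vector and return
    the best one, or None if nothing passes the threshold."""
    score, hit = max(((cosine(target, h), h) for h in hits),
                     key=lambda sh: sh[0])
    return hit if score >= threshold else None

# Illustrative target vector and two candidate hits (assumed data).
target = {"solr": 0.8, "query": 0.5, "vector": 0.3}
hits = [
    {"hadoop": 0.9, "cluster": 0.4},
    {"solr": 0.7, "query": 0.6},
]
print(best_hit(target, hits))
```

The stored per-term weights in the multi-valued Solr field mentioned above would supply the hit vectors, so no re-analysis of the hit documents is needed at query time.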
