Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature looks like a good fit and seems to do about what you describe below. It seems a good idea to reorder the returned docs by some distance or similarity measure.

About the major problem you mention, extracting good terms: are you talking about creating the query or creating the Solr index?

BTW RowSimilarity works so well for doc similarity that I'm resisting taking it out, and will concentrate on reducing the size of the matrix it deals with to mitigate the scaling problems. For the real-time queries I think I'll look deeper into MoreLikeThis. In our use case we'll be taking the TF-IDF term weights from a doc and reweighting some terms based on a user gesture.
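
A minimal sketch of that reweighting step, assuming the doc vector is available as a plain term-to-weight mapping. The names `reweight` and `gesture_boosts`, and the idea of expressing a user gesture as per-term multipliers, are illustrative assumptions rather than anything taken from Mahout or Solr:

```python
def reweight(doc_vector, gesture_boosts, top_n=50):
    """Scale selected term weights, then truncate to the top_n heaviest terms.

    doc_vector:     dict of term -> TF-IDF weight (assumed input shape)
    gesture_boosts: dict of term -> multiplier derived from a user gesture
                    (hypothetical; terms not listed keep their weight)
    """
    adjusted = {
        term: weight * gesture_boosts.get(term, 1.0)
        for term, weight in doc_vector.items()
    }
    # Sort by adjusted weight, heaviest first, and keep only the top_n terms.
    ranked = sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])


# Made-up weights, purely for illustration.
doc_vector = {"solr": 0.8, "mahout": 0.6, "hadoop": 0.3, "lucene": 0.5}
query_vector = reweight(doc_vector, {"hadoop": 3.0, "solr": 0.5}, top_n=3)
```

The truncated result could then serve as the query vector for whatever real-time search ends up being used.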


On 7/17/12 8:22 PM, Ken Krugler wrote:
Hi Pat,

On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:

Interesting.

I have another requirement, which is to do something like real-time,
vector-based queries. Imagine taking a doc vector, reweighting some terms, then
doing a query with it, perhaps in a truncated form. There are several ways to do
this, but only Solr would offer real-time results afaik. It looks like I
could use your approach below to do this. A quick look at eDisMax, however,
suggests some problems. The use of pf2 and pf3 would jam the query vector into
synthesized bigrams and trigrams, for instance.

The simplistic approach I used was to extract the top 50 terms (with TF*IDF 
weights) from the target document, then use those terms (with weights as boosts) to 
do a regular Lucene OR query & request the top 20 hits.
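
A rough sketch of how such a boosted OR query string might be assembled, using the classic Lucene query syntax (`term^boost` clauses joined by `OR`). The field name `top_terms` and both helper functions are hypothetical, and the sketch assumes each term is a single token with no whitespace:

```python
import re

# Characters with special meaning in the classic Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(term):
    """Backslash-escape Lucene query syntax characters in a single term."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


def build_boosted_or_query(field, weighted_terms, top_n=50):
    """Build 'field:(t1^w1 OR t2^w2 ...)' from the top_n heaviest terms.

    weighted_terms: dict of term -> TF*IDF weight, used directly as the boost.
    """
    top = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)
    clauses = [f"{escape_term(t)}^{w:.3f}" for t, w in top[:top_n]]
    return f"{field}:(" + " OR ".join(clauses) + ")"


# Made-up weights; "top_terms" is an assumed field name.
query = build_boosted_or_query("top_terms", {"solr": 2.5, "c++": 1.0}, top_n=2)
```

The resulting string would go in the `q` parameter of a Solr select request, with `rows=20` to ask for the top 20 hits.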

The index I'm searching against has Solr documents with a multi-value field 
that contains the top 50 terms, generated using the same approach as with the 
target document. It also contains stored weights for each of those terms.

I didn't use payload boosting, though I could have, to improve the quality of
this search. It seemed to be working well enough, and speed was pretty important.

Solr returns a sorted list of hits, and then I do a regular vector similarity 
calculation between the target & each of these top 20 hits, and select the best 
one (assuming it passes a similarity threshold).
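
The re-ranking step could look roughly like the sketch below. It assumes cosine similarity as the "regular vector similarity" (the message does not say which measure was used), sparse vectors stored as term-to-weight dicts, and an arbitrary threshold of 0.5; the function names are illustrative:

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(target, hits, threshold=0.5):
    """Return (doc_id, score) for the most similar hit, or None.

    hits: list of (doc_id, vector) pairs, e.g. the top 20 results with their
    stored term weights. None is returned when there are no hits or the best
    score falls below the threshold.
    """
    scored = [(doc_id, cosine(target, vec)) for doc_id, vec in hits]
    if not scored:
        return None
    doc_id, score = max(scored, key=lambda pair: pair[1])
    return (doc_id, score) if score >= threshold else None


# Made-up vectors, purely for illustration.
target = {"a": 1.0, "b": 1.0}
hits = [("d1", {"a": 1.0}), ("d2", {"a": 1.0, "b": 1.0}), ("d3", {"c": 1.0})]
winner = best_match(target, hits)
```

With only 20 candidates per query, the exact similarity pass is cheap compared with the search itself.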

I'd be interested in hearing more about how you use it. Is there a better venue 
than the Mahout list?

If you'd like more details, that's probably better for an off-list 
discussion…doesn't feel very Mahout-ish in nature :)

Though a discussion of the major problem (how to extract "good" terms from the 
text) would be very interesting, as I wound up crafting what felt like a kludgy 
pseudo-NLP solution.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

