Re: RowSimilarity

Pat Ferrel Wed, 18 Jul 2012 09:08:10 -0700

Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature looks like a good fit and seems to do about what you describe below. It seems a good idea to reorder the returned docs by some distance or similarity measure.

About the major problem you mention, extracting good terms: are you talking about creating the query or creating the Solr index?

BTW RowSimilarity works so well for doc similarity that I'm resisting taking it out and will concentrate on reducing the size of the matrix it deals with to mitigate the scaling problems. For the realtime queries I think I'll look deeper into MoreLikeThis. In our use case we'll be taking the TF-IDF term weights from a doc and reweighting some terms based on a user gesture.
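
A minimal sketch of the reweighting step, for illustration only. The thread does not say how a user gesture maps to terms, so the `boosts` dictionary (term to multiplier) and the L2 renormalization are assumptions, and `reweight` is a hypothetical helper name.

```python
import math

def reweight(weights, boosts, default=1.0):
    """Scale TF-IDF weights by per-term boost factors, then L2-normalize.

    weights: dict of term -> TF-IDF weight for one document.
    boosts:  dict of term -> multiplier derived from a user gesture
             (hypothetical; the mapping from gesture to terms is not
             specified in the thread).
    """
    scaled = {t: w * boosts.get(t, default) for t, w in weights.items()}
    norm = math.sqrt(sum(w * w for w in scaled.values()))
    if norm == 0.0:
        return scaled  # nothing to normalize
    return {t: w / norm for t, w in scaled.items()}

doc = {"solr": 3.0, "mahout": 2.0}
query_vector = reweight(doc, {"mahout": 2.0})  # user emphasized "mahout"
```

Renormalizing keeps the reweighted vector comparable with unmodified document vectors when a cosine-style measure is used later.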



On 7/17/12 8:22 PM, Ken Krugler wrote:

Hi Pat,

On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:

Interesting.

I have another requirement, which is to do something like real-time vector-based 
queries. Imagine taking a doc vector, reweighting some terms, then doing a 
query with it, perhaps in a truncated form. There are several ways to do this, 
but only Solr would offer real-time results afaik. It looks like I 
could use your approach below to do this. A quick look at eDisMax, however, 
suggests some problems. The use of pf2 and pf3 would jam the query vector into 
synthesized bi- and tri-grams, for instance.

The simplistic approach I used was to extract the top 50 terms (with TF*IDF 
weights) from the target document, then use those terms (with weights as boosts) to 
do a regular Lucene OR query & request the top 20 hits.
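
A rough sketch of the query construction described above, not the actual code used. It assumes the terms are plain single tokens that need no Lucene escaping; the field name `top_terms` and both function names are hypothetical.

```python
def top_terms(weights, n=50):
    """Return the n highest-weighted (term, weight) pairs, heaviest first.

    Ties are broken alphabetically so the output is deterministic.
    """
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def build_boosted_or_query(terms, field="top_terms"):
    """Join terms into a Lucene OR query, using each weight as a ^boost.

    Assumes terms contain no Lucene special characters; real input
    would need escaping first.
    """
    clauses = [f"{field}:{term}^{weight:.3f}" for term, weight in terms]
    return " OR ".join(clauses)

weights = {"mahout": 2.5, "solr": 1.0, "the": 0.1}
query = build_boosted_or_query(top_terms(weights, n=2))
# query == "top_terms:mahout^2.500 OR top_terms:solr^1.000"
```

The resulting string would be sent as the `q` parameter with `rows=20` to get the top 20 hits.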

The index I'm searching against has Solr documents with a multi-value field 
that contains the top 50 terms, generated using the same approach as with the 
target document. It also contains stored weights for each of those terms.

I didn't use payload boosting, but could have to improve the quality of this 
search - seemed to be working well enough, and speed was pretty important.

Solr returns a sorted list of hits, and then I do a regular vector similarity 
calculation between the target & each of these top 20 hits, and select the best 
one (assuming it passes a similarity threshold).
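
The re-ranking step could look something like the sketch below. The message only says "regular vector similarity", so cosine similarity is an assumption here, as are the threshold value of 0.5 and the helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(target, hits, threshold=0.5):
    """Pick the hit most similar to target, or None if below threshold.

    hits: list of (doc_id, term_weight_dict) pairs, e.g. rebuilt from
    the stored terms and weights of the top 20 Solr results.
    """
    scored = [(cosine(target, vec), doc_id) for doc_id, vec in hits]
    if not scored:
        return None
    score, doc_id = max(scored, key=lambda s: s[0])
    return (doc_id, score) if score >= threshold else None
```

Because the similarity is only computed against the 20 candidates Solr returns, the cost per query stays small regardless of index size.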

I'd be interested in hearing more about how you use it. Is there a better venue 
than the mahout list?

If you'd like more details, that's probably better for an off-list 
discussion…doesn't feel very Mahout-ish in nature :)

Though a discussion of the major problem (how to extract "good" terms from the 
text) would be very interesting, as I wound up crafting what felt like a kludgy 
pseudo-NLP solution.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
