On Jul 18, 2012, at 9:07am, Pat Ferrel wrote:

> Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature looks 
> like a good fit and seems to do about what you describe below. It seems a 
> good idea to reorder the returned docs by some distance or similarity measure.
> 
> The major problem you mention is in extracting good terms; are you talking
> about creating the query, or about creating the Solr index?

Both, since they (must) use the same approach for the query to do a good job of 
matching against docs in the index.

Often two-word phrases are great terms, but just as often they wind up being 
junk - e.g. "otherwise resolving" - where the TF*IDF (or LLR) score is high 
enough to make it one of the top terms for a document, but the phrase doesn't 
really capture anything about the meaning of the document, and thus is just 
noise as far as similarity is concerned.
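
For what it's worth, the scoring step is roughly the shape of the sketch 
below - all the names here are made up for illustration, and the real code 
also has the phrase-extraction and filtering logic, which is the hard part:

    import java.util.*;

    public class TermScorer {
        // Score each candidate term by TF*IDF and keep the top N as the
        // document's "signature" terms. termFreqs maps term -> count in
        // this doc; docFreqs maps term -> number of docs containing it.
        public static List<Map.Entry<String, Double>> topTerms(
                Map<String, Integer> termFreqs,
                Map<String, Integer> docFreqs,
                int numDocs, int n) {
            Map<String, Double> scored = new HashMap<String, Double>();
            for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
                Integer df = docFreqs.get(e.getKey());
                double idf = Math.log((double) numDocs / (df == null ? 1 : df));
                scored.put(e.getKey(), e.getValue() * idf);
            }
            List<Map.Entry<String, Double>> sorted =
                new ArrayList<Map.Entry<String, Double>>(scored.entrySet());
            Collections.sort(sorted, new Comparator<Map.Entry<String, Double>>() {
                public int compare(Map.Entry<String, Double> a,
                                   Map.Entry<String, Double> b) {
                    return b.getValue().compareTo(a.getValue());
                }
            });
            return sorted.subList(0, Math.min(n, sorted.size()));
        }
    }

The junk-bigram problem shows up exactly here: a rare phrase gets a huge IDF, 
so frequency alone can't tell you whether it means anything.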

-- Ken
 
> 
> BTW RowSimilarity works so well for doc similarity I'm resisting taking it 
> out and will concentrate on reducing the size of the matrix it deals with to 
> mitigate the scaling problems. For the realtime queries I think I'll look 
> deeper into MoreLikeThis. In our use case we'll be taking the TF*IDF term 
> weights from a doc and reweighting some terms based on a user gesture.
> 
> 
> On 7/17/12 8:22 PM, Ken Krugler wrote:
>> Hi Pat,
>> 
>> On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
>> 
>>> Interesting.
>>> 
>>> I have another requirement, which is to do something like real-time, 
>>> vector-based queries. Imagine taking a doc vector, reweighting some terms, 
>>> then doing a query with it, perhaps in a truncated form. There are several 
>>> ways to do this, but only Solr would offer real-time results, AFAIK. It 
>>> looks like I could use your approach below to do this. A quick look at 
>>> eDisMax, however, suggests some problems. The use of pf2 and pf3 would jam 
>>> the query vector into synthesized bigrams and trigrams, for instance.
>> The simplistic approach I used was to extract the top 50 terms (with TF*IDF 
>> weights) from the target document, then use those terms (with weights as 
>> boosts) to do a regular Lucene OR query & request the top 20 hits.
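
To make that concrete, the query construction is roughly the sketch below, 
against the Lucene 3.x API - the "top_terms" field name is invented, and 
topTerms is assumed to be the term -> TF*IDF weight map from the extraction 
step:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class CandidateSearch {
        // Build an OR query from the top terms, using each term's TF*IDF
        // weight as its boost, and fetch the 20 best candidate matches.
        public static TopDocs findCandidates(IndexSearcher searcher,
                                             Map<String, Float> topTerms)
                throws IOException {
            BooleanQuery query = new BooleanQuery();
            for (Map.Entry<String, Float> e : topTerms.entrySet()) {
                TermQuery tq = new TermQuery(new Term("top_terms", e.getKey()));
                tq.setBoost(e.getValue());
                query.add(tq, BooleanClause.Occur.SHOULD);
            }
            return searcher.search(query, 20);
        }
    }

With only 50 SHOULD clauses this stays well under Lucene's default max clause 
count, and coordination scoring naturally favors docs that share more of the 
top terms.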
>> 
>> The index I'm searching against has Solr documents with a multi-value field 
>> that contains the top 50 terms, generated using the same approach as with 
>> the target document. It also contains stored weights for each of those terms.
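
Assembling each indexed document looks something like this (SolrJ; again the 
field names are invented, and encoding the per-term weights as "term|weight" 
strings is just one option for storing them alongside the terms):

    import java.util.Map;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DocIndexer {
        // Add one document: the top terms go into a multi-valued field,
        // with their weights stored alongside as "term|weight" strings.
        public static void index(SolrServer server, String docId,
                                 Map<String, Float> topTerms) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId);
            for (Map.Entry<String, Float> e : topTerms.entrySet()) {
                doc.addField("top_terms", e.getKey());
                doc.addField("top_term_weights", e.getKey() + "|" + e.getValue());
            }
            server.add(doc);
        }
    }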
>> 
>> I didn't use payload boosting, though I could have, to improve the quality 
>> of this search - it seemed to be working well enough, and speed was pretty 
>> important.
>> 
>> Solr returns a sorted list of hits, and then I do a regular vector 
>> similarity calculation between the target & each of these top 20 hits, and 
>> select the best one (assuming it passes a similarity threshold).
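
The re-ranking step is just cosine similarity over the sparse term -> weight 
maps, something like the sketch below (assuming the candidate's weights have 
already been decoded from the stored field):

    import java.util.Map;

    public class Rerank {
        // Cosine similarity between two sparse term -> weight vectors.
        // A candidate only wins if its score clears the threshold.
        public static double cosine(Map<String, Float> a, Map<String, Float> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<String, Float> e : a.entrySet()) {
                Float w = b.get(e.getKey());
                if (w != null) {
                    dot += e.getValue() * w;
                }
                normA += e.getValue() * e.getValue();
            }
            for (float w : b.values()) {
                normB += w * w;
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0;
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }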
>> 
>>> I'd be interested in hearing more about how you use it. Is there a better 
>>> venue than the mahout list?
>> If you'd like more details, that's probably better for an off-list 
>> discussion…doesn't feel very Mahout-ish in nature :)
>> 
>> Though a discussion of the major problem (how to extract "good" terms from 
>> the text) would be very interesting, as I wound up crafting what felt like a 
>> kludgy pseudo-NLP solution.
>> 
>> -- Ken
>> 
> 
> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



