Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature
looks like a good fit and seems to do about what you describe below. It
seems a good idea to reorder the returned docs by some distance or
similarity measure.
As for the major problem you mention, extracting good terms: are you talking
about creating the query or about creating the Solr index?
BTW RowSimilarity works so well for doc similarity that I'm reluctant to
take it out; instead I'll concentrate on reducing the size of the matrix
it deals with to mitigate the scaling problems. For the real-time queries
I think I'll look deeper into MoreLikeThis. In our use case we'll take the
TF-IDF term weights from a doc and reweight some terms based on a
user gesture.
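To make the reweighting idea concrete, here is a minimal sketch in plain
Python. It assumes the doc vector is a dict of term -> TF-IDF weight and that
the user gesture has already been turned into per-term multipliers; the names
reweight, boosts, and top_n are hypothetical and do not come from Mahout or
Solr.

```python
def reweight(doc_vector, boosts, top_n=50):
    """Scale selected term weights, then keep only the top_n heaviest terms.

    doc_vector: dict of term -> TF-IDF weight (assumed representation).
    boosts:     dict of term -> multiplier derived from a user gesture
                (hypothetical; terms not listed keep their weight).
    """
    adjusted = {term: w * boosts.get(term, 1.0) for term, w in doc_vector.items()}
    # Sort by descending weight, breaking ties alphabetically for stable output.
    ranked = sorted(adjusted.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_n])


if __name__ == "__main__":
    doc = {"solr": 0.5, "mahout": 0.25, "hadoop": 0.125}
    print(reweight(doc, {"hadoop": 8.0, "solr": 0.0}, top_n=2))
```

The truncated result could then be used as the query vector for whichever
real-time search ends up being used.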
On 7/17/12 8:22 PM, Ken Krugler wrote:
Hi Pat,
On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
Interesting.
I have another requirement, which is to do something like real-time,
vector-based queries. Imagine taking a doc vector, reweighting some terms, then
doing a query with it, perhaps in a truncated form. There are several ways to
do this, but only Solr would offer real-time results, afaik. It looks like I
could use your approach below to do this. A quick look at eDisMax, however,
suggests some problems. The use of pf2 and pf3 would jam the query vector into
synthesized bigrams and trigrams, for instance.
The simplistic approach I used was to extract the top 50 terms (with TF*IDF
weights) from the target document, then use those terms (with weights as boosts) to
do a regular Lucene OR query & request the top 20 hits.
The index I'm searching against has Solr documents with a multi-value field
that contains the top 50 terms, generated using the same approach as with the
target document. It also contains stored weights for each of those terms.
I didn't use payload boosting, though I could have to improve the quality of
this search; it seemed to be working well enough, and speed was pretty important.
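A rough sketch of how such a boosted OR query string could be assembled, in
plain Python. The field name top_terms and the helper names are hypothetical,
and terms are assumed to be single tokens. The string is meant for the
standard Lucene query syntax (term^boost joined with OR), e.g. sent to Solr as
q with rows=20.

```python
import re

# Characters with special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_term(term):
    """Backslash-escape Lucene query syntax characters in a single term."""
    return _SPECIAL.sub(r"\\\1", term)


def boosted_or_query(term_weights, field="top_terms", top_n=50):
    """Build 'field:term^weight OR ...' from the top_n heaviest terms.

    term_weights: dict of term -> TF*IDF weight (assumed representation).
    field:        hypothetical name of the multi-valued term field.
    """
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    # Skip non-positive weights: they would be useless (or invalid) as boosts.
    clauses = [f"{field}:{escape_term(t)}^{w:.3f}" for t, w in ranked if w > 0]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(boosted_or_query({"solr": 2.0, "lucene": 1.5, "the": 0.01}, top_n=2))
```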
Solr returns a sorted list of hits, and then I do a regular vector similarity
calculation between the target & each of these top 20 hits, and select the best
one (assuming it passes a similarity threshold).
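The rerank step could look something like the sketch below. It assumes cosine
similarity as the "regular vector similarity" (the measure isn't named above),
sparse vectors as dicts of term -> weight, and an arbitrary threshold of 0.5;
best_match and its parameters are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(target, hits, threshold=0.5):
    """Return (doc_id, score) for the most similar hit, or None if it falls
    below the threshold.

    hits: list of (doc_id, vector) pairs, e.g. rebuilt from the stored terms
          and weights of the top 20 Solr results (assumed shape).
    """
    scored = [(cosine(target, vec), doc_id) for doc_id, vec in hits]
    if not scored:
        return None
    score, doc_id = max(scored)
    return (doc_id, score) if score >= threshold else None


if __name__ == "__main__":
    target = {"solr": 1.0, "mahout": 1.0}
    hits = [("d1", {"solr": 1.0}), ("d2", {"solr": 1.0, "mahout": 1.0})]
    print(best_match(target, hits))
```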
I'd be interested in hearing more about how you use it. Is there a better venue
than the mahout list?
If you'd like more details, that's probably better for an off-list
discussion…doesn't feel very Mahout-ish in nature :)
Though a discussion of the major problem (how to extract "good" terms from the
text) would be very interesting, as I wound up crafting what felt like a kludgy
pseudo-NLP solution.
-- Ken
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr