On 07/22/2013 12:20 PM, Pat Ferrel wrote:
> My understanding of the Solr proposal is that it puts B's row-similarity
> matrix into a vector per item. That means each row is turned into
> "terms" = external IDs -- I'm not sure how the weights of each term are
> encoded.
This is the key question for me. The best idea I've had is to use
termFreq as a proxy for weight. It's only an integer, so there are
scaling issues to consider, but you can apply a per-field weight to
manage that.

Also, Lucene (and Solr) doesn't provide an obvious way to load term
frequencies directly: probably the simplest thing to do is just to
repeat each cross-term N times and let the text analysis take care of
counting them. That's inefficient, but probably the quickest way to get
going. Alternatively, there are some lower-level Lucene indexing APIs
(DocFieldConsumer et al.) which I haven't fully explored, but which
would allow for more direct loading of fields.
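To make the repetition trick concrete, here's a minimal sketch of what
I have in mind, assuming the Lucene 4.x document API (the class and
field names here are hypothetical): each external ID is repeated
round(weight * scale) times, so the analyzer's term-frequency counting
recovers an integer approximation of the weight.

import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Hypothetical encoder: approximate each weight as an integer
// repetition count so that term-frequency counting recovers it.
public class RepeatedTermEncoder {

  // Repeat each cross-term round(weight * scale) times, space-separated.
  static String encode(Map<String, Double> termWeights, double scale) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Double> e : termWeights.entrySet()) {
      long reps = Math.round(e.getValue() * scale);
      for (long i = 0; i < reps; i++) {
        sb.append(e.getKey()).append(' ');
      }
    }
    return sb.toString();
  }

  // Build a document whose "similar_items" field carries one row of the
  // similarity matrix as repeated external IDs.
  static Document toDocument(String itemId, Map<String, Double> row,
                             double scale) {
    Document doc = new Document();
    doc.add(new StringField("item_id", itemId, Field.Store.YES));
    doc.add(new TextField("similar_items", encode(row, scale),
                          Field.Store.NO));
    return doc;
  }
}

The scale parameter is where the integer-quantization tradeoff lives: a
bigger scale preserves more weight resolution at the cost of a fatter
index.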
Then one probably wants to override the scoring in some way (unless
TF-IDF is somehow the way to go??)
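For instance, a minimal sketch of that override, assuming the Lucene
4.x similarity API: score linearly by the encoded term frequency and
switch off IDF and length normalization, both of which would otherwise
distort the encoded weights.

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Hypothetical similarity that treats the repeated-term frequency as
// the weight itself.
public class RawWeightSimilarity extends DefaultSimilarity {

  @Override
  public float tf(float freq) {
    return freq; // linear in tf instead of the default sqrt(freq)
  }

  @Override
  public float idf(long docFreq, long numDocs) {
    return 1.0f; // ignore document frequency entirely
  }

  @Override
  public float lengthNorm(FieldInvertState state) {
    return 1.0f; // don't penalize heavily repeated fields
  }
}

In Solr this kind of class can be plugged in via a <similarity
class="..."/> element in schema.xml; on the plain Lucene side you'd set
it on both the IndexWriterConfig and the IndexSearcher.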