Hi there!
As a Solr newbie who has however worked with Lucene before, I have an
unusual question for the experts:
Question:
Can I, and if so, how do I perform index-time term boosting in documents
where the boost value for a given term is not global (the same across
all documents) but can instead vary per document?
In other words: I understand there's a way to specify term boost values
for search queries, but is that also possible for indexed documents?
Here's what I'm fundamentally trying to do:
I want to index and search over documents that have a special,
associative-array-like property:
Each document has a list of unique words, and each word has a numeric
value between 0 and 1. Each value expresses the document's weight along
the dimension named by that word: for example, "cat": 0.99 is similar
to "cat": 0.98, but not to "cat": 0.21. All documents share the same
set of words, and there are lots of them: about 1 million. (If
necessary, I can reduce the number of words to tens of thousands, but
then the documents would no longer share the same set of words.) Most
of the word values in a typical document are 0.00.
Example:
Documents in the index:
d1:
cat: 0.99
dog: 0.42
car: 0.00
d2:
cat: 0.02
dog: 0.00
car: 0.00
Incoming search query (with these numeric term-boosts):
q:
cat: 0.99
dog: 0.11
car: 0.00 (not specified in query)
The ideal result would be that q matches d1 much more than d2.
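To make the ranking I'm after concrete, here is a minimal sketch (in Python, just to illustrate the math, not a Solr implementation) of the dot-product-style scoring I have in mind, where missing query terms count as 0:

```python
def score(query, doc):
    """Dot product of the query's term weights with the document's term weights."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())

d1 = {"cat": 0.99, "dog": 0.42, "car": 0.00}
d2 = {"cat": 0.02, "dog": 0.00, "car": 0.00}
q  = {"cat": 0.99, "dog": 0.11}

# q scores about 1.03 against d1 but only about 0.02 against d2,
# so d1 should rank far above d2.
```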
Here's my analysis of my situation and potential solutions:
- Because I have so many words, I cannot use a separate field for each
word; that would overload Solr/Lucene. This is unfortunate, because I
know there is index-time boosting on a per-field basis (reference:
http://wiki.apache.org/solr/SolrRelevancyFAQ#head-d846ae0059c4e6b7f0d0bb2547ac336a8f18ac2f),
and because I could have used Function Queries (reference:
http://wiki.apache.org/solr/FunctionQuery).
- As a (stupid) workaround, I could convert my documents into pure
text: a numeric value such as "cat": 0.99 would be translated into the
word "cat" repeated 99 times. This would be done for every word in a
document, and the resulting text would then be indexed and used for
regular scoring in Solr. This approach seems doable, but inefficient
and far from elegant.
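Concretely, the translation I have in mind would look something like this (a Python sketch; the scale factor of 100 and the function name are just my choices for illustration):

```python
def weights_to_text(weights, scale=100):
    """Encode each word's weight by repeating the word round(weight * scale) times."""
    parts = []
    for word, weight in weights.items():
        parts.extend([word] * round(weight * scale))
    return " ".join(parts)

# "cat": 0.99 becomes "cat" repeated 99 times, "dog": 0.42 becomes
# "dog" 42 times, and "car": 0.00 disappears entirely.
doc_text = weights_to_text({"cat": 0.99, "dog": 0.42, "car": 0.00})
```

Term frequency would then stand in for the weight during scoring, at the cost of bloating the index.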
Am I reinventing the wheel here, or is what I'm trying to do
fundamentally different from what Solr and Lucene have to offer?
Any comments are highly appreciated. What can I do about this?
Thanks,
Andreas