Hi there!
As a Solr newbie who has worked with Lucene before, I have an unusual question for the experts:

Question:

Can I perform index-time term boosting where the boost value of a given term is not the same for all documents (i.e. no global boost per term) but can instead differ per document? And if so, how? In other words: I understand there is a way to specify term boost values in search queries, but is that also possible for indexed documents?
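
For clarity, by query-time term boosts I mean the usual caret syntax the Lucene/Solr query parsers support, for example (assuming a field called "text"; the field name is only for illustration):

q=text:cat^0.99 text:dog^0.11

What I'm looking for is the index-time counterpart of that, with a different boost per document.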


Here's what I'm fundamentally trying to do:

I want to index and search over documents that have a special, associative-array-like property: each document has a list of unique words, and each word has a numeric value between 0 and 1. Each value expresses similarity along the dimension named by that word. For example, "cat": 0.99 is similar to "cat": 0.98, but not to "cat": 0.21. All documents share the same set of words, and there are a lot of them: about 1 million. (If necessary, I could reduce the number of words to tens of thousands, but then the documents would no longer share the same set of words.) Most of the word values in a typical document are 0.00.
Example:
Documents in the index:
d1:
cat: 0.99
dog: 0.42
car: 0.00

d2:
cat: 0.02
dog: 0.00
car: 0.00

Incoming search query (with these numeric term-boosts):
q:
cat: 0.99
dog: 0.11
car: 0.00 (not specified in query)

The ideal result would be that q matches d1 much more than d2.
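
To make the intended scoring explicit: what I have in mind is essentially a dot product between the query weights and each document's word weights. A minimal sketch in Java (purely illustrative, not Solr code; the class and method names are made up):

import java.util.Map;

public class DotProductSketch {

    // Plain dot product between the query weights and a document's word weights.
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            // words missing from the document count as 0.00
            sum += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> q  = Map.of("cat", 0.99, "dog", 0.11);
        Map<String, Double> d1 = Map.of("cat", 0.99, "dog", 0.42, "car", 0.00);
        Map<String, Double> d2 = Map.of("cat", 0.02, "dog", 0.00, "car", 0.00);
        System.out.println("score(q, d1) = " + score(q, d1)); // 0.99*0.99 + 0.11*0.42 = 1.0263
        System.out.println("score(q, d2) = " + score(q, d2)); // 0.99*0.02              = 0.0198
    }
}

With these numbers, d1 scores about 1.03 and d2 about 0.02, which is exactly the "d1 matches much more than d2" behaviour I'm after.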


Here's my analysis of my situation and potential solutions:

- Because I have so many words, I cannot use a separate field per word; that would overload Solr/Lucene. This is unfortunate, because I know index-time boosting is possible on a per-field basis (reference: http://wiki.apache.org/solr/SolrRelevancyFAQ#head-d846ae0059c4e6b7f0d0bb2547ac336a8f18ac2f), and because I could then have used Function Queries (reference: http://wiki.apache.org/solr/FunctionQuery).
- As a (stupid) workaround, I could convert my documents into plain text: a value such as "cat": 0.99 would be translated into the word "cat" repeated 99 times. This would be done for every word of each document, and the resulting text would then be indexed and used for regular scoring in Solr. This approach seems doable, but inefficient and far from elegant (rough sketch below).
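
Here is a rough sketch of the repetition workaround (again purely illustrative; the class and method names are made up):

import java.util.Map;

public class RepeatWorkaroundSketch {

    // Turns {"cat": 0.99, "dog": 0.42, "car": 0.00} into
    // "cat cat ... cat dog dog ... dog" (99 cats, 42 dogs, no car),
    // which would then be indexed as an ordinary text field.
    static String expand(Map<String, Double> weights) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            int repetitions = (int) Math.round(e.getValue() * 100); // 0.99 -> 99
            for (int i = 0; i < repetitions; i++) {
                sb.append(e.getKey()).append(' ');
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(expand(Map.of("cat", 0.99, "dog", 0.42, "car", 0.00)));
    }
}

The obvious downsides are the index bloat (up to 100 extra tokens per word per document) and the fact that the weights are only approximated via term frequency, which is why I would much prefer a real per-document term boost.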


Am I reinventing the wheel here, or is what I'm trying to do fundamentally different from what Solr and Lucene have to offer?

What can I do about this? Any comments are highly appreciated.


Thanks,

Andreas
