Hi there!
As a Solr newbie who has however worked with Lucene before, I have an
unusual question for the experts:
Question:
Can I, and if so, how do I perform index-time term boosting in documents
where the boost value for a given term is not global (the same across
all documents) but can instead vary per document?
In other words: I understand there's a way to specify term boost values
for search queries, but is that also possible for indexed documents?
Here's what I'm fundamentally trying to do:
I want to index and search over documents that have a special,
associative-array-like property:
Each document has a list of unique words, and each word has a numeric
value between 0 and 1. Each value expresses the document's weight along
the dimension named by that word: for example, "cat": 0.99 is similar
to "cat": 0.98, but not to "cat": 0.21. All documents share the same
set of words, and there are lots of them: about 1 million. (If
necessary, I can reduce the number of words to tens of thousands, but
then the documents would no longer share the same set of words.) Most
of the word values in a typical document are 0.00.
Example:
Documents in the index:
d1:
cat: 0.99
dog: 0.42
car: 0.00
d2:
cat: 0.02
dog: 0.00
car: 0.00
Incoming search query (with these numeric term-boosts):
q:
cat: 0.99
dog: 0.11
car: 0.00 (not specified in query)
The ideal result would be that q matches d1 much more than d2.
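To make the ranking I'm after concrete, here is a minimal sketch (in Python, just to illustrate the math, not a Solr implementation) of the dot-product-style scoring I have in mind, where missing query terms count as 0:

```python
def score(query, doc):
    """Dot product of the query's term weights with the document's term weights."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())

d1 = {"cat": 0.99, "dog": 0.42, "car": 0.00}
d2 = {"cat": 0.02, "dog": 0.00, "car": 0.00}
q  = {"cat": 0.99, "dog": 0.11}

# q scores about 1.03 against d1 but only about 0.02 against d2,
# so d1 should rank far above d2.
```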
Here's my analysis of my situation and potential solutions:
- Because I have so many words, I cannot use a separate field for each
word; that would overload Solr/Lucene. This is unfortunate, because I
know there is index-time boosting on a per-field basis (reference:
http://wiki.apache.org/solr/SolrRelevancyFAQ#head-d846ae0059c4e6b7f0d0bb2547ac336a8f18ac2f),
and because I could have used Function Queries (reference:
http://wiki.apache.org/solr/FunctionQuery).
- As a (stupid) workaround, I could convert my documents into pure
text: a numeric value such as "cat": 0.99 would be translated into the
word "cat" repeated 99 times. This would be done for every word in a
document, and the resulting text would then be indexed and used for
regular scoring in Solr. This approach seems doable, but inefficient
and far from elegant.
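Concretely, the translation I have in mind would look something like this (a Python sketch; the scale factor of 100 and the function name are just my choices for illustration):

```python
def weights_to_text(weights, scale=100):
    """Encode each word's weight by repeating the word round(weight * scale) times."""
    parts = []
    for word, weight in weights.items():
        parts.extend([word] * round(weight * scale))
    return " ".join(parts)

# "cat": 0.99 becomes "cat" repeated 99 times, "dog": 0.42 becomes
# "dog" 42 times, and "car": 0.00 disappears entirely.
doc_text = weights_to_text({"cat": 0.99, "dog": 0.42, "car": 0.00})
```

Term frequency would then stand in for the weight during scoring, at the cost of bloating the index.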
Am I reinventing the wheel here, or is what I'm trying to do
fundamentally different from what Solr and Lucene have to offer?
Any comments are highly appreciated. What can I do about this?
Thanks,
Andreas