On 5/18/06, jason rutherglen <[EMAIL PROTECTED]> wrote:
> It uses Jakarta HTTP Client and implements a PriorityQueue-like structure, using the java.util.concurrent queues and a thread pool, for merging results.
Are you able to contribute this code, or is it proprietary? Have you implemented sorting by field also? That would currently require the additional constraint that the sort field be stored as well as indexed (Lucene only requires it be indexed).
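For anyone following along, the kind of merge described above (per-server searches run on a thread pool, results combined through a priority queue) might look roughly like the sketch below. This is not jason's code. The `Hit` class, `mergeTopN`, and the in-memory callables standing in for the HTTP calls are all hypothetical, and it merges by score only, not by sort field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ShardMerge {
    /** Hypothetical hit: a document id and the score one server gave it. */
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    /** Runs every per-server search on a thread pool and keeps the n best hits. */
    static List<Hit> mergeTopN(List<Callable<List<Hit>>> servers, int n) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, servers.size()));
        try {
            Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
            // Min-heap on score: the weakest hit kept so far sits at the head.
            PriorityQueue<Hit> heap = new PriorityQueue<>(Math.max(1, n), byScore);
            for (Future<List<Hit>> f : pool.invokeAll(servers)) {
                for (Hit h : f.get()) {
                    heap.offer(h);
                    if (heap.size() > n) heap.poll(); // evict the current lowest score
                }
            }
            List<Hit> top = new ArrayList<>(heap);
            top.sort(byScore.reversed()); // best score first
            return top;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("search on one server failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Two fake "servers" returning canned hits instead of making HTTP requests.
        List<Callable<List<Hit>>> servers = new ArrayList<>();
        servers.add(() -> Arrays.asList(new Hit("a", 0.9f), new Hit("b", 0.2f)));
        servers.add(() -> Arrays.asList(new Hit("c", 0.5f)));
        for (Hit h : mergeTopN(servers, 2)) System.out.println(h.id + " " + h.score);
    }
}
```

Note that merging by raw score like this is only meaningful if the servers score consistently, which is where the global IDF question comes in.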
> Perhaps the global IDF is not a big deal? The idea is to distribute the documents evenly over all the machines. However, when a new server comes online, this may present a problem, as it would start at 0 documents.
Hmmm, yes, idf values could get out-of-whack when there are very few documents on a server.
> I probably would not cache the global IDF; I would simply merge it each time. I actually do not fully understand what the global IDF means, as I need to dig more deeply into this.
Inverse document frequency. It makes rarer terms count more. Its two components are the number of docs in the collection and the number of docs containing a specific term.
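To make those two components concrete, here is a small sketch of how a global IDF could be merged from per-server counts. It assumes the usual Lucene-style formula, idf = ln(numDocs / (docFreq + 1)) + 1. The `GlobalIdf` class and its method names are made up for illustration and are not part of any existing API.

```java
class GlobalIdf {
    /** Lucene-style idf (assumed formula): rarer terms get a larger value. */
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Sums both components over all servers, then applies the formula once. */
    static double globalIdf(long[] docFreqs, long[] numDocs) {
        long df = 0;
        long n = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            df += docFreqs[i]; // docs containing the term on server i
            n += numDocs[i];   // all docs on server i
        }
        return idf(df, n);
    }

    public static void main(String[] args) {
        // Server 0 is established; server 1 just came online with only 10 docs.
        long[] docFreqs = {9, 0};
        long[] numDocs = {1000, 10};
        System.out.println("local idf, server 0: " + idf(docFreqs[0], numDocs[0]));
        System.out.println("local idf, server 1: " + idf(docFreqs[1], numDocs[1]));
        System.out.println("global idf:          " + globalIdf(docFreqs, numDocs));
    }
}
```

The nearly empty server computes a quite different local value for the same term, which is the out-of-whack case mentioned above. The merged value depends only on the summed counts.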
>> I don't think everything can be done in a single call, since by the time you score docs against a query you have lost how you arrived at the composite score.
> I'm not sure what "you have lost how you arrived at the composite score" means. Could you explain?
If you query for "x OR y", the doc score you get will be a combination of the doc score for x and the doc score for y. After you have the document score for the complete query, you can't adjust the IDF for just one of the terms because you no longer know the individual scores for x and y.

-Yonik
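A toy illustration of that point, using a deliberately simplified score (the sum of tf * idf per term, which is an assumption here and not Lucene's full formula). The `ScoreDecomposition` class is hypothetical.

```java
class ScoreDecomposition {
    /** Simplified score for "x OR y": one tf * idf contribution per term. */
    static double score(double tfX, double idfX, double tfY, double idfY) {
        return tfX * idfX + tfY * idfY;
    }

    public static void main(String[] args) {
        // Local idf on one server: x looks rare (2.0), y is common (1.0).
        double localA = score(1, 2.0, 2, 1.0); // doc A: x once, y twice
        double localB = score(2, 2.0, 0, 1.0); // doc B: x twice, no y
        // With the global idf, x turns out to be as common as y (1.0).
        double globalA = score(1, 1.0, 2, 1.0);
        double globalB = score(2, 1.0, 0, 1.0);
        System.out.println("local:  A=" + localA + " B=" + localB);
        System.out.println("global: A=" + globalA + " B=" + globalB);
    }
}
```

Docs A and B come back with the same composite score under the local idf, but they should differ once x's idf is corrected. Given only the composite number, there is no way to tell how much of it came from x, so the correction can't be applied after the fact.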