I have Nutch configured to crawl several dozen sites and store the results in 
Solr for searching. An adaptive fetch schedule is being used so that pages 
which change less frequently are crawled less often. However, I've run into an 
issue where many of the documents have a boost of "Infinity". Consequently, the 
document score in Solr is extremely high for these documents, and they swamp 
other more valid search results.

My question is: why is the boost value "Infinity" for these documents?

I'm running Nutch 1.2 via a simple script that iterates through a full crawl 
cycle. I'm wondering whether re-crawling these pages (because of the adaptive 
fetch schedule) is affecting the score, i.e. whether a link from PageA to PageB 
is being counted multiple times and boosting the score of PageB.
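
To make that suspicion concrete, here is a toy Java sketch. It is not Nutch 
code: the class, the method names and the inlinkShare parameter are all made 
up for illustration. It only assumes that each re-crawl adds the same inlink 
contribution to a page's score again, in proportion to the current score, and 
it counts how many cycles a 32-bit float survives before overflowing.

```java
public class BoostOverflowSketch {

    // Hypothetical update rule: every crawl cycle re-adds a contribution
    // proportional to the page's current score (the suspected double counting).
    static float rescore(float score, float inlinkShare) {
        return score + score * inlinkShare;
    }

    // Returns the number of cycles after which the score becomes infinite,
    // or -1 if it is still finite after maxCycles.
    static int cyclesUntilInfinite(float start, float inlinkShare, int maxCycles) {
        float score = start;
        for (int cycle = 1; cycle <= maxCycles; cycle++) {
            score = rescore(score, inlinkShare);
            if (Float.isInfinite(score)) {
                return cycle;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A score that doubles on every cycle, starting from a normal boost of 1.0.
        System.out.println("doubling: " + cyclesUntilInfinite(1.0f, 1.0f, 1000));
        // A score that is never re-counted stays at 1.0 and never overflows.
        System.out.println("no re-count: " + cyclesUntilInfinite(1.0f, 0.0f, 1000));
    }
}
```

If something like this is happening, any compounding growth would eventually 
push a float past Float.MAX_VALUE (about 3.4e38), at which point the value 
becomes Infinity and stays there. Whether OPICScoringFilter and "updatedb" 
actually behave this way is exactly what I can't confirm.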

I'm not able to follow all the code in the OPICScoringFilter and the "updatedb" 
process to be entirely sure that links are not being double counted. Any 
insights or pointers would be greatly appreciated.

Blessings,
TwP

PS  I am seeing a steady growth of boost values in the Solr search results. 
There are documents with normal boost values (around 1.0) and documents with 
increasing boost values all the way up to "Infinity". Obviously they plateau at 
"Infinity".
