Hi all.
I'm using Nutch 1.9 in local mode and Solr 4.10 with about half a million documents.
An adaptive fetch schedule is used to crawl pages that change
frequently.
I have noticed that Nutch is calculating an extremely high boost for some
documents, so the score in Solr is also extremely high for these documents
and, as a consequence, the order of the results is distorted by this wrong boost.
This is what I consider a correct Solr output for the query "cubadebate":
*******************************
{
"responseHeader": {
"status": 0,
"QTime": 195
},
"response": {
"numFound": 183486,
"start": 0,
"maxScore": 2.7115784,
"docs": [
{
"url": "http://www.cubadebate.cu/",
"boost": 1.0175576,
"score": 2.7115784
},
{
"url": "http://www.cubadebate.cu/editores/preguntas-frecuentes/",
"boost": 0.11512774,
"score": 0.59315777
},
{
"url": "http://www.cubadebate.cu/editores/",
"boost": 0.16240995,
"score": 0.50842094
},
{
"url": "http://www.cubadebate.cu/feed/",
"boost": 0.8635264,
"score": 0.42501986
},
{
"url": "http://www.cubadebate.cu/etiqueta/cine/",
"boost": 0.13792185,
"score": 0.3541832
},
{
"url": "http://www.cubadebate.cu/web2/",
"boost": 0.114989564,
"score": 0.3389473
},
{
"url":
"http://www.cubadebate.cu/opinion/2015/03/06/diferencias-conciliables/",
"boost": 0.18748672,
"score": 0.28334656
},
{
"url":
"http://www.cubadebate.cu/noticias/2015/02/02/freddy-asiel-voy-por-el-desquite/",
"boost": 0.13997546,
"score": 0.28334656
},
{
"url": "http://www.cubadebate.cu/especiales/2015/03/05/querido-hugo/",
"boost": 0.13172969,
"score": 0.28334656
},
{
"url":
"http://www.cubadebate.cu/noticias/2015/02/08/grammys-la-lista-completa-de-los-ganadores/comment-page-1/",
"boost": 0.12959023,
"score": 0.24792825
}
]
},
***********************************************
This is an incorrect Solr output for the query "cubadebate":
{
"responseHeader": {
"status": 0,
"QTime": 111
},
"response": {
"numFound": 172952,
"start": 0,
"maxScore": 22939964,
"docs": [
{
"url":
"http://www.tvcubana.icrt.cu/seccion-temas/1088-yo-tambien-estoy-en-la-celac",
"boost": 1422334460,
"score": 22939964
},
{
"url":
"http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14065-domadores-de-cuba-enfrentaran-a-guerreros-de-mexico-en-semifinal-de-la-v-serie-mundial-de-boxeo",
"boost": 1675646080,
"score": 22476484
},
{
"url": "http://www.radiohc.cu/noticias/deportes/page/387",
"boost": 1325039870,
"score": 21191032
},
{
"url":
"http://www.perlavision.icrt.cu/index.php/bloqueo/13922-nacera-en-mayo-engage-cuba-un-vigoroso-lobby-antibloqueo-en-congreso-de-eeuu",
"boost": 1663792640,
"score": 18730402
},
{
"url":
"http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14004-cuba-en-semifinales-de-serie-mundial-el-proximo-mes",
"boost": 1528675840,
"score": 18730402
},
{
"url": "http://www.radiohc.cu/noticias/ciencias/page/76",
"boost": 1326217090,
"score": 18542152
},
{
"url": "http://www.radiohc.cu/noticias/cultura/page/272",
"boost": 1327128190,
"score": 18542152
},
{
"url":
"http://www.tvcubana.icrt.cu/archivo/118-archiv0/1060-beisbol-cubano-sera-el-tema-de-la-mesa-redonda-en-sus-emisiones-de-miercoles-y-jueves",
"boost": 1424298370,
"score": 18542152
},
{
"url":
"http://www.tvcubana.icrt.cu/archivo/118-archiv0/1073-el-programa-nacional-de-medicamentos-en-la-mesa-redonda-miercoles-y-jueves",
"boost": 1424231940,
"score": 18542152
},
{
"url":
"http://www.tvcubana.icrt.cu/archivo/118-archiv0/897-la-mesa-redonda-presentara-miercoles-y-jueves-las-cooerativas-no-agropecuarias-p",
"boost": 1424386690,
"score": 18542152
}
]
},
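To find which documents are affected, a check along these lines could be run over the url/boost pairs returned by Solr. This is only a sketch: the class name BoostOutliers, the method findOutliers and the threshold of 10.0 are assumptions made for illustration, not part of Nutch or Solr. The sample values are copied from the two outputs above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BoostOutliers {

    // Returns the URLs whose boost is NaN, infinite, or above maxExpected.
    // maxExpected is a hypothetical threshold chosen by the caller.
    public static List<String> findOutliers(Map<String, Double> boostByUrl, double maxExpected) {
        List<String> outliers = new ArrayList<>();
        for (Map.Entry<String, Double> entry : boostByUrl.entrySet()) {
            double boost = entry.getValue();
            if (Double.isNaN(boost) || Double.isInfinite(boost) || boost > maxExpected) {
                outliers.add(entry.getKey());
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        // Two values from the correct output and two from the incorrect one.
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("http://www.cubadebate.cu/", 1.0175576);
        boosts.put("http://www.cubadebate.cu/feed/", 0.8635264);
        boosts.put("http://www.radiohc.cu/noticias/deportes/page/387", 1325039870.0);
        boosts.put("http://www.radiohc.cu/noticias/ciencias/page/76", 1326217090.0);

        for (String url : findOutliers(boosts, 10.0)) {
            System.out.println("suspicious boost: " + url);
        }
    }
}
```

With a threshold of 10.0, only the two radiohc.cu entries are reported, since the boosts in the correct output all stay close to or below 1.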
In this case the boost is extremely high.
I have looked at the solrindexer plugin and found this at line 123:
inputDoc.setDocumentBoost(doc.getWeight());
IndexerMapReduce.java (src/java/org/apache/nutch/indexer) has something
similar at line 316.
I think this increases the boost for all documents:
// apply boost to all indexed fields.
doc.setWeight(boost);
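As far as I can tell, these two snippets chain together: the boost computed during indexing is stored as the document weight and later handed to Solr unchanged. The sketch below reproduces that flow with stand-in classes. FakeNutchDoc and FakeSolrDoc are hypothetical and only mimic the setWeight/getWeight and setDocumentBoost calls quoted above. The sanitizeBoost guard and its cap of 100 are also assumptions, shown only as one possible workaround, not as existing Nutch behaviour.

```java
public class BoostFlowSketch {

    // Stand-in for the Nutch document: only the weight accessors quoted above.
    static class FakeNutchDoc {
        private float weight = 1.0f;
        void setWeight(float w) { weight = w; }
        float getWeight() { return weight; }
    }

    // Stand-in for the Solr input document: only the document boost.
    static class FakeSolrDoc {
        private float documentBoost = 1.0f;
        void setDocumentBoost(float b) { documentBoost = b; }
        float getDocumentBoost() { return documentBoost; }
    }

    // Hypothetical guard: NaN or infinite values fall back to 1.0,
    // anything larger than maxBoost is capped at maxBoost.
    public static float sanitizeBoost(float boost, float maxBoost) {
        if (Float.isNaN(boost) || Float.isInfinite(boost)) {
            return 1.0f;
        }
        return Math.min(boost, maxBoost);
    }

    public static void main(String[] args) {
        FakeNutchDoc doc = new FakeNutchDoc();
        doc.setWeight(1422334460f); // a boost value taken from the incorrect output

        // Same shape as the call at line 123: the weight is passed through as is.
        FakeSolrDoc unguarded = new FakeSolrDoc();
        unguarded.setDocumentBoost(doc.getWeight());

        // Same call, but with the hypothetical guard applied first.
        FakeSolrDoc guarded = new FakeSolrDoc();
        guarded.setDocumentBoost(sanitizeBoost(doc.getWeight(), 100.0f));

        System.out.println("unguarded boost: " + unguarded.getDocumentBoost());
        System.out.println("guarded boost:   " + guarded.getDocumentBoost());
    }
}
```

A cap like this would only hide the symptom. It does not explain where the huge value comes from in the first place.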
I would really appreciate any advice or a solution for this problem.
Thanks in advance.