See https://issues.apache.org/jira/browse/NUTCH-1958 and the reference to a related discussion. The choice of scoring depends on the nature of your crawl. You can also use no scoring filter at all, in which case all the docs will get a boost of 1.
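As a rough sketch of that last option, assuming the plugin.includes value quoted below is the one actually in use in nutch-site.xml, dropping scoring-opic from it would leave the following (the description element is omitted here for brevity):

```xml
<!-- Sketch only: the quoted plugin.includes value with "|scoring-opic" removed.
     Assumes no other scoring-* plugin is enabled elsewhere in the configuration. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-(domain|regex|domainblacklist)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata|required)|indexer-solr|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|mimetype-filter|mimetype-alias-filter</value>
</property>
```

This is untested against the setup described in the thread; every other entry in the value is unchanged from the quoted configuration.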
On 20 May 2015 at 20:55, Eyeris RodrIguez Rueda <[email protected]> wrote:

> Yes Julien.
> I'm using only scoring-opic. This is my plugin.includes property.
> I have attached my nutch-site.xml.
> Is there any problem with scoring-opic?
> Do you recommend that I use another scoring filter (depth or link)?
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-(http|httpclient)|urlfilter-(domain|regex|domainblacklist)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata|required)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|mimetype-filter|mimetype-alias-filter</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin. By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please enable
> protocol-httpclient, but be aware of possible intermittent problems with the
> underlying commons-httpclient library.
> </description>
> </property>
>
> ----- Original message -----
> From: "Julien Nioche" <[email protected]>
> To: [email protected]
> Sent: Wednesday, 20 May 2015 15:06:38
> Subject: [MASSMAIL]Re: about boost field extremely high
>
> Hi Eyeris
>
> The boost value is simply the output of what the ScoringFilters give for a
> document. Are you using OPIC?
>
> Julien
>
> On 20 May 2015 at 19:32, Eyeris RodrIguez Rueda <[email protected]> wrote:
>
> > Hi all.
> > I'm using Nutch 1.9 in local mode and Solr 4.10 with half a million
> > documents.
> > An adaptive fetch schedule is being used to crawl pages that change
> > frequently.
> > I have detected that Nutch is calculating an extremely high boost for some
> > documents, and the document score in Solr is extremely high for these
> > documents. As a consequence, the order of documents is changed by this
> > wrong boost.
> > This is a correct Solr output for me using the "cubadebate" query:
> > *******************************
> > {
> >   "responseHeader": {
> >     "status": 0,
> >     "QTime": 195
> >   },
> >   "response": {
> >     "numFound": 183486,
> >     "start": 0,
> >     "maxScore": 2.7115784,
> >     "docs": [
> >       {
> >         "url": "http://www.cubadebate.cu/",
> >         "boost": 1.0175576,
> >         "score": 2.7115784
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/editores/preguntas-frecuentes/",
> >         "boost": 0.11512774,
> >         "score": 0.59315777
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/editores/",
> >         "boost": 0.16240995,
> >         "score": 0.50842094
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/feed/",
> >         "boost": 0.8635264,
> >         "score": 0.42501986
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/etiqueta/cine/",
> >         "boost": 0.13792185,
> >         "score": 0.3541832
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/web2/",
> >         "boost": 0.114989564,
> >         "score": 0.3389473
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/opinion/2015/03/06/diferencias-conciliables/",
> >         "boost": 0.18748672,
> >         "score": 0.28334656
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/noticias/2015/02/02/freddy-asiel-voy-por-el-desquite/",
> >         "boost": 0.13997546,
> >         "score": 0.28334656
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/especiales/2015/03/05/querido-hugo/",
> >         "boost": 0.13172969,
> >         "score": 0.28334656
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/noticias/2015/02/08/grammys-la-lista-completa-de-los-ganadores/comment-page-1/",
> >         "boost": 0.12959023,
> >         "score": 0.24792825
> >       }
> >     ]
> >   },
> > ***********************************************
> > This is an incorrect Solr output using the "cubadebate" query:
> > {
> >   "responseHeader": {
> >     "status": 0,
> >     "QTime": 111
> >   },
> >   "response": {
> >     "numFound": 172952,
> >     "start": 0,
> >     "maxScore": 22939964,
> >     "docs": [
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/seccion-temas/1088-yo-tambien-estoy-en-la-celac",
> >         "boost": 1422334460,
> >         "score": 22939964
> >       },
> >       {
> >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14065-domadores-de-cuba-enfrentaran-a-guerreros-de-mexico-en-semifinal-de-la-v-serie-mundial-de-boxeo",
> >         "boost": 1675646080,
> >         "score": 22476484
> >       },
> >       {
> >         "url": "http://www.radiohc.cu/noticias/deportes/page/387",
> >         "boost": 1325039870,
> >         "score": 21191032
> >       },
> >       {
> >         "url": "http://www.perlavision.icrt.cu/index.php/bloqueo/13922-nacera-en-mayo-engage-cuba-un-vigoroso-lobby-antibloqueo-en-congreso-de-eeuu",
> >         "boost": 1663792640,
> >         "score": 18730402
> >       },
> >       {
> >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14004-cuba-en-semifinales-de-serie-mundial-el-proximo-mes",
> >         "boost": 1528675840,
> >         "score": 18730402
> >       },
> >       {
> >         "url": "http://www.radiohc.cu/noticias/ciencias/page/76",
> >         "boost": 1326217090,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.radiohc.cu/noticias/cultura/page/272",
> >         "boost": 1327128190,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1060-beisbol-cubano-sera-el-tema-de-la-mesa-redonda-en-sus-emisiones-de-miercoles-y-jueves",
> >         "boost": 1424298370,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1073-el-programa-nacional-de-medicamentos-en-la-mesa-redonda-miercoles-y-jueves",
> >         "boost": 1424231940,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/897-la-mesa-redonda-presentara-miercoles-y-jueves-las-cooerativas-no-agropecuarias-p",
> >         "boost": 1424386690,
> >         "score": 18542152
> >       }
> >     ]
> >   },
> > In this case the boost is extremely high.
> > I have looked at the solrindexer plugin and I have seen this at line 123:
> > inputDoc.setDocumentBoost(doc.getWeight());
> >
> > In IndexerMapReduce.java (src/java/org/apache/nutch/indexer), at line 316,
> > there is something similar. I think this increases the boost for all
> > documents:
> > // apply boost to all indexed fields.
> > doc.setWeight(boost);
> >
> > I would really appreciate any advice or solution for this problem.
> > Thanks in advance.
> >
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

