Yes, Julien, I'm using only scoring-opic. This is my plugin.includes property; I have also attached my nutch-site.xml. Is there any problem with scoring-opic? Do you recommend that I use another scoring plugin (depth or link)?

<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-(domain|regex|domainblacklist)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata|required)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|mimetype-filter|mimetype-alias-filter</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

----- Original message -----
From: "Julien Nioche" <[email protected]>
To: [email protected]
Sent: Wednesday, 20 May 2015 15:06:38
Subject: [MASSMAIL]Re: about boost field extremely high

Hi Eyeris

The boost value is simply the output of what the ScoringFilters give for a document. Are you using OPIC?

Julien

On 20 May 2015 at 19:32, Eyeris RodrIguez Rueda <[email protected]> wrote:

> Hi all.
> I'm using Nutch 1.9 in local mode and Solr 4.10 with half a million
> documents.
> An adaptive fetch schedule is being used to crawl pages that change
> frequently.
> I have detected that Nutch is calculating an extremely high boost for some
> documents and the document score in Solr is extremely high for these
> documents, and
> in consequence the order of documents is changed by this wrong boost.
> This is a correct Solr output for me using the "cubadebate" query:
> *******************************
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 195
>   },
>   "response": {
>     "numFound": 183486,
>     "start": 0,
>     "maxScore": 2.7115784,
>     "docs": [
>       {
>         "url": "http://www.cubadebate.cu/",
>         "boost": 1.0175576,
>         "score": 2.7115784
>       },
>       {
>         "url": "http://www.cubadebate.cu/editores/preguntas-frecuentes/",
>         "boost": 0.11512774,
>         "score": 0.59315777
>       },
>       {
>         "url": "http://www.cubadebate.cu/editores/",
>         "boost": 0.16240995,
>         "score": 0.50842094
>       },
>       {
>         "url": "http://www.cubadebate.cu/feed/",
>         "boost": 0.8635264,
>         "score": 0.42501986
>       },
>       {
>         "url": "http://www.cubadebate.cu/etiqueta/cine/",
>         "boost": 0.13792185,
>         "score": 0.3541832
>       },
>       {
>         "url": "http://www.cubadebate.cu/web2/",
>         "boost": 0.114989564,
>         "score": 0.3389473
>       },
>       {
>         "url": "http://www.cubadebate.cu/opinion/2015/03/06/diferencias-conciliables/",
>         "boost": 0.18748672,
>         "score": 0.28334656
>       },
>       {
>         "url": "http://www.cubadebate.cu/noticias/2015/02/02/freddy-asiel-voy-por-el-desquite/",
>         "boost": 0.13997546,
>         "score": 0.28334656
>       },
>       {
>         "url": "http://www.cubadebate.cu/especiales/2015/03/05/querido-hugo/",
>         "boost": 0.13172969,
>         "score": 0.28334656
>       },
>       {
>         "url": "http://www.cubadebate.cu/noticias/2015/02/08/grammys-la-lista-completa-de-los-ganadores/comment-page-1/",
>         "boost": 0.12959023,
>         "score": 0.24792825
>       }
>     ]
>   },
> ***********************************************
> This is an incorrect Solr output using the "cubadebate" query:
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 111
>   },
>   "response": {
>     "numFound": 172952,
>     "start": 0,
>     "maxScore": 22939964,
>     "docs": [
>       {
>         "url": "http://www.tvcubana.icrt.cu/seccion-temas/1088-yo-tambien-estoy-en-la-celac",
>         "boost": 1422334460,
>         "score": 22939964
>       },
>       {
>         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14065-domadores-de-cuba-enfrentaran-a-guerreros-de-mexico-en-semifinal-de-la-v-serie-mundial-de-boxeo",
>         "boost": 1675646080,
>         "score": 22476484
>       },
>       {
>         "url": "http://www.radiohc.cu/noticias/deportes/page/387",
>         "boost": 1325039870,
>         "score": 21191032
>       },
>       {
>         "url": "http://www.perlavision.icrt.cu/index.php/bloqueo/13922-nacera-en-mayo-engage-cuba-un-vigoroso-lobby-antibloqueo-en-congreso-de-eeuu",
>         "boost": 1663792640,
>         "score": 18730402
>       },
>       {
>         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14004-cuba-en-semifinales-de-serie-mundial-el-proximo-mes",
>         "boost": 1528675840,
>         "score": 18730402
>       },
>       {
>         "url": "http://www.radiohc.cu/noticias/ciencias/page/76",
>         "boost": 1326217090,
>         "score": 18542152
>       },
>       {
>         "url": "http://www.radiohc.cu/noticias/cultura/page/272",
>         "boost": 1327128190,
>         "score": 18542152
>       },
>       {
>         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1060-beisbol-cubano-sera-el-tema-de-la-mesa-redonda-en-sus-emisiones-de-miercoles-y-jueves",
>         "boost": 1424298370,
>         "score": 18542152
>       },
>       {
>         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1073-el-programa-nacional-de-medicamentos-en-la-mesa-redonda-miercoles-y-jueves",
>         "boost": 1424231940,
>         "score": 18542152
>       },
>       {
>         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/897-la-mesa-redonda-presentara-miercoles-y-jueves-las-cooerativas-no-agropecuarias-p",
>         "boost": 1424386690,
>         "score": 18542152
>       }
>     ]
>   },
>
> In this case the boost is extremely high.
> I have looked at the solrindexer plugin and found this at line 123:
> inputDoc.setDocumentBoost(doc.getWeight());
>
> There is something similar in IndexerMapReduce.java
> (src/java/org/apache/nutch/indexer) at line 316;
> I think this increases the boost for all documents:
> // apply boost to all indexed fields.
> doc.setWeight(boost);
>
> I would really appreciate any advice or a solution for this problem.
> Thanks in advance.
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

nutch-site.xml
Description: XML document
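
To put the boost values quoted above in perspective, here is a minimal sketch that maps a boost back to the CrawlDb score it would imply. It assumes that scoring-opic derives the indexing boost as score^power, with the power taken from `indexer.score.power` (default 0.5 in nutch-default.xml); that formula is my assumption and is not confirmed anywhere in this thread. The class `BoostSketch` and its methods are hypothetical helpers for illustration, not part of the Nutch API.

```java
public class BoostSketch {
    // Assumption: default value of indexer.score.power in nutch-default.xml.
    static final double SCORE_POWER = 0.5;

    // Assumed OPIC-style boost: CrawlDb score raised to SCORE_POWER,
    // multiplied by the initial boost handed to the scoring filter.
    static double boostFromScore(double crawlDbScore, double initBoost) {
        return Math.pow(crawlDbScore, SCORE_POWER) * initBoost;
    }

    // Inverse of the formula above for an initial boost of 1.0:
    // the CrawlDb score that a given boost would imply.
    static double scoreFromBoost(double boost) {
        return Math.pow(boost, 1.0 / SCORE_POWER);
    }

    public static void main(String[] args) {
        // Boost values copied from the two Solr outputs in this thread.
        double[] boosts = {1.0175576, 0.11512774, 1422334460d, 1675646080d};
        for (double b : boosts) {
            System.out.printf("boost=%.3e -> implied CrawlDb score=%.3e%n",
                    b, scoreFromBoost(b));
        }
    }
}
```

If that assumption holds, boosts around 1.4e9 would correspond to CrawlDb scores on the order of 1e18, which would point at the scores stored in the CrawlDb (they can be inspected with `bin/nutch readdb`) rather than at the indexer lines quoted above.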

