See https://issues.apache.org/jira/browse/NUTCH-1958 and the reference to a
related discussion. The choice of scoring depends on the nature of your
crawl. You can also use no scoring filter at all, in which case all the
docs will get a boost of 1.
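
As a rough sketch of that last option, assuming the plugin.includes value
you posted below, the property in nutch-site.xml would look something like
this with scoring-opic taken out. Everything else is copied from your
message, and I have not tested it against your setup:

```xml
<!-- Sketch only: the posted plugin.includes value with scoring-opic removed. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-(domain|regex|domainblacklist)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata|required)|indexer-solr|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|mimetype-filter|mimetype-alias-filter</value>
</property>
```

Documents that are already in Solr keep the boost they were indexed with
until they are indexed again.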


On 20 May 2015 at 20:55, Eyeris RodrIguez Rueda <[email protected]> wrote:

> Yes Julien.
> I'm using only scoring-opic. This is my plugin.includes property.
> I have attached my nutch-site.xml.
> Is there any problem with scoring-opic?
> Do you recommend that I use another scoring plugin (depth or link)?
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|httpclient)|urlfilter-(domain|regex|domainblacklist)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata|required)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|mimetype-filter|mimetype-alias-filter</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
>
>
> ----- Original message -----
> From: "Julien Nioche" <[email protected]>
> To: [email protected]
> Sent: Wednesday, 20 May 2015 15:06:38
> Subject: [MASSMAIL]Re: about boost field extremely high
>
> Hi Eyeris
>
> The boost value is simply the output of what the ScoringFilters give for a
> document. Are you using OPIC?
>
> Julien
>
> On 20 May 2015 at 19:32, Eyeris RodrIguez Rueda <[email protected]> wrote:
>
> > Hi all.
> > I'm using Nutch 1.9 in local mode and Solr 4.10 with half a million
> > documents.
> > An adaptive fetch schedule is being used to crawl pages that change
> > frequently.
> > I have detected that Nutch is calculating an extremely high boost for some
> > documents, so the document score in Solr is extremely high for these
> > documents, and as a consequence the order of documents is changed by this
> > wrong boost.
> > This is a correct Solr output for me using the "cubadebate" query:
> > *******************************
> > {
> >   "responseHeader": {
> >     "status": 0,
> >     "QTime": 195
> >   },
> >   "response": {
> >     "numFound": 183486,
> >     "start": 0,
> >     "maxScore": 2.7115784,
> >     "docs": [
> >       {
> >         "url": "http://www.cubadebate.cu/",
> >         "boost": 1.0175576,
> >         "score": 2.7115784
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/editores/preguntas-frecuentes/",
> >         "boost": 0.11512774,
> >         "score": 0.59315777
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/editores/",
> >         "boost": 0.16240995,
> >         "score": 0.50842094
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/feed/",
> >         "boost": 0.8635264,
> >         "score": 0.42501986
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/etiqueta/cine/",
> >         "boost": 0.13792185,
> >         "score": 0.3541832
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/web2/",
> >         "boost": 0.114989564,
> >         "score": 0.3389473
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/opinion/2015/03/06/diferencias-conciliables/",
> >         "boost": 0.18748672,
> >         "score": 0.28334656
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/noticias/2015/02/02/freddy-asiel-voy-por-el-desquite/",
> >         "boost": 0.13997546,
> >         "score": 0.28334656
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/especiales/2015/03/05/querido-hugo/",
> >         "boost": 0.13172969,
> >         "score": 0.28334656
> >       },
> >       {
> >         "url": "http://www.cubadebate.cu/noticias/2015/02/08/grammys-la-lista-completa-de-los-ganadores/comment-page-1/",
> >         "boost": 0.12959023,
> >         "score": 0.24792825
> >       }
> >     ]
> >   },
> > ***********************************************
> > This is an incorrect Solr output using the "cubadebate" query:
> > {
> >   "responseHeader": {
> >     "status": 0,
> >     "QTime": 111
> >   },
> >   "response": {
> >     "numFound": 172952,
> >     "start": 0,
> >     "maxScore": 22939964,
> >     "docs": [
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/seccion-temas/1088-yo-tambien-estoy-en-la-celac",
> >         "boost": 1422334460,
> >         "score": 22939964
> >       },
> >       {
> >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14065-domadores-de-cuba-enfrentaran-a-guerreros-de-mexico-en-semifinal-de-la-v-serie-mundial-de-boxeo",
> >         "boost": 1675646080,
> >         "score": 22476484
> >       },
> >       {
> >         "url": "http://www.radiohc.cu/noticias/deportes/page/387",
> >         "boost": 1325039870,
> >         "score": 21191032
> >       },
> >       {
> >         "url": "http://www.perlavision.icrt.cu/index.php/bloqueo/13922-nacera-en-mayo-engage-cuba-un-vigoroso-lobby-antibloqueo-en-congreso-de-eeuu",
> >         "boost": 1663792640,
> >         "score": 18730402
> >       },
> >       {
> >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14004-cuba-en-semifinales-de-serie-mundial-el-proximo-mes",
> >         "boost": 1528675840,
> >         "score": 18730402
> >       },
> >       {
> >         "url": "http://www.radiohc.cu/noticias/ciencias/page/76",
> >         "boost": 1326217090,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.radiohc.cu/noticias/cultura/page/272",
> >         "boost": 1327128190,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1060-beisbol-cubano-sera-el-tema-de-la-mesa-redonda-en-sus-emisiones-de-miercoles-y-jueves",
> >         "boost": 1424298370,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1073-el-programa-nacional-de-medicamentos-en-la-mesa-redonda-miercoles-y-jueves",
> >         "boost": 1424231940,
> >         "score": 18542152
> >       },
> >       {
> >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/897-la-mesa-redonda-presentara-miercoles-y-jueves-las-cooerativas-no-agropecuarias-p",
> >         "boost": 1424386690,
> >         "score": 18542152
> >       }
> >     ]
> >   },
> >
> > In this case the boost is extremely high.
> > I have looked at the solrindexer plugin and I have seen this at line 123:
> >   inputDoc.setDocumentBoost(doc.getWeight());
> >
> > In IndexerMapReduce.java (src/java/org/apache/nutch/indexer), at line 316,
> > there is something similar. I think this increases the boost for all
> > documents:
> >  // apply boost to all indexed fields.
> >     doc.setWeight(boost);
> >
> > I would really appreciate any advice or a solution for this problem.
> > Thanks in advance.
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
