RE: Document scores(boost)

Imtiaz Shakil Siddique Thu, 10 Sep 2015 15:11:06 -0700

The reason I'm concerned about the OPIC scoring algorithm is that it sometimes
gives spam/useless sites a very high document boost. This creates a
lot of problems when searching in Solr, as irrelevant documents get a higher
score even if their tf-idf value is small.


If I ignore the internal links in webgraph, then the (webgraph > linkrank >
scoreupdater) combination would be a better choice than OPIC. Am I correct?
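
To make the question concrete, here is a minimal sketch of what I mean by
ignoring internal links before scoring. It is a toy PageRank-style iteration
in Python, not Nutch's actual LinkRank implementation; the function names,
the example URLs, and the damping/iteration values are all my own
assumptions for illustration.

```python
from urllib.parse import urlparse


def external_links_only(links):
    """Drop links whose source and target share a host (internal links)."""
    return [
        (src, dst)
        for src, dst in links
        if urlparse(src).netloc != urlparse(dst).netloc
    ]


def link_scores(links, iterations=20, damping=0.85):
    """Toy PageRank-style iteration; NOT Nutch's LinkRank implementation."""
    nodes = sorted({url for pair in links for url in pair})
    if not nodes:
        return {}
    outlinks = {n: [] for n in nodes}
    for src, dst in links:
        outlinks[src].append(dst)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in outlinks.items():
            if not targets:
                continue  # dangling node: its score is simply not passed on
            share = damping * scores[src] / len(targets)
            for dst in targets:
                new[dst] += share
        scores = new
    return scores


# Hypothetical link data: one site linking heavily to itself, plus a
# single link between two different hosts.
links = [
    ("http://spam.example/a", "http://spam.example/b"),
    ("http://spam.example/b", "http://spam.example/a"),
    ("http://spam.example/c", "http://spam.example/a"),
    ("http://news.example/x", "http://blog.example/y"),
]

with_internal = link_scores(links)
without_internal = link_scores(external_links_only(links))
```

In this toy data the self-linking site collects most of the score when
internal links are counted, and drops out of the graph entirely once they
are filtered.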

Thank you for your advice Sir.
Imtiaz Shakil Siddique
On Sep 11, 2015 12:39 AM, "Markus Jelsma" <[email protected]>
wrote:

> Hello, if you are really interested in having offline scores calculated
> then ideally you must perform those jobs after updating the DB and before
> indexing, at each cycle, because you probably get new data. However, you can
> also use it asynchronously by periodically dumping the scores to a flat file
> (NodeDumper can do that). Solr can then read that file as an External File
> Field.
>
> But again, only if you really need it. By default the webgraph ignores
> internal links, for good reasons, as the graph will become too dense and
> internal scores are not very useful. In almost all cases you don't need
> it; it is only worthwhile if you are going to crawl very large portions of
> the web. In most cases, TF*IDF or BM25 scoring in Solr/Lucene is superior.
>
> Markus
>
>
> -----Original message-----
> > From:Imtiaz Shakil Siddique <[email protected]>
> > Sent: Thursday 10th September 2015 19:11
> > To: [email protected]
> > Subject: RE: Document scores(boost)
> >
> > Hello Markus Jelsma,
> >
> > Thank you for the advice. But this score calculation is done after the
> > data is indexed to Solr. So when the scores are updated inside the
> > crawldb, Solr won't get them.
> >
> > I think a workaround for this problem would be moving the Solr indexing
> > phase to the end of all the operations.
> > But one thing I'm not clear about is how often I should run these
> > webgraph update commands.
> >
> > Thank you,
> > Imtiaz Shakil Siddique
> > On Sep 10, 2015 8:36 PM, "Markus Jelsma" <[email protected]> wrote:
> >
> > > Yes, removing OPIC from the config will simply disable it.
> > >
> > > The webgraph program will create a webgraph data structure for the
> > > specified segments. The linkrank program will then calculate the
> > > scores for each node. Finally, the scoreupdater writes the scores from
> > > the webgraph back into the crawldb. This program is very intensive.
> > > Use it only if you really need it.
> > >
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Imtiaz Shakil Siddique <[email protected]>
> > > > Sent: Thursday 10th September 2015 16:04
> > > > To: [email protected]
> > > > Subject: Re: Document scores(boost)
> > > >
> > > > Hello Markus Jelsma,
> > > >
> > > > So you are suggesting that I should
> > > > 1. remove "scoring-opic" plugin
> > > > 2. run the webgraph > linkrank > scoreupdater from /bin/crawl script
> > > > if I want to calculate document boost with all segments in hand.
> > > >
> > > >
> > > > It'd be very helpful if you could explain what these four things do
> > > > (webgraph, linkrank, scoreupdater, nodedumper).
> > > >
> > > > Thank you so much for the help.
> > > > Imtiaz Shakil Siddique
> > > >
> > > >
> > > > On 10 September 2015 at 19:27, Markus Jelsma <[email protected]> wrote:
> > > >
> > > > > Hello - OPIC is useless in incremental crawls. You can either
> > > > > disable scoring altogether, or use webgraph > linkrank > scoreupdater.
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From:Imtiaz Shakil Siddique <[email protected]>
> > > > > > Sent: Wednesday 9th September 2015 23:09
> > > > > > To: [email protected]
> > > > > > Subject: Document scores(boost)
> > > > > >
> > > > > > Hello,
> > > > > > I've been using Nutch 1.9/1.10 for about six months. One thing I
> > > > > > noticed is that at each iteration (during the parsing phase) Nutch
> > > > > > calculates a document boost (using the OPIC algorithm).
> > > > > >
> > > > > > 1. My question is how this score is adjusted with respect to all
> > > > > > the segments.
> > > > > >
> > > > > > 2. Another question: inside the bin/crawl script, what do
> > > > > > webgraph, linkrank, scoreupdater, and nodedumper do? Can anyone be
> > > > > > kind enough to explain?
> > > > > >
> > > > > > Thank you so much.
> > > > > > Imtiaz Shakil Siddique
> > > > > >
> > > > >
> > > >
> > >
> >
>
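
For reference, the flat-file route Markus describes in the quoted reply (dump
the scores periodically, then let Solr read them as an External File Field)
could be sketched roughly as follows. Solr's external file is plain text with
one `key=value` line per document. The tab-separated `url<TAB>score` input
layout is only my assumption about what the dump looks like, the function
name is made up, and I have not verified how keys that themselves contain
`=` are handled.

```python
import io


def dump_to_external_file(dump_lines, out):
    """Convert 'url<TAB>score' lines into 'url=score' lines.

    The tab-separated input layout is an assumption about the dump,
    not a documented NodeDumper format.
    """
    written = 0
    for line in dump_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        url, _, score = line.rpartition("\t")
        if not url:
            continue  # no tab found: skip malformed line
        out.write(f"{url}={float(score)}\n")
        written += 1
    return written


# Hypothetical sample input standing in for a real score dump.
sample = ["http://a.example/\t0.25", "", "http://b.example/page\t1.5"]
buf = io.StringIO()
count = dump_to_external_file(sample, buf)
```

In a real setup the output would go to a file in the Solr index data
directory instead of an in-memory buffer, so the scores can be refreshed
without reindexing.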
