The reason I'm concerned about the OPIC scoring algorithm is that it sometimes gives spam/useless sites a very high document boost. This creates a lot of problems when searching in Solr, as irrelevant documents get a higher score even if their tf-idf value is small.
If I ignore the internal links in the webgraph, then the (webgraph > linkrank > scoreupdater) combination would be a better choice than OPIC. Am I correct?

Thank you for your advice, Sir.
Imtiaz Shakil Siddique

On Sep 11, 2015 12:39 AM, "Markus Jelsma" <[email protected]> wrote:

> Hello, if you are really interested in having offline scores calculated,
> then ideally you must perform those jobs after updating the DB and before
> indexing, at each cycle, because you probably get new data. However, you can
> also use it asynchronously by periodically dumping the scores to a flat file
> (NodeDumper can do that). Solr can then read that file as an External File
> Field.
>
> But again, only if you really need it. By default the webgraph ignores
> internal links, for good reasons, as the graph will become too dense and
> internal scores are not very useful. In almost all cases you don't need
> it, only if you are going to crawl very large portions of the web. In most
> cases, TF*IDF or BM25 scoring in Solr/Lucene is superior.
>
> Markus
>
> -----Original message-----
> > From: Imtiaz Shakil Siddique <[email protected]>
> > Sent: Thursday 10th September 2015 19:11
> > To: [email protected]
> > Subject: RE: Document scores(boost)
> >
> > Hello Markus Jelsma,
> >
> > Thank you for the advice. But this score calculation is done after the
> > data is indexed to Solr. So when the scores are updated inside the
> > crawldb, Solr won't get them.
> >
> > I think a workaround for this problem would be shifting the Solr index
> > phase to the end of all the operations.
> > But one thing I'm not clear on is how often I should run these webgraph
> > update commands.
> >
> > Thank you,
> > Imtiaz Shakil Siddique
> > On Sep 10, 2015 8:36 PM, "Markus Jelsma" <[email protected]> wrote:
> >
> > > Yes, removing OPIC from the config will simply disable it.
> > >
> > > The webgraph program will create a webgraph data structure for the
> > > specified segments. The linkrank program will then calculate the
> > > scores for each node. Finally, the scoreupdater writes the score from
> > > the webgraph back into the crawldb. This program is very intensive.
> > > Use it only if you really need it.
> > >
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Imtiaz Shakil Siddique <[email protected]>
> > > > Sent: Thursday 10th September 2015 16:04
> > > > To: [email protected]
> > > > Subject: Re: Document scores(boost)
> > > >
> > > > Hello Markus Jelsma,
> > > >
> > > > So you are suggesting that I should
> > > > 1. remove the "scoring-opic" plugin
> > > > 2. run webgraph > linkrank > scoreupdater from the /bin/crawl script
> > > > if I want to calculate the document boost with all segments in hand.
> > > >
> > > > It'd be very helpful if you could explain what these four things do
> > > > (webgraph, linkrank, scoreupdater, nodedumper).
> > > >
> > > > Thank you so much for the help.
> > > > Imtiaz Shakil Siddique
> > > >
> > > > On 10 September 2015 at 19:27, Markus Jelsma <[email protected]> wrote:
> > > >
> > > > > Hello - OPIC is useless in incremental crawls. You can either
> > > > > disable scoring altogether, or use webgraph > linkrank >
> > > > > scoreupdater.
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From: Imtiaz Shakil Siddique <[email protected]>
> > > > > > Sent: Wednesday 9th September 2015 23:09
> > > > > > To: [email protected]
> > > > > > Subject: Document scores(boost)
> > > > > >
> > > > > > Hello,
> > > > > > I've been using Nutch 1.9/1.10 for about six months. One thing I
> > > > > > noticed is that at each iteration (during the parsing phase) Nutch
> > > > > > calculates a document boost (using the OPIC algorithm).
> > > > > >
> > > > > > 1. My question is how this score is adjusted with respect to all
> > > > > > the segments.
> > > > > >
> > > > > > 2. Another question: inside the bin/crawl script, what do
> > > > > > webgraph, linkrank, scoreupdater, and nodedumper do? Can anyone be
> > > > > > kind enough to explain?
> > > > > >
> > > > > > Thank you so much.
> > > > > > Imtiaz Shakil Siddique
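
The External File Field route mentioned in the thread (dump the scores to a flat file, let Solr read it) could be sketched roughly as below. This is a minimal Python sketch and not part of Nutch or Solr. It assumes the NodeDumper output is plain text with one `url<TAB>score` pair per line, which should be checked against a real dump, and it rewrites each pair into the `key=value` line format that Solr's ExternalFileField reads. The function name `to_external_file_lines` is hypothetical.

```python
from typing import Iterable, Iterator


def to_external_file_lines(dump_lines: Iterable[str]) -> Iterator[str]:
    """Turn 'url<TAB>score' lines into 'url=score' lines.

    Assumption: each dump line holds a URL and a float score separated
    by a tab. Blank lines and lines that do not fit are skipped.
    """
    for raw in dump_lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the score is always the final column.
        url, sep, score = line.rpartition("\t")
        if not sep:
            continue  # no tab found, not a url/score pair
        try:
            value = float(score)
        except ValueError:
            continue  # score column is not numeric
        yield f"{url.strip()}={value}"


if __name__ == "__main__":
    sample = [
        "http://a.example/\t0.25\n",
        "not a score line\n",
        "http://b.example/page\t1.5\n",
    ]
    for out in to_external_file_lines(sample):
        print(out)
```

The resulting lines would go into a file named `external_<fieldname>` in the Solr index data directory, keyed by whichever field the schema declares as the key field (the URL here, which is an assumption about the schema).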

