Hi, 

As you pointed out, the generatorSortValue() method does not have access to the 
ParseData object (which holds all the information the parsers extract from the 
raw page content). This method is used by the Generator class at a very early 
stage of the crawl, well before the URLs have been fetched or parsed (parsing 
is where the outlinks, i.e. the new links, come from). 

The best approach is to use generatorSortValue(), which assigns the initial 
sort score and, as you figured out, will get you where you want. 

How do you put your ismarked key into the CrawlDatum? Do you put it in the 
metadata? Perhaps you could alter the score in the CrawlDatum directly, since 
the default implementation of this method in the scoring plugins is: 
datum.getScore() * initSort;
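
A rough sketch of that idea follows. It uses plain Java types as stand-ins for 
the Nutch classes (a Map instead of the CrawlDatum metadata), so it runs on its 
own. The class name, the "true" value convention and the boost factor are my 
assumptions, not Nutch code; only the ismarked key and the score * initSort 
default come from this thread.

```java
import java.util.Map;

// Stand-in sketch: plain Java types replace Nutch's Text, CrawlDatum and MapWritable.
final class SortValueSketch {
    // Metadata key from this thread; storing the string "true" under it is an assumption.
    static final String MARK_KEY = "ismarked";
    // Hypothetical boost; the original question asked for a factor of ten.
    static final float MARK_BOOST = 10.0f;

    // Starts from the default behaviour (score * initSort), then boosts marked entries.
    static float generatorSortValue(Map<String, String> metadata, float score, float initSort) {
        float sort = score * initSort;
        if ("true".equals(metadata.get(MARK_KEY))) {
            sort *= MARK_BOOST;
        }
        return sort;
    }
}
```

In a real plugin the same logic would sit inside your ScoringFilter's 
generatorSortValue(), reading the flag from the CrawlDatum metadata instead of 
a Map.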

Taking into account what you’re trying to do, I think you could use the 
passScoreAfterParsing() method of the ScoringFilter interface. This method 
gets called by the Fetcher after parsing is done, so you’ll have access to the 
parse metadata and can alter the value there. I’m not sure this will work, but 
it is at least worth checking out. One open question with this approach is 
whether the CrawlDatum score is kept in sync with the Parse/Content score.
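
To make the idea concrete, here is a minimal sketch of the language rule 
applied after parsing. Again it uses a plain Map in place of the Nutch parse 
metadata so that it is self-contained. The "lang" and "score" keys and the 
class name are hypothetical placeholders; check which keys your language 
identifier and scoring plugins actually use.

```java
import java.util.Map;

// Stand-in sketch: a Map replaces the parse metadata available after parsing.
final class AfterParseScoreSketch {
    // Hypothetical key under which the detected language is stored.
    static final String LANG_KEY = "lang";
    // Hypothetical key under which the score is passed along.
    static final String SCORE_KEY = "score";

    // Halves the score when a language was detected and it is not "fr",
    // then writes the result back into the metadata for later steps.
    static void passScoreAfterParsing(Map<String, String> parseMeta, float contentScore) {
        float score = contentScore;
        String lang = parseMeta.get(LANG_KEY);
        if (lang != null && !"fr".equals(lang)) {
            score /= 2.0f;
        }
        parseMeta.put(SCORE_KEY, Float.toString(score));
    }
}
```

Pages with no detected language are left untouched here, which is a design 
choice you may want to change.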

Regards,

On Sep 10, 2014, at 3:24 AM, Benjamin Derei <[email protected]> wrote:

> Hello,
> 
> I'm using Nutch 1.9.
> I want to alter the score used to sort the topN pages for the next parsing.
> I got it working by modifying the return value of generatorSortValue() in a 
> scoring filter plugin.
> But this function doesn't have the anchor text among its inputs...
> I wrote some inelegant and inefficient code that puts an "ismarked" key in 
> the CrawlDatum to record whether the anchor text or URL contains certain 
> words... From which function should I do this?
> Is there a complete diagram of the data path through each plugin type's 
> functions?
> 
> Benjamin.
> 
> Sent from my iPad
> 
>> On Sep 10, 2014, at 04:02, Jorge Luis Betancourt Gonzalez 
>> <[email protected]> wrote:
>> 
>> You’ll need to write a couple of plugins to accomplish this. Which version 
>> of Nutch are you using? In the first case, is the score you want to alter 
>> the score that’s indexed into Solr (i.e. your backend)? 
>> 
>> Regards,
>> 
>>> On Sep 9, 2014, at 2:38 PM, Benjamin Derei <[email protected]> wrote:
>>> 
>>> hi,
>>> 
>>> I'm a beginner in Java and Nutch.
>>> 
>>> I want to steer the crawl with two rules:
>>> - if the language identifier plugin detects that a page is not "fr", the
>>> score for sorting should be divided by two.
>>> - if an anchor text or link targeting this page contains certain terms,
>>> the score for sorting should be multiplied by ten.
>>> 
>>> Any help ?
>>> 
>>> Benjamin.
>> 
>> Concurso "Mi selfie por los 5". Detalles en 
>> http://justiciaparaloscinco.wordpress.com

