Hi,

But where can I get the inlinks, including their URLs and anchors?

Ben.

Sent from my iPad

> On 10 Sept. 2014, at 16:02, Jorge Luis Betancourt Gonzalez 
> <[email protected]> wrote:
> 
> Hi, 
> 
> Actually, as you pointed out, the generatorSortValue() method does not have 
> access to the ParseData object (which holds all the info extracted by the 
> parsers from the raw content of the web page). Essentially, this method is 
> used in the Generator class at a very early stage of the crawling process, 
> well before the URLs have been fetched or parsed (which is where the 
> outlinks, i.e. the new links, come from). 
> 
> The best approach is to use generatorSortValue(), which assigns the 
> initial score and will (as you figured out) get you where you want. 
> 
> How do you put your ismarked key into CrawlDatum? Do you put it in the 
> metadata? Perhaps you could alter the score in CrawlDatum directly, since the 
> default implementation of the scoring plugins for this method is: 
> datum.getScore() * initSort;
> 
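A minimal sketch of that idea, written in plain Java so it runs without Nutch on the classpath. The Datum class is a hypothetical stand-in for CrawlDatum (only a score and a string metadata map), and the "ismarked" key and the factor of ten are assumptions taken from this thread, not Nutch defaults. In a real plugin the same logic would presumably live in the ScoringFilter's generatorSortValue() method.

```java
import java.util.HashMap;
import java.util.Map;

public class SortValueSketch {

    // Hypothetical stand-in for CrawlDatum: just a score and string metadata.
    static class Datum {
        float score;
        final Map<String, String> metaData = new HashMap<>();

        Datum(float score) {
            this.score = score;
        }
    }

    // Assumed metadata key and boost factor, both taken from this thread.
    static final String MARK_KEY = "ismarked";
    static final float MARK_BOOST = 10f;

    // Starts from the default datum.getScore() * initSort,
    // then boosts entries that were flagged earlier in the crawl.
    static float generatorSortValue(Datum datum, float initSort) {
        float sort = datum.score * initSort;
        if ("true".equals(datum.metaData.get(MARK_KEY))) {
            sort *= MARK_BOOST;
        }
        return sort;
    }

    public static void main(String[] args) {
        Datum plain = new Datum(0.5f);
        Datum marked = new Datum(0.5f);
        marked.metaData.put(MARK_KEY, "true");

        System.out.println("plain:  " + generatorSortValue(plain, 1.0f));
        System.out.println("marked: " + generatorSortValue(marked, 1.0f));
    }
}
```

Keeping the flag in the metadata and applying the boost only at sort time leaves the stored score untouched, so the boost is not compounded from one crawl round to the next.
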
> Taking into account what you’re trying to do, I think you could use the 
> passScoreAfterParsing() method of the ScoringFilter interface. This method 
> gets called by the Fetcher after the parse process is done, so you’ll have 
> access to the ParseMetadata and can alter this value. I’m not sure this 
> will work, but it is at least worth checking out. One open question with 
> this approach is whether the CrawlDatum score is synchronized with the 
> Parse/Content score.
> 
> Regards,
> 
>> On Sep 10, 2014, at 3:24 AM, Benjamin Derei <[email protected]> wrote:
>> 
>> Hello,
>> 
>> I'm using Nutch 1.9.
>> I want to alter the score used for sorting the topN pages for the next 
>> parsing.
>> I got this working by modifying the return value of generatorSortValue() 
>> in a ScoringFilter plugin.
>> But this function doesn't receive the anchor text as input...
>> I wrote some inelegant and inefficient code that puts an "ismarked" key in 
>> the CrawlDatum to record whether the anchor text or URL contains certain 
>> words... In which function should I do this?
>> Is there a complete diagram of the data path through each plugin type's 
>> functions?
>> 
>> Benjamin.
>> 
>> Sent from my iPad
>> 
>>> On 10 Sept. 2014, at 04:02, Jorge Luis Betancourt Gonzalez 
>>> <[email protected]> wrote:
>>> 
>>> You’ll need to write a couple of plugins to accomplish this. Which version 
>>> of Nutch are you using? In the first case, is the score you want to alter 
>>> the score that’s indexed into Solr (i.e. your backend)? 
>>> 
>>> Regards,
>>> 
>>>> On Sep 9, 2014, at 2:38 PM, Benjamin Derei <[email protected]> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I'm a beginner in Java and Nutch.
>>>> 
>>>> I want to steer the crawl with two rules:
>>>> - if the language identifier plugin detects that a page is not "fr", the
>>>> score used for sorting should be divided by two.
>>>> - if an anchor text or link targeting this page contains certain terms,
>>>> the score used for sorting should be multiplied by ten.
>>>> 
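The two rules above can be sketched as a single pure function, shown here in plain Java with no Nutch dependency. The class name, method names and parameters are hypothetical. How the language, the anchors and the term list reach this code inside a plugin is not covered, and treating an undetected language (null) as "no penalty" is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CrawlRulesSketch {

    // Rule 1: halve the score when the detected language is not "fr".
    // Rule 2: multiply by ten when the URL or any anchor text contains a term.
    static float adjust(float score, String lang, String url,
                        List<String> anchors, List<String> terms) {
        float result = score;
        // Assumption: a page with no detected language is not penalized.
        if (lang != null && !"fr".equalsIgnoreCase(lang)) {
            result /= 2f;
        }
        boolean anchorHit = anchors.stream().anyMatch(a -> containsTerm(a, terms));
        if (containsTerm(url, terms) || anchorHit) {
            result *= 10f;
        }
        return result;
    }

    // Case-insensitive substring match against every term.
    static boolean containsTerm(String text, List<String> terms) {
        if (text == null) {
            return false;
        }
        String lower = text.toLowerCase(Locale.ROOT);
        return terms.stream().anyMatch(t -> lower.contains(t.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("vin");
        List<String> anchors = Arrays.asList("Le Vin rouge");
        List<String> none = Collections.emptyList();

        System.out.println(adjust(1.0f, "fr", "http://example.org/", anchors, terms));
        System.out.println(adjust(1.0f, "en", "http://example.org/", none, terms));
    }
}
```

Both rules are multiplicative, so the order in which they are applied does not change the result.
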
>>>> Any help?
>>>> 
>>>> Benjamin.
>>> 
>>> Concurso "Mi selfie por los 5". Detalles en 
>>> http://justiciaparaloscinco.wordpress.com