Hi,

But where can I get the inlinks containing the URLs and anchors?
Ben.

Sent from my iPad

> On Sep 10, 2014, at 16:02, Jorge Luis Betancourt Gonzalez
> <[email protected]> wrote:
>
> Hi,
>
> Actually the generatorSortValue() method does not have access to the
> ParseData object (which holds all the info extracted by the parsers from
> the webpage raw content), as you pointed out. Essentially this method is
> used in the Generator class at a very early stage of the crawling process,
> well before the URLs have been fetched or parsed (which is where the
> outlinks, i.e. the new links, come from).
>
> The best approach is to use generatorSortValue(), which will assign the
> initial score and will actually (as you figured out) get you where you
> want.
>
> How do you put your ismarked key into CrawlDatum? Do you put it in the
> metadata? Perhaps you could alter the score in CrawlDatum directly, since
> the default implementation of the scoring plugins for this method is:
>
>     datum.getScore() * initSort;
>
> Taking into account what you’re trying to do, I think you could use the
> passScoreAfterParsing() method of the ScoringFilter interface. This method
> gets called by the Fetcher after the parse process is done, so you’ll have
> access to the ParseMetadata and you can alter this value. I’m not sure
> this will work, but it is at least worth checking out. One open question
> about this approach is whether the CrawlDatum score is synchronized with
> the Parse/Content score.
>
> Regards,
>
>> On Sep 10, 2014, at 3:24 AM, Benjamin Derei <[email protected]> wrote:
>>
>> Hello,
>>
>> I'm using Nutch 1.9.
>> I want to alter the score used for sorting the topN pages for the next
>> parsing.
>> I got it working by modifying the return value of generatorSortValue()
>> in a ScoringFilter plugin.
>> But this function doesn't have the anchor texts among its inputs...
>> I wrote some inelegant and inefficient code that puts an "ismarked" key
>> in the CrawlDatum to record whether the anchor text or URL contains
>> certain words... From which function should I do this?
>> Is there a complete diagram of the data path through each plugin type's
>> functions?
>>
>> Benjamin.
>>
>> Sent from my iPad
>>
>>> On Sep 10, 2014, at 04:02, Jorge Luis Betancourt Gonzalez
>>> <[email protected]> wrote:
>>>
>>> You’ll need to write a couple of plugins to accomplish this. Which
>>> version of Nutch are you using? In the first case, is the score you
>>> want to alter the score that’s indexed into Solr (i.e. your backend)?
>>>
>>> Regards,
>>>
>>>> On Sep 9, 2014, at 2:38 PM, Benjamin Derei <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm a beginner in Java and Nutch.
>>>>
>>>> I want to steer the crawl with two rules:
>>>> - if the language identifier plugin detects that a page is not "fr",
>>>>   the score for sorting should be divided by two.
>>>> - if an anchor text or link pointing to this page contains certain
>>>>   terms, the score for sorting should be multiplied by ten.
>>>>
>>>> Any help?
>>>>
>>>> Benjamin.
>>>
>>> Contest "Mi selfie por los 5". Details at
>>> http://justiciaparaloscinco.wordpress.com
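For reference, the two rules discussed in this thread can be sketched as plain arithmetic on top of the default `datum.getScore() * initSort` quoted above. This is only an illustration in stand-alone Java, not Nutch code: the class `SortScoreSketch` and the methods `isMarked` and `sortValue` are hypothetical names, and the language and anchor texts are assumed to have been obtained elsewhere and passed in as plain values. Leaving pages with an undetected language unpenalized is also an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch: none of these names are part of the Nutch API.
final class SortScoreSketch {

    /** True if the URL or any anchor text contains one of the terms (case-insensitive). */
    static boolean isMarked(String url, List<String> anchors, List<String> terms) {
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            String t = term.toLowerCase(Locale.ROOT);
            if (lowerUrl.contains(t)) {
                return true;
            }
            for (String anchor : anchors) {
                if (anchor.toLowerCase(Locale.ROOT).contains(t)) {
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * Starts from the default sort value (datumScore * initSort) and applies
     * the two rules from the thread. A null language means "not detected";
     * such pages are left unpenalized here, which is an assumption.
     */
    static float sortValue(float datumScore, float initSort, String lang, boolean marked) {
        float value = datumScore * initSort;
        if (lang != null && !"fr".equals(lang)) {
            value /= 2f;   // page detected as non-French: halve the score
        }
        if (marked) {
            value *= 10f;  // anchor text or URL matched a term: boost tenfold
        }
        return value;
    }

    public static void main(String[] args) {
        // Made-up example values, for illustration only.
        List<String> anchors = Arrays.asList("Accueil", "Recettes de cuisine");
        List<String> terms = Arrays.asList("recette");
        boolean marked = isMarked("http://example.com/page", anchors, terms);
        System.out.println(sortValue(1.0f, 1.0f, "en", marked));
    }
}
```

In a real ScoringFilter plugin the `marked` flag and the language would have to be carried in metadata, as discussed above; this sketch only shows how the score would be combined once they are available.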

