Hi Yulio,

On Tue, May 3, 2016 at 7:53 AM, <[email protected]> wrote:
> From: Yulio Aleman Jimenez <[email protected]>
> To: [email protected]
> Cc:
> Date: Sun, 1 May 2016 17:57:30 -0400 (CDT)
> Subject: Re: [MASSMAIL]Re: Priorize links in Fetching Step
>
> Hi Lewis.
>
> Thanks for your answer, it was very helpful; but I believe that these
> plugins are used to schedule the refetching of URLs that have already
> been fetched and stored in the CrawlDB.

Correct.

> I need to prioritize the URLs discovered in crawls and stored in the
> LinkDB (for new crawls) using the extension of the resource, but before
> they are fetched and stored in the CrawlDB.

OK. It is worth stating that the LinkDB [0] is a data structure which maintains an inverted link map, listing the incoming links for each URL. It is not generated until after an initial CrawlDB exists, and it can only be generated from parsed segments. Essentially, even if a URL has NOT been fetched, it is still present within the CrawlDB.

What you are referring to here would require a new tool (or possibly an extension to an existing CrawlDB tool?) which takes as input the CrawlDB and LinkDB (at minimum) and changes individual CrawlDatum scores based on an analysis of each CrawlDatum's characteristics within the inverted LinkDB.

Does this make sense? I hope I explained it clearly.

[0] http://nutch.apache.org/apidocs/apidocs-1.11/index.html?org/apache/nutch/crawl/LinkDb.html

> The MimeType of a resource is identified after it is fetched; this is
> the reason why I believe the MimeAdaptiveFetchSchedule doesn't work in
> this case.

You've described your task better now, so yes, I agree.

> Imagine this process:
> 1- In the first crawl, the seed has 10 URLs of HTML web pages.
> 2- In this crawl, 100 new URLs are discovered by Nutch. Of these, 30
> URLs are images, according to the resource extension.
> 3- In the second crawl, Nutch is ready to fetch the 10 URLs of the
> seed, and the other 100 URLs identified in the previous crawl.
> But, of all these URLs, Nutch should prioritize the 30 image URLs
> first, and the rest of the URLs afterwards.
>
> With this strategy, Nutch will ensure the collection of images first
> and faster; it will also continue using the HTML web pages for the
> expansion method on the Web.

Please again consider that the scoring you are suggesting, e.g. using the LinkDB as above, will give you an indication of how to score a page based upon the MimeType of its inlinks as identified within the LinkDB.

> I think that I could use the extension points of the ScoringFilters to
> write a plugin capable of filtering the URLs by extension and changing
> their score, to prioritize the new URLs in new crawls as convenient.

Yes, that might be a much better idea. In all honesty, I don't think using the LinkDB is required for your particular scenario.

> Do you have any idea how I can do this? Or is there already a plugin
> capable of doing this?

Basically, you want to affect the scoring of CrawlDatums based upon their MimeType. If I were you, I would work off of the existing scoring-link plugin [1], find the MimeType from parse.getData().getContentMeta(), and implement it within the passScoreAfterParsing @Override.

[1] https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
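For what it is worth, the core decision such a plugin would make could be sketched as below. This is only an illustration of the extension-based idea, not a working Nutch plugin: the class name, method names, and boost factor are all made up, and a real implementation would sit inside a ScoringFilter as discussed above (e.g. in the passScoreAfterParsing override, reading the MimeType from parse.getData().getContentMeta() once content is available).

```java
// Sketch only: the extension-based scoring decision described above.
// Class name, method names, and the boost factor are hypothetical;
// in a real Nutch ScoringFilter this logic would run where outlink
// scores are assigned.
public class MimeTypeScoring {

    // Hypothetical boost so image URLs are generated/fetched first.
    static final float IMAGE_BOOST = 10.0f;

    /** Crude check: does the URL path end in a common image extension? */
    static boolean looksLikeImage(String url) {
        String path = url.toLowerCase();
        int q = path.indexOf('?');
        if (q >= 0) {
            path = path.substring(0, q); // ignore any query string
        }
        return path.endsWith(".jpg") || path.endsWith(".jpeg")
            || path.endsWith(".png") || path.endsWith(".gif");
    }

    /** Boost the score of outlinks that look like images. */
    static float scoreOutlink(String url, float baseScore) {
        return looksLikeImage(url) ? baseScore * IMAGE_BOOST : baseScore;
    }
}
```

With a boost like this, the Generator would select the image URLs ahead of the other unfetched URLs in the next fetch cycle, which is the ordering you described.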

