Hi Yulio,

On Tue, May 3, 2016 at 7:53 AM, <[email protected]> wrote:

>
> From: Yulio Aleman Jimenez <[email protected]>
> To: [email protected]
> Cc:
> Date: Sun, 1 May 2016 17:57:30 -0400 (CDT)
> Subject: Re: [MASSMAIL]Re: Priorize links in Fetching Step
> Hi Lewis.
>
> Thanks for your answer, it was very helpful; but I believe that these plugins
> are used to schedule the refetching of URLs that have already been fetched
> and stored in the CrawlDB.
>

Correct.


>
> I need to prioritize the URLs discovered in crawls and stored in the LinkDB
> (for new crawls) using the extension of the resource, but before they are
> fetched and stored in the CrawlDB.


OK. It is worth stating that the LinkDB [0] is a data structure which
maintains an inverted link map, listing incoming links for each URL. It is
not generated until after an initial CrawlDB exists, and it can only be
generated from parsed segments. Essentially, even if a URL has NOT been
fetched, it is still present within the CrawlDB.
What you are referring to here requires the generation of a new tool (or
possibly an extension to an existing CrawlDB tool?) which takes as input the
CrawlDB and LinkDB (at minimum) and changes individual CrawlDatum scores
based on an analysis of each CrawlDatum's characteristics within the
inverted LinkDB.
Does this make sense? I hope I explained it clearly.

[0]
http://nutch.apache.org/apidocs/apidocs-1.11/index.html?org/apache/nutch/crawl/LinkDb.html
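As a toy, self-contained sketch of what such a tool would do (plain Java maps stand in for the CrawlDB and LinkDB here; the per-inlink bonus is purely illustrative, and a real tool would be a MapReduce job over the actual data structures):

```java
import java.util.*;

public class ScoreUpdateSketch {

  // Toy stand-ins: the CrawlDB as url -> score, the LinkDB as
  // url -> list of inlink URLs (the inverted link map).
  static Map<String, Float> updateScores(Map<String, Float> crawlDb,
                                         Map<String, List<String>> linkDb) {
    Map<String, Float> updated = new HashMap<>();
    for (Map.Entry<String, Float> e : crawlDb.entrySet()) {
      List<String> inlinks =
          linkDb.getOrDefault(e.getKey(), Collections.emptyList());
      // Example analysis only: award a small bonus per inlink. A real
      // tool would inspect the CrawlDatum's characteristics instead.
      updated.put(e.getKey(), e.getValue() + 0.1f * inlinks.size());
    }
    return updated;
  }

  public static void main(String[] args) {
    Map<String, Float> crawlDb = new HashMap<>();
    crawlDb.put("http://example.com/a", 1.0f);
    crawlDb.put("http://example.com/b", 1.0f);
    Map<String, List<String>> linkDb = new HashMap<>();
    linkDb.put("http://example.com/a",
        Arrays.asList("http://example.com/b", "http://example.com/c"));
    System.out.println(updateScores(crawlDb, linkDb));
  }
}
```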


> The MimeType of a resource is identified after it is fetched; this is the
> reason I believe the MimeAdaptiveFetchSchedule doesn't work in this
> case.
>

You've described your task better now, so yes, I agree.


>
> Imagine this process:
> 1- In the first crawl, the seed has 10 URLs of HTML web pages.
> 2- In this crawl, 100 new URLs are detected by Nutch. Of these, 30 URLs
> are images, according to the resource extension.
> 3- In the second crawl, Nutch is ready to fetch the 10 URLs of the seed,
> plus the other 100 URLs identified in the previous crawl. But, of all the
> URLs, Nutch is going to prioritize the 30 image URLs first, and then the
> rest of the URLs.
>
> With this strategy, Nutch will ensure that images are collected first and
> faster; it will also continue using the HTML web pages for the expansion
> method on the Web.
>
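The numbered strategy above can be sketched in plain, self-contained Java (no Nutch classes; the extension list and the image-first ordering rule are illustrative assumptions):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ImageFirstOrdering {

  // Illustrative set of image extensions; a real filter would make this
  // configurable.
  static final List<String> IMAGE_EXTENSIONS =
      Arrays.asList("jpg", "jpeg", "png", "gif");

  // Guess from the URL's extension whether it points at an image, since
  // the MimeType is unknown before the resource is fetched.
  static boolean looksLikeImage(String url) {
    int dot = url.lastIndexOf('.');
    if (dot < 0) return false;
    return IMAGE_EXTENSIONS.contains(url.substring(dot + 1).toLowerCase());
  }

  // Order a fetch list so URLs with image extensions come first; the
  // sort is stable, so relative order is otherwise preserved.
  static List<String> prioritize(List<String> urls) {
    return urls.stream()
        .sorted(Comparator.comparingInt(
            (String u) -> looksLikeImage(u) ? 0 : 1))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> fetchList = Arrays.asList(
        "http://example.com/page1.html",
        "http://example.com/logo.png",
        "http://example.com/page2.html",
        "http://example.com/photo.jpg");
    System.out.println(prioritize(fetchList));
  }
}
```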

Please again consider that the scoring you are suggesting, e.g. using the
LinkDB as above, will give you an indication of how to score a page based
upon the MimeType of its inlinks as identified within the LinkDB.


>
> I think that I may use the extension points of the ScoringFilters to write
> a plugin capable of filtering URLs by extension and changing their score
> to prioritize the new URLs in new crawls, as suits my needs.
>

Yes, that might be a much better idea. In all honesty, I don't think the
LinkDB is required for your particular scenario.


>
> Do you have any idea how I can do this? Or is there already a plugin
> capable of doing this?
>
>
Basically, you want to affect the scoring of CrawlDatums based upon their
MimeType.
If I were you, I would work off of the existing scoring-link plugin [1],
find the MimeType from parse.getData().getContentMeta(), and implement the
logic within the passScoreAfterParsing @Override.

[1]
https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
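As a rough illustration of that suggestion, here is a self-contained sketch of the boost logic (the Nutch Content/Parse types are replaced by a tiny stand-in metadata class so the sketch runs outside Nutch, and the boost factor is an assumption; a real plugin would implement the ScoringFilter interface and read the MimeType from the parse content metadata, as above):

```java
import java.util.HashMap;
import java.util.Map;

public class MimeTypeScoringSketch {

  // Stand-in for the parse content metadata that a real plugin would get
  // from parse.getData().getContentMeta().
  static class ContentMeta {
    private final Map<String, String> meta = new HashMap<>();
    void set(String key, String value) { meta.put(key, value); }
    String get(String key) { return meta.get(key); }
  }

  // Boost factor for image resources; a real plugin would read this from
  // its Configuration (the value 2.0 is purely illustrative).
  static final float IMAGE_BOOST = 2.0f;

  // Core of a passScoreAfterParsing-style hook: scale the current score
  // when the detected MimeType is an image, leave it unchanged otherwise.
  static float adjustScore(ContentMeta meta, float currentScore) {
    String mimeType = meta.get("Content-Type");
    if (mimeType != null && mimeType.startsWith("image/")) {
      return currentScore * IMAGE_BOOST;
    }
    return currentScore;
  }

  public static void main(String[] args) {
    ContentMeta image = new ContentMeta();
    image.set("Content-Type", "image/png");
    ContentMeta page = new ContentMeta();
    page.set("Content-Type", "text/html");

    System.out.println(adjustScore(image, 1.0f)); // image gets boosted
    System.out.println(adjustScore(page, 1.0f));  // HTML keeps its score
  }
}
```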
