Your understanding is absolutely correct, but if we go this way the crawl will take as much time as crawling everything without a filter. The reason we want to do this is to save crawling time: we decided to crawl the important URLs separately from the others. We would have two Nutch instances running on independent machines; one would crawl everything, and the other would crawl only URLs matching some regex and would run more frequently.
Yes, it looks like this is possible only at indexing time. Maybe we will drop the idea of indexing specific URLs and just crawl everything.

Thanks,
Manish

> On Feb 2, 2016, at 3:16 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
> Hi Manish,
>
> On Fri, Jan 29, 2016 at 10:20 PM, <[email protected]> wrote:
>
>> I am using Nutch 1.10 and we are planning to crawl just some URLs which
>> match some pattern.
>> The problem is we cannot do it using regex-urlfilter.txt, as this way the
>> seeds themselves would be rejected.
>>
>> For e.g. the seed is apple.com <http://apple.com/> and we want to crawl just
>> URLs which have /mac/ in the URL string. Maybe we have to filter the URLs at
>> Generate or Fetch time.
>> Any thoughts? Can we customize the Generate or Fetch phases?
>>
>
> You 'kind of' hit a chicken-and-egg problem here.
> You need to fetch URLs in order to cover more of the Webgraph, however you
> are trying to completely restrict your crawl to those which contain the
> appropriate regex expression(s).
> I would suggest that you probably DO want to fetch a fair number of initial
> seeds in your initial exploration. For example maybe 5 rounds... or a depth
> of 5.
> You can always filter what is being indexed, this may be what you are
> trying to achieve?
> Lewis
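For anyone finding this thread later, a rough sketch of the kind of regex-urlfilter.txt discussed above. This is illustrative only, not tested: the host (apple.com) and path pattern (/mac/) come from the example in the thread, and the exact rules would need tuning. It also shows exactly the chicken-and-egg problem Lewis describes, because intermediate pages that merely link to /mac/ pages are rejected and the crawl cannot discover them:

```
# Hypothetical regex-urlfilter.txt sketch for the apple.com /mac/ example.
# Rules are evaluated top to bottom; first match wins.

# Accept the seed itself so the crawl can start at all
+^https?://(www\.)?apple\.com/?$

# Accept only URLs containing /mac/ in the path
+^https?://(www\.)?apple\.com/.*/mac/

# Reject everything else (including the intermediate pages that
# link to /mac/ content -- this is the chicken-and-egg problem)
-.
```

If the eventual approach is instead to crawl broadly and restrict only what gets indexed (as Lewis suggests), Nutch 1.x's indexing job can apply the configured URL filters at index time; checking `bin/nutch index` usage for a filter option in the exact 1.10 release would be the way to confirm.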

