Re: Filter Urls Only At Generation Time Or Fetch Time

Lewis John Mcgibbney Tue, 02 Feb 2016 15:17:09 -0800

Hi Manish,

On Fri, Jan 29, 2016 at 10:20 PM, <[email protected]> wrote:


> I am using Nutch 1.10 and we are planning to crawl only URLs that
> match a certain pattern.
> The problem is that we cannot do this with regex-urlfilter.txt, because
> the seeds themselves would be rejected.
>
> For example, the seed is apple.com <http://apple.com/> and we want to crawl
> only URLs that have /mac/ in the URL string. Maybe we have to filter the
> URLs at generate or fetch time.
> Any thoughts? Can we customize the Generate or Fetch phases?
>
>
You have 'kind of' hit a chicken-and-egg problem here.
You need to fetch URLs in order to cover more of the web graph, yet you
are trying to restrict your crawl entirely to URLs that match the
appropriate regular expression(s).
I would suggest that you probably DO want to fetch a fair number of pages
from your initial seeds during the initial exploration, for example 5 rounds,
or a depth of 5.
You can always filter what is being indexed; this may be what you are
trying to achieve?
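
To make the seed problem concrete, here is a rough sketch in Python (not Nutch
code) of first-match-wins rule evaluation in the +pattern / -pattern style
that regex-urlfilter.txt uses, as far as I understand it. The rule list, the
accepts() helper and the test URLs are all hypothetical, made up for
illustration. It shows that an explicit rule for the seed can sit alongside a
/mac/ rule, so the seed itself is not rejected.

```python
import re

# Hypothetical rules, written in the "+pattern" / "-pattern" spirit of
# regex-urlfilter.txt. The first rule whose pattern is found in the URL decides.
RULES = [
    ("+", r"^https?://(www\.)?apple\.com/?$"),         # keep the seed itself
    ("+", r"^https?://(www\.)?apple\.com/(.*/)?mac/"),  # keep URLs with a /mac/ path segment
    ("-", r"."),                                        # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an include ('+') rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is dropped.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://apple.com/",
        "https://www.apple.com/mac/",
        "http://apple.com/iphone/",
    ):
        print(candidate, accepts(candidate))
```

Note that rules like these would only follow /mac/ links found directly on
pages that were themselves accepted, which is the chicken-and-egg limitation
described above. Crawling a few rounds more broadly and filtering at indexing
time avoids it.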
Lewis
