I have already tried this, but when we restrict the depth to 1, the crawler will not even crawl http://www.abc.com/category/apple, because the link depth of that URL is 3.
Any other suggestion?

On Mon, Jul 2, 2012 at 3:12 PM, shekhar sharma <[email protected]> wrote:
> I think you need to specify the depth parameter as 1.
>
> bin/nutch crawl seedDir -dir crawl -depth 1
>
> It will crawl only the seed links given. And if you want to see the out
> links from each seed you can read the segments.
> Is this what you are looking for?
>
> Regards,
> Som
>
> On Mon, Jul 2, 2012 at 1:38 PM, Shameema Umer <[email protected]> wrote:
> >
> > Hi there,
> >
> > How to restrict nutch to crawl only seed urls and links contained in the
> > seed pages.
> >
> > For example, if seed.txt contains:
> >
> > http://www.abc.com/category/apple
> > http://www.abc.com/category/orange
> >
> > I need to parse http://www.abc.com/category/apple and
> > http://www.abc.com/category/orange and the toUrls collected from these
> > pages. Please help.
> >
> > Thanks
> > Shameema
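For reference, here is a sketch of the two steps the quoted reply describes, assuming Nutch 1.x; the segment timestamp and output directory names are placeholders, not real paths:

```shell
# Crawl only the seed URLs (a single round). Outlinks found on the seed
# pages are recorded in the segment but not fetched at depth 1.
bin/nutch crawl seedDir -dir crawl -depth 1

# Dump a segment to inspect the outlinks collected from each seed page.
# Outlinks are stored in parse_data, so suppress the other segment parts.
# "20120702000000" is a placeholder -- use an actual directory name found
# under crawl/segments/.
bin/nutch readseg -dump crawl/segments/20120702000000 segdump \
    -nocontent -nofetch -nogenerate -noparse -noparsetext

# readseg writes a plain-text file named "dump" into the output directory;
# each page's record lists its outlinks there.
less segdump/dump
```

This only makes the outlinks visible; actually fetching them as well would take a second round (depth 2), which is where the depth question in this thread comes in.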

