Can you tell me what command you are running?

On Mon, Jul 2, 2012 at 4:51 PM, Shameema Umer <[email protected]> wrote:

> I had already tried this. But when we restrict the depth to 1, the crawler
> will not even crawl http://www.abc.com/category/apple, because the URL link
> depth for it is 3.
>
> Any other suggestion?
>
>
>
> On Mon, Jul 2, 2012 at 3:12 PM, shekhar sharma <[email protected]> wrote:
>
> > I think you need to specify the depth parameter as 1.
> >
> > bin/nutch crawl seedDir -dir crawl -depth 1
> >
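> > As a minimal sketch (assuming a Nutch 1.x bin/nutch crawl setup; the
> > -topN value is just a placeholder), the fuller form might look like:
> >
> >   bin/nutch crawl seedDir -dir crawl -depth 1 -topN 1000
> >
> > Here -depth is the number of generate/fetch/parse rounds, so -depth 1
> > fetches and parses only the injected seed URLs.
> >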
> > It will crawl only the seed links given. And if you want to see the
> > outlinks from each seed, you can read the segments.
> > Is this what you are looking for?
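> >
> > (For reference, a minimal sketch of dumping a segment to look at the
> > outlinks; the timestamped segment name below is just a placeholder for
> > whatever directory your crawl produced:
> >
> >   bin/nutch readseg -dump crawl/segments/20120702120000 segdump -nocontent -nofetch -nogenerate -noparsetext
> >
> > The dump file then shows the ParseData for each fetched page, including
> > the toUrl entries of its outlinks.)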
> >
> > Regards,
> > Som
> >
> > On Mon, Jul 2, 2012 at 1:38 PM, Shameema Umer <[email protected]> wrote:
> >
> > > Hi there,
> > >
> > > How do I restrict Nutch to crawl only the seed URLs and the links
> > > contained in the seed pages?
> > >
> > > For example, if seed.txt contains:
> > >
> > > http://www.abc.com/category/apple
> > > http://www.abc.com/category/orange
> > >
> > > I need to parse http://www.abc.com/category/apple and
> > > http://www.abc.com/category/orange and the toUrls collected from these
> > > pages. Please help.
> > >
> > > Thanks
> > > Shameema
> > >
> >
>
