Hi.
You need to configure your nutch-site.xml, specifically the option
db.ignore.external.links: change it to true to limit the crawl to
the initially injected hosts.
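
For example, a minimal sketch of the relevant nutch-site.xml entry (the
description text is paraphrased from the standard Nutch configuration):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  are ignored, so the crawl stays within the initially injected hosts.
  </description>
</property>

Note that this restricts the crawl by host, not by page: it keeps the
crawler on www.abc.com, but it does not by itself limit the crawl to
the seed pages and their direct outlinks.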

_____________________________________________________________________
Ing. Eyeris Rodriguez Rueda
Telephone: 837-3370
Universidad de las Ciencias Informáticas
_____________________________________________________________________

-----Original Message-----
From: Shameema Umer [mailto:[email protected]]
Sent: Monday, July 02, 2012 6:21 AM
To: [email protected]
Subject: Re: How to restrict nutch to crawl only seed urls and links
contained in the seed pages

I had already tried this. But when we restrict the depth to 1, the crawler
will not even crawl http://www.abc.com/category/apple, because the link
depth of that URL is 3.

Any other suggestion?



On Mon, Jul 2, 2012 at 3:12 PM, shekhar sharma <[email protected]> wrote:

> I think you need to specify the depth parameter as 1.
>
> bin/nutch crawl seedDir -dir crawl -depth 1
>
> It will crawl only the seed links given. And if you want to see the
> outlinks from each seed, you can read the segments, as shown below.
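> For example, with the segment reader (the segment timestamp here is
> hypothetical; substitute the actual directory created by your crawl):
>
> bin/nutch readseg -dump crawl/segments/20120702121500 segdump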
> Is this what you are looking for?
>
> Regards,
> Som
>
> On Mon, Jul 2, 2012 at 1:38 PM, Shameema Umer <[email protected]> wrote:
>
> > Hi there,
> >
> > How do I restrict Nutch to crawl only the seed URLs and the links
> > contained in the seed pages?
> >
> > For example.
> > If seed.txt contains:
> >
> > http://www.abc.com/category/apple
> > http://www.abc.com/category/orange
> >
> > I need to parse http://www.abc.com/category/apple and 
> > http://www.abc.com/category/orange and the toUrls collected from 
> > these pages. Please help.
> >
> > Thanks
> > Shameema
> >
>

