Hi. You need to configure your nutch-site.xml: set the option db.ignore.external.links to true to limit the crawl to the initially injected hosts.
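For reference, the property goes in conf/nutch-site.xml, something like the sketch below (the property name is from the stock nutch-default.xml; the description text here is paraphrased):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  are ignored; this limits the crawl to the initially injected hosts
  without writing URLFilter rules.</description>
</property>

Note that this only drops outlinks pointing off-host; it does not cap crawl depth. To stop at the seed pages plus their direct outlinks, combine it with a small -depth value (with the 1.x crawl command, -depth 2 should fetch the seeds and then their outlinks).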
_____________________________________________________________________
Ing. Eyeris Rodriguez Rueda
Phone: 837-3370
Universidad de las Ciencias Informáticas
_____________________________________________________________________

-----Original Message-----
From: Shameema Umer [mailto:[email protected]]
Sent: Monday, July 02, 2012 6:21 AM
To: [email protected]
Subject: Re: How to restrict nutch to crawl only seed urls and links contained in the seed pages

I had already tried this. But when we restrict the depth to 1, the crawler will not even crawl http://www.abc.com/category/apple, because the link depth of that URL is 3.

Any other suggestion?

On Mon, Jul 2, 2012 at 3:12 PM, shekhar sharma <[email protected]> wrote:

> I think you need to specify the depth parameter as 1:
>
>   bin/nutch crawl seedDir -dir crawl -depth 1
>
> It will crawl only the seed links given. And if you want to see the
> outlinks from each seed, you can read the segments.
> Is this what you are looking for?
>
> Regards,
> Som
>
> On Mon, Jul 2, 2012 at 1:38 PM, Shameema Umer <[email protected]> wrote:
>
> > Hi there,
> >
> > How can I restrict Nutch to crawling only the seed URLs and the links
> > contained in the seed pages?
> >
> > For example, if seed.txt contains:
> >
> >   http://www.abc.com/category/apple
> >   http://www.abc.com/category/orange
> >
> > I need to parse http://www.abc.com/category/apple and
> > http://www.abc.com/category/orange and the toUrls collected from
> > these pages. Please help.
> >
> > Thanks,
> > Shameema

