Hi Germán, I think you've already picked up on the best ways to do what you want to do. I'd probably limit your first crawl to just your 20 sites using crawl-urlfilter.txt, and then (if you want external sites) move outside of that with a larger, unbounded crawl. The other thing you could do is increase your topN - in my experience, 1000 is pretty tiny. A depth of 20 suggests to me that your goal is to move outside of your 20 initial sites; if you assume the number of URLs grows by a factor of roughly 20 at each level, by level 3 you're already at 8,000 (20^3) URLs - you've outstripped your topN of 1000 by leaps and bounds before you're even a quarter of the way through a depth-20 crawl.
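For example, restricting the first crawl to your seed domains in conf/crawl-urlfilter.txt might look roughly like this (example1.com and example2.com are stand-ins for your real sites, and exact syntax can vary a bit by Nutch version):

    # conf/crawl-urlfilter.txt -- accept only the seed domains
    +^http://([a-z0-9]*\.)*example1\.com/
    +^http://([a-z0-9]*\.)*example2\.com/
    # ... one +^ line per seed domain ...
    # reject everything else
    -.

and the one-step crawl with a larger topN would be along the lines of:

    bin/nutch crawl urls -dir crawl-seeds -depth 20 -topN 50000

where urls/ holds your seed list and crawl-seeds is just a placeholder output directory.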
Hope this helps,
Rob

-----Original Message-----
From: Germán Biozzoli [mailto:[email protected]]
Sent: Saturday, December 11, 2010 11:55 AM
To: [email protected]
Subject: Difficult crawling

Hi people,

I'm fairly new to Nutch and am starting to use it on a production site. My problem comes when I define the URL seed list with almost 20 URLs. I also need the crawler to explore all of them and discover new ones. That part works: the crawler finds new URLs and explores them. I set the depth to 20 and use topN with 1000. The problem is that every crawl run hits a time limit or a URL count and never finishes the seed list, so the current index contains a lot of newly discovered URLs while the core seed sites remain incomplete. If I repeat the crawl, it of course picks up the incomplete links from the previous level, but it never manages to finish my seed list.

Could someone suggest a way to do the kind of crawl I need? Perhaps the best approach would be to complete the seed list before going further with discovered URLs - is there a way to do that? Another approach might be to run two Nutch instances, one for the seed list, to crawl those sites completely, and another for the external URLs, then merge the segments and invert the links. Could someone give me a tip about this?

Thanks a lot
Germán
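(For reference, a rough sketch of the two-crawl-and-merge approach described above, assuming Nutch 1.x command-line tools; all directory names are placeholders and exact options vary by version:)

    # Crawl 1: seed sites only (crawl-urlfilter.txt restricted to the 20 seed domains)
    bin/nutch crawl urls -dir crawl-seeds -depth 20 -topN 50000

    # Crawl 2: wider crawl that is allowed to follow external links
    bin/nutch crawl urls -dir crawl-external -depth 10 -topN 100000

    # Merge the fetched segments from both crawls, then rebuild the link database
    bin/nutch mergesegs crawl-merged/segments crawl-seeds/segments/* crawl-external/segments/*
    bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments

After invertlinks, the merged segments and linkdb can be indexed as usual.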

