Hi people, I'm fairly new to Nutch and starting to use it on a production site. My problem comes when I define the URL seed list with almost 20 URLs. I also need the crawler to explore all of them and discover new ones. All of this is working OK, I mean the crawler obtains new URLs and explores them. I set the depth to 20 and topN to 1000. My problem is that every crawl process reaches a time or URL limit and never completes the seed list. Because of this, the current index has a lot of newly discovered URLs, but the core URL sites remain incomplete. I repeat the crawl and of course it picks up the uncompleted links from the previous level... but it can never get to the top and complete my seed list.
I'd appreciate it if someone could suggest a way to do the type of crawl I need. Perhaps the best option would be to complete the seed list before going further with discovered URLs; is there a way to do so? Another approach could be to use two Nutch instances: one for the seed list, to complete those sites, and another to go after external URLs, and after that merge the segments and invert the links. Could someone give me a tip about this? Thanks a lot, Germán
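For the two-instance idea, here is a rough sketch of what I have in mind, assuming the Nutch 1.x command-line tools (the seed file and the directory names `crawl-seed`, `crawl-external`, and `crawl-merged` are just placeholders, not anything from my actual setup):

```shell
# Crawl 1: seed sites only. The idea would be to restrict this instance
# to the seed hosts (e.g. via its regex-urlfilter.txt) so it can finish
# the core sites without being diluted by external links.
bin/nutch crawl seeds/core-urls -dir crawl-seed -depth 20 -topN 1000

# Crawl 2: a second instance that is allowed to follow discovered
# external URLs.
bin/nutch crawl seeds/core-urls -dir crawl-external -depth 20 -topN 1000

# Then merge the segments from both crawls into one directory...
bin/nutch mergesegs crawl-merged/segments \
    crawl-seed/segments/* crawl-external/segments/*

# ...and rebuild the link database from the merged segments.
bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments
```

Is this roughly the right way to combine the two crawls, or is there a simpler built-in way to prioritize the seed list?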

