Hi German,

The best way of doing this is to track the depth of each URL since injection and give priority to the lowest depths during the generation step. This is done by implementing a custom scoring filter, i.e. the interface ScoringFilter<http://nutch.apache.org/apidocs-1.2/org/apache/nutch/scoring/ScoringFilter.html>. Not trivial, but that would definitely work.
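To make the idea concrete, here is a rough sketch of such a filter, assuming a "_depth_" metadata key, a DepthScoringFilter class name and a 1/(depth+1) sort formula of my own choosing (they are not part of Nutch); the interface signatures are those of the 1.2 API linked above:

public class DepthScoringFilter implements ScoringFilter {

  // Metadata key used to carry the depth around; name is arbitrary.
  private static final Text DEPTH_KEY = new Text("_depth_");
  private static final String DEPTH_KEY_STR = "_depth_";
  // URLs with no depth info (e.g. entries predating the plugin) get a large default.
  private static final int DEFAULT_DEPTH = 1000;

  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  // Injected (seed) URLs start at depth 0.
  public void injectedScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    datum.getMetaData().put(DEPTH_KEY, new IntWritable(0));
  }

  public void initialScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    if (!datum.getMetaData().containsKey(DEPTH_KEY)) {
      datum.getMetaData().put(DEPTH_KEY, new IntWritable(DEFAULT_DEPTH));
    }
  }

  // The generator sorts by descending value, so the shallower the URL,
  // the higher its sort value and the sooner it gets fetched.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    return initSort / (getDepth(datum) + 1.0f);
  }

  // Carry the depth along with the fetched content so it is still
  // available when the outlinks are extracted.
  public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
      throws ScoringFilterException {
    content.getMetadata().set(DEPTH_KEY_STR, Integer.toString(getDepth(datum)));
  }

  public void passScoreAfterParsing(Text url, Content content, Parse parse)
      throws ScoringFilterException {
    String d = content.getMetadata().get(DEPTH_KEY_STR);
    if (d != null) parse.getData().getContentMeta().set(DEPTH_KEY_STR, d);
  }

  // Outlinks sit one level deeper than the page they were found on.
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
      throws ScoringFilterException {
    int depth = DEFAULT_DEPTH;
    String d = parseData.getContentMeta().get(DEPTH_KEY_STR);
    if (d != null) depth = Integer.parseInt(d);
    for (Entry<Text, CrawlDatum> target : targets) {
      target.getValue().getMetaData().put(DEPTH_KEY, new IntWritable(depth + 1));
    }
    return adjust;
  }

  // When a URL is rediscovered, keep the smallest depth seen so far.
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) throws ScoringFilterException {
    int best = old != null ? getDepth(old) : getDepth(datum);
    for (CrawlDatum in : inlinked) {
      best = Math.min(best, getDepth(in));
    }
    datum.getMetaData().put(DEPTH_KEY, new IntWritable(best));
  }

  // Depth does not affect the indexing score.
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    return initScore;
  }

  private int getDepth(CrawlDatum datum) {
    IntWritable d = (IntWritable) datum.getMetaData().get(DEPTH_KEY);
    return d == null ? DEFAULT_DEPTH : d.get();
  }
}

(Imports omitted: org.apache.hadoop.io.Text/IntWritable, org.apache.hadoop.conf.Configuration, the org.apache.nutch.crawl/parse/protocol/indexer/scoring classes used above, java.util.*.) You would then package the class as a plugin with a plugin.xml declaring an extension of org.apache.nutch.scoring.ScoringFilter and add the plugin id to plugin.includes in nutch-site.xml.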
HTH

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 11 December 2010 19:55, Germán Biozzoli <[email protected]> wrote:
> Hi people
>
> I'm a semi-newbie to Nutch and starting to use it on a production site. My
> problem comes when I define the URL seed list with almost 20 URLs. I also
> need the crawler to explore all of them and discover new ones. All of this
> is working OK, I mean the crawler obtains new URLs and explores them. I
> defined the depth as 20 and use topN with 1000. My problem is that every
> crawl process reaches a time limit or a number of URLs and never completes
> the seed list. Because of this, the current index has a lot of newly
> discovered URLs, but the core URL sites remain incomplete. I repeat the
> crawl and of course it takes the uncompleted links from the previous
> level... but it never gets to the top and completes my seed list.
>
> I need someone to suggest a way to do the type of crawl that I need.
> Perhaps the best would be to complete the seed list before going further
> with discovered URLs; is there a way to do so? Another approach could be
> to use two Nutch instances, one for the seed list, to complete those
> sites, and another to go for external URLs, then merge the segments and
> invert.
>
> Could someone give me a tip about this?
> Thanks a lot
> Germán
>

