Hi Marek,

As we're talking about automating the task, we're immediately looking at implementing a bash script. In the situation you have described, we wish Nutch to adopt breadth-first search (BFS) behaviour when crawling. Between us, can we suggest any methods for best practice relating to BFS?
As you have highlighted, we can check the crawldb after every updatedb command to determine whether there are any URLs with status db_unfetched, and ideally we wish to continue until this number reaches zero, which we can check when we either dump the stats or read them via stdout. I would suggest that we discuss a method for obtaining the db_unfetched value and creating a loop that terminates when it is 0. Is this possible?

On Wed, Jul 20, 2011 at 2:05 PM, Marek Bachmann <[email protected]> wrote:
> Hi all,
>
> has anyone a suggestion how I could solve the following task:
>
> I want to crawl a sub-domain of our network completely. I always did it by
> multiple fetch / parse / update cycles manually. After a few cycles I
> checked if there were unfetched pages in the crawldb. If so, I started the
> cycle over again. I repeated that until no new pages were discovered.
> But that is annoying me, and that is why I am looking for a way to do these
> steps automatically until no unfetched pages are left.
>
> Any ideas?

--
*Lewis*
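PS: a minimal sketch of the loop we discussed, assuming a Nutch 1.x layout (crawldb at crawl/crawldb, segments under crawl/segments; both paths are assumptions you would adjust). It parses the `status 1 (db_unfetched)` line from `bin/nutch readdb -stats` and repeats the generate/fetch/parse/updatedb cycle until that count is 0:

```shell
#!/usr/bin/env bash
# Sketch: repeat fetch cycles until the crawldb reports zero unfetched URLs.
# CRAWLDB and SEGMENTS_DIR are assumed paths -- adjust for your installation.

CRAWLDB=crawl/crawldb
SEGMENTS_DIR=crawl/segments

# Extract the db_unfetched count from `readdb -stats` output on stdin.
# The stats listing contains a line such as:
#   status 1 (db_unfetched):        1234
# Prints 0 if no such line is present (i.e. nothing left to fetch).
unfetched_count_from() {
  awk '/db_unfetched/ { print $NF; found=1 } END { if (!found) print 0 }'
}

# Only run the crawl loop if the Nutch launcher is actually present,
# so the functions above can be sourced and tested on their own.
if [ -x bin/nutch ]; then
  while true; do
    bin/nutch generate "$CRAWLDB" "$SEGMENTS_DIR"
    # Newest segment is the one generate just created (lexicographic sort
    # works because segment names are timestamps).
    SEGMENT=$(ls -d "$SEGMENTS_DIR"/* | sort | tail -1)

    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb "$CRAWLDB" "$SEGMENT"

    COUNT=$(bin/nutch readdb "$CRAWLDB" -stats | unfetched_count_from)
    echo "db_unfetched: $COUNT"
    [ "$COUNT" -eq 0 ] && break
  done
fi
```

One design note: parsing the stats output via stdout keeps this to plain bash; the alternative of dumping the crawldb and grepping the dump would work too, but is far slower on a large db.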

