Hi Christian,

The URLs do not actually matter; the same URLs can trigger it. Just try to run the crawl operation once more, exactly as in the first run. The thing is, I am not out of disk space (especially for tmp), and I can sometimes complete it without problems this way (yes, I have some other issues, such as redirection).
But once I get this error, re-running the crawl with

# bin/nutch crawl -dir crawlIntoDir urlsDir -depth 2 -threads 25

gives the same error. One more thing: will you share your stats?

$ bin/nutch readdb crawl-dir/crawldb -stats
CrawlDb statistics start: crawl-dir/crawldb
Statistics for CrawlDb: crawl-dir/crawldb
TOTAL urls:     956
retry 0:        956
min score:      0.0
avg score:      0.009015691
max score:      1.339
status 1 (db_unfetched):        790
status 2 (db_fetched):  126
status 4 (db_redir_temp):       19
status 5 (db_redir_perm):       21
CrawlDb statistics: done

and

$ bin/nutch readseg -list crawl-dir/segments/*
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20110730005815  3          2011-07-30T00:58:18  2011-07-30T00:58:18  3        3
20110730005828  163        2011-07-30T00:58:30  2011-07-30T01:05:32  201      123

When I got that error, the latter listing showed that one (or more) of the segments had not finished cleanly. But right now my segments look fine. What about yours?

Dinçer

2011/8/1 Christian Weiske <[email protected]>

> Hello Dinçer,
>
>
> > > Somewhere during the crawling process I get an error that stops
> > > everything:
>
> > > file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> > > Exception in thread "main"
> > > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > > exist:
>
> > I have had same problem in one of my instances. Let's dig together, at
> > least. I have tried to re-crawl the url list into same crawl directory
> > (crawl-301 in your case) and got the same error, will you confirm for
> > your case?
>
> How do you re-crawl the list? Is there a specific URL list in the
> segment?
>
> --
> Best regards
> Christian Weiske
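P.S. A small sketch of how one might spot the unfinished segment before re-running the crawl. This assumes (it is not confirmed in this thread) that a Nutch 1.x segment interrupted mid-fetch lacks the crawl_fetch subdirectory that a completed segment contains; the function name and paths are hypothetical:

```shell
# Sketch (assumption: a completed Nutch 1.x segment directory contains
# a crawl_fetch subdirectory, while one interrupted mid-fetch does not).
# Prints the segment directories that look incomplete, so you can
# inspect/remove them before re-running bin/nutch crawl.
incomplete_segments() {
    segdir="$1"                      # e.g. crawl-301/segments (hypothetical path)
    for seg in "$segdir"/*/; do
        # No crawl_fetch subdir -> fetch likely did not finish
        [ -d "${seg}crawl_fetch" ] || echo "${seg%/}"
    done
}

# Hypothetical usage:
#   incomplete_segments crawl-301/segments
#   # then remove any listed segment and re-run the crawl command
```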

