Hi Christian,

The URLs don't actually matter; the same URLs can trigger it. Just try the
crawling operation once more, exactly as in the first run. The thing is, I am
not out of disk space (especially in tmp), and I can sometimes get it to
finish without problems this way (yes, I have some other problems, such as
redirection).

But once I get this error, re-running the crawl:
# bin/nutch crawl -dir crawlIntoDir urlsDir -depth 2 -threads 25
gives me the same error again.

One more thing: will you share your stats, like this:

$ bin/nutch readdb crawl-dir/crawldb -stats
CrawlDb statistics start: crawl-dir/crawldb
Statistics for CrawlDb: crawl-dir/crawldb
TOTAL urls: 956
retry 0: 956
min score: 0.0
avg score: 0.009015691
max score: 1.339
status 1 (db_unfetched): 790
status 2 (db_fetched): 126
status 4 (db_redir_temp): 19
status 5 (db_redir_perm): 21
CrawlDb statistics: done


and

$ bin/nutch readseg -list crawl-dir/segments/*
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20110730005815  3          2011-07-30T00:58:18  2011-07-30T00:58:18  3        3
20110730005828  163        2011-07-30T00:58:30  2011-07-30T01:05:32  201      123


When I got that error, the latter listing showed that one (or more) segments
had not finished properly. But now, as you can see, my segments look OK. What
about yours?
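For what it's worth, here is a rough sketch of how I check whether segments
finished properly, assuming the standard Nutch 1.x segment layout (a
complete, parsed segment contains crawl_fetch, crawl_parse, parse_data and
parse_text subdirectories); the crawl-dir path is just an example, substitute
your own:

```shell
# Report any segment that is missing one of the subdirectories a finished,
# parsed Nutch 1.x segment should contain. Assumes a local (non-HDFS) crawl.
check_segments() {
  for seg in "$1"/*; do
    [ -d "$seg" ] || continue
    for sub in crawl_fetch crawl_parse parse_data parse_text; do
      # A half-written segment (e.g. after a crash) typically lacks these.
      [ -d "$seg/$sub" ] || echo "incomplete segment: $seg (missing $sub)"
    done
  done
}

check_segments crawl-dir/segments
```

If this prints anything, that segment is the likely cause of the "Input path
does not exist" error on the next run.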

Dinçer


2011/8/1 Christian Weiske <[email protected]>

> Hello Dinçer,
>
>
> > > Somewhere during the crawling process I get an error that stops
> > > everything:
> > >
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> > > Exception in thread "main"
> > > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > > exist:
>
> > I have had the same problem in one of my instances. Let's dig together,
> > at least. I have tried to re-crawl the URL list into the same crawl
> > directory (crawl-301 in your case) and got the same error; will you
> > confirm for your case?
>
> How do you re-crawl the list? Is there a specific URL list in the
> segment?
>
> --
> Viele Grüße
> Christian Weiske
>
