Hey!
I've been trying out Nutch 1.0 and I'm hitting an intermittent issue exactly like
this one: https://issues.apache.org/jira/browse/NUTCH-503
I can crawl certain websites without any problems, but for others I end
up with this error (with both v1.0 and v1.1):
hardy8:~/devel/sware/dump/apache-nutch-1.1-bin$ bin/nutch crawl urls -dir
crawl -depth 100 -topN 999999999 -threads 50
crawl started in: crawl
rootUrlDir = urls
threads = 50
depth = 100
indexer=lucene
topN = 999999999
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 999999999
Generator: jobtracker is 'local', generating exactly one partition.
*Generator: 0 records selected for fetching, exiting ...*
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
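Since the output says to check the seed list and URL filters, one quick way to see
whether a seed URL even survives the configured filters and normalizers would be
something like the little class below. This is just my own rough sketch around the
Nutch URLFilters / URLNormalizers / NutchConfiguration classes (it would need
Nutch's conf/ directory and jars on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

public class SeedCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();   // picks up nutch-default.xml / nutch-site.xml
    URLNormalizers normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_INJECT);
    URLFilters filters = new URLFilters(conf);

    String url = args[0];                                // seed URL to test
    String normalized = normalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
    String filtered = filters.filter(normalized);        // null means some filter rejected it

    System.out.println("normalized: " + normalized);
    System.out.println("filtered:   " + (filtered == null ? "REJECTED" : filtered));
  }
}

Given that the same setup works for some sites and not others, I doubt the filters
are the real problem, but a check like this would at least rule them out.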
Debugging java/org/apache/nutch/crawl/Generator.java revealed that
readers.length was 1 (which is correct, since I was crawling only one URL),
but the readers[num].next(new FloatWritable()) condition in the snippet below
never evaluated to true, even though it should have:
// check that we selected at least some entries ...
SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(job, tempDir);
boolean empty = true;
if (readers != null && readers.length > 0) {
  for (int num = 0; num < readers.length; num++) {
    if (readers[num].next(new FloatWritable())) {
      empty = false;
      break;
    }
  }
}
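To poke at this outside the job, I could imagine a small standalone reader over the
Generator's temp output that counts the entries and prints the actual key class
(which seems relevant, since the loop above passes a FloatWritable to next()). The
sketch below is my own, using only the standard Hadoop SequenceFile API; the path
argument would be whatever tempDir the Generator wrote to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpGenerateTemp {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);            // local FS in my non-HDFS setup
    Path dir = new Path(args[0]);                    // the Generator's tempDir

    FileStatus[] parts = fs.listStatus(dir);
    if (parts == null) {
      System.out.println("no such directory: " + dir);
      return;
    }

    long total = 0;
    for (FileStatus status : parts) {
      Path file = status.getPath();
      if (!file.getName().startsWith("part-")) continue;   // skip _logs etc.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      System.out.println(file + ": key class = " + reader.getKeyClass().getName());
      while (reader.next(key, value)) {                     // read with the file's own key/value types
        total++;
      }
      reader.close();
    }
    System.out.println("total entries: " + total);
  }
}

If something like that showed entries in the temp output even when the job logs
"0 records selected", it would point at the emptiness check rather than at my
seeds or filters.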
I'm kind of stuck; has anyone else run into this?
Gaurav
PS: I'm running Nutch out of the box on a single machine and am not using
HDFS.