I now get this error when doing crawls of 120k URLs per run:
2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: starting at 2014-05-04 11:56:44
2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: db: TestCrawl/crawldb
2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: segments: [TestCrawl/segments/20140504110143]
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: false
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: URL filtering: false
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2014-05-04 11:57:49,615 ERROR mapred.MapTask - IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
2014-05-04 11:58:36,732 WARN mapred.LocalJobRunner - job_local385844795_0001
java.lang.Exception: java.io.IOException: IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.io.IOException: IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000 at 55756800
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
    ... 10 more
2014-05-04 11:58:36,797 ERROR crawl.CrawlDb - CrawlDb update: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
    at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:207)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:166)
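
For what it's worth, in local mode a ChecksumException from FSInputChecker means the bytes of crawl_parse/part-00000 no longer match the hidden .part-00000.crc sidecar that Hadoop's LocalFileSystem keeps next to the data file. The failure can be reproduced outside the MapReduce job with a small standalone reader like the sketch below; it is only a diagnostic sketch, assuming the Hadoop 1.x SequenceFile API that Nutch 1.8 bundles, and the class name SegmentScan is just illustrative, not part of Nutch. It should hit the same exception near byte 55756800 if the file really is corrupt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical diagnostic, not a Nutch tool: scan a crawl_parse part file
// record by record and report where reading breaks.
public class SegmentScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. TestCrawl/segments/20140504110143/crawl_parse/part-00000
    Path part = new Path(args[0]);
    // For a file: URI this is LocalFileSystem, so .crc files are verified
    FileSystem fs = part.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text url = new Text();               // crawl_parse keys are URLs
    CrawlDatum datum = new CrawlDatum(); // values are CrawlDatum entries
    long records = 0;
    try {
      while (reader.next(url, datum)) {
        records++;
      }
      System.out.println("OK: read " + records + " records");
    } catch (Exception e) {
      // getPosition() reports the byte offset where the read failed,
      // which should line up with the offset in the trace above
      System.err.println("Failed after " + records + " records at byte "
          + reader.getPosition() + ": " + e);
    } finally {
      reader.close();
    }
  }
}

If a reader like that fails at the same offset, the segment data itself is damaged and every CrawlDb update that includes this segment will keep failing on it.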