Hi,
It seems the problem is solved now, although I cannot completely reproduce it under all circumstances. It has everything to do with the hadoop.tmp.dir setting and running multiple jobs on the local machine. Whenever I run a fetch job, it stores data in the tmp dir. If I also run e.g. a readdb job in the meanwhile, the fetch job's data in the tmp dir is lost, hence the error. Maybe I could have known this if I had read more on Hadoop's behavior, but I haven't. It is also, in my case, a bit unexpected, as I assume processes do not mess around with other processes' tmp data.

So: don't run multiple jobs on the local machine using the same hadoop.tmp.dir setting.

Cheers,

-----Original message-----
From: Markus Jelsma <markus.jel...@buyways.nl>
Sent: Fri 10-09-2010 15:52
To: user@nutch.apache.org
Subject: RE: Input path does not exist revisited

The first error in the sequence comes immediately when the fetcher is ready, before parsing the content:

2010-09-10 15:29:59,817 WARN mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2010-09-10 15:30:00,638 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)

I've still got no idea why it happens, how it happens, or when it happens. Disk space is not an issue and there is plenty of RAM.

-----Original message-----
From: Markus Jelsma <markus.jel...@buyways.nl>
Sent: Thu 09-09-2010 17:53
To: user@nutch.apache.org
Subject: Input path does not exist revisited

Hi,

Well, today it happened again. I had quite a large fetch list and finally it all failed. I added a hadoop.tmp.dir setting to my nutch-site.xml file and pointed it to a large enough drive. Later, larger and larger fetch lists all went well, until a fetch list of about 20k pages finally failed for unclear reasons. Madness! Can anyone try to explain what's really going on and why so many users suffer from this issue?

FYI: I'm still running Nutch locally; a Hadoop cluster isn't set up yet.

Cheers,

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
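For reference, the workaround described above amounts to giving each concurrent local job its own hadoop.tmp.dir. A minimal sketch of the nutch-site.xml property (the property name is the real Hadoop setting; the path /data/hadoop-tmp is just an example, substitute your own):

```xml
<!-- nutch-site.xml: point Hadoop's local temp/spill directory at a
     dedicated location. In local mode, two jobs sharing the same
     hadoop.tmp.dir can clobber each other's spill files, producing the
     DiskErrorException ("Could not find ... spill0.out") shown above.
     /data/hadoop-tmp is an example path. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>
```

Since the Nutch commands go through ToolRunner (visible in the stack trace), it should also be possible to override the value per invocation with Hadoop's generic -D option (e.g. -Dhadoop.tmp.dir=... placed before the command's own arguments), so that each concurrently running job gets a distinct directory without editing nutch-site.xml.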