Hi,

 

It seems the problem is solved now, although it looks like I cannot completely 
reproduce it under all circumstances. It has everything to do with the 
hadoop.tmp.dir setting and running multiple jobs on the local machine. Whenever 
I run a fetch job, it stores data in the tmp dir. If I, in the meantime, also 
run e.g. a readdb job, the fetch job's data in the tmp dir is lost, hence the 
error.

 

Maybe I could have known this if I had read more on Hadoop's behavior, but I 
haven't. It is also, in my case, a bit unexpected, as I assume processes do not 
mess around with other processes' tmp data.

 

So, don't run multiple jobs on the local machine using the same hadoop.tmp.dir 
setting.
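
For reference, one way to avoid the collision is to give each concurrently 
running job (or each Nutch working copy) its own tmp directory. A minimal 
sketch of the nutch-site.xml property — the path below is just an example, 
pick any directory on a drive with enough space:

```xml
<!-- nutch-site.xml: give this Nutch instance its own Hadoop tmp dir,
     so a second job running at the same time cannot clobber its spill
     files. The path is an example, not a required location. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/nutch/tmp-instance1</value>
</property>
```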

 

Cheers,
 
-----Original message-----
From: Markus Jelsma <markus.jel...@buyways.nl>
Sent: Fri 10-09-2010 15:52
To: user@nutch.apache.org; 
Subject: RE: Input path does not exist revisited

The first error in the sequence appears immediately after the fetcher 
finishes, before the content is parsed. 

 

2010-09-10 15:29:59,817 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2010-09-10 15:30:00,638 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)

 

I've still got no idea why, how, or when it happens. Disk space is not an 
issue and there is plenty of RAM.
 
-----Original message-----
From: Markus Jelsma <markus.jel...@buyways.nl>
Sent: Thu 09-09-2010 17:53
To: user@nutch.apache.org; 
Subject: Input path does not exist revisited

Hi,

Well, today it happened again. I had quite a large fetch list and in the end it 
all failed. I added a hadoop.tmp.dir setting to my nutch-site.xml file and 
pointed it to a large enough drive. After that, larger and larger fetch lists 
all went well, until a fetch list of about 20k pages finally failed for 
unclear reasons. Madness!

Can anyone try to explain what's really going on and why so many users suffer 
from this issue?

FYI: I'm still running Nutch locally. A Hadoop cluster isn't set up yet.

Cheers,

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
