Hi Os, great! Looks like you found the reason.
> To fix, increase the /tmp partition, or,
> ... configure hadoop to write to another directory with: ...

If you run Nutch in local mode (not in a Hadoop cluster), assigning each
crawl job its own tmp dir has two advantages:
- you can run concurrent jobs
- it's easy to clean up the tmp dir after the job has run; failed jobs
  may leave a lot of data in the tmp dir

Sebastian

2013/8/7 Os Tyler <[email protected]>

> I believe I got to the bottom of this one.
>
> I think it was a simple disk space issue in /tmp, where hadoop writes its
> data by default. It's a little hard to catch because once bin/crawl exits,
> hadoop cleans up its data, so when you look at disk usage, /tmp looks
> like it has plenty of free space.
>
> The giveaway is in logs/hadoop.log, and the error there is:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> valid local directory for output/file.out
>
> Afterwards, the segment directory has a crawl_generate directory, but no
> others.
>
> To fix, increase the /tmp partition, or (repeating info from a thread I
> saw earlier this year) configure hadoop to write to another directory
> with:
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>${path/to/hadoop/temp}</value>
> </property>
>
> ________________________________________
> From: Os Tyler
> Sent: Tuesday, August 06, 2013 10:30 AM
> To: [email protected]
> Subject: RE: Fetch "Read time out" and crawl_parse "Input path does not
> exist"
>
> Thank you, Sebastian.
>
> If a segment is incomplete due to filling up the hard drive, should you
> delete that segment?
>
> The other segments I deleted had already been indexed. I have read in a
> few threads that once a segment is indexed it can be deleted; is that
> correct?
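Sebastian's per-job tmp dir suggestion could be sketched as a small wrapper
around bin/crawl. This is an assumption-laden sketch, not from the thread:
the paths and the bin/crawl arguments are placeholders, and it assumes your
bin/nutch script passes $NUTCH_OPTS through to the JVM (nutch-site.xml is
the alternative place to set hadoop.tmp.dir):

```shell
#!/bin/sh
# Sketch: give each local-mode crawl its own Hadoop tmp dir and always
# clean it up afterwards, even when the job fails.

# In production, point this at a partition with enough free space --
# the whole point is to get off a small /tmp. Path is a placeholder.
TMP_ROOT="${NUTCH_TMP_ROOT:-/tmp/nutch-tmp}"
mkdir -p "$TMP_ROOT"
JOB_TMP="$(mktemp -d "$TMP_ROOT/crawl-job.XXXXXX")"

# Assumption: bin/nutch appends $NUTCH_OPTS to the JVM options in local
# mode, so hadoop.tmp.dir can be overridden per invocation. Verify against
# your bin/nutch script before relying on this.
NUTCH_OPTS="-Dhadoop.tmp.dir=$JOB_TMP"
export NUTCH_OPTS

bin/crawl urls/ crawl/ 2    # placeholder seed dir, crawl dir, iterations
status=$?

# Remove the tmp dir even when the job failed -- failed jobs can leave
# a lot of data behind.
rm -rf "$JOB_TMP"
echo "crawl exited with status $status, cleaned $JOB_TMP"
```

Because each job gets a unique mktemp directory, concurrent crawls no
longer collide in the shared hadoop.tmp.dir default.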
>
> Exact error message for the "Read time out":
> fetch of http://redacted.com/Talk:JRM_Equipment failed with:
> java.net.SocketTimeoutException: Read timed out
> -finishing thread FetcherThread, activeThreads=8
>
> And ... there's no 'content' directory in the segment directory after
> bin/crawl exits with the error.
>
> ________________________________________
> From: Sebastian Nagel [[email protected]]
> Sent: Tuesday, August 06, 2013 10:00 AM
> To: [email protected]
> Subject: Re: Fetch "Read time out" and crawl_parse "Input path does not
> exist"
>
> Hi,
>
> > - To clear disk space I removed all segments
> And is the content already indexed by Solr?
> If not: why didn't you also remove the crawl db and link db?
> If segments are removed you have to fetch all pages again,
> no matter whether you start from the seeds or re-fetch URLs
> from the existing crawl db.
>
> > - Ever since, re-running bin/crawl fails at the fetch point with
> > multiple "Read time out" errors
> Can you send a concrete example (exact message)?
>
> > (The only directory in segments/xxxxxx is crawl_generate)
> There should also be a directory content/ which holds the raw page content.
>
> Sebastian
>
> On 08/06/2013 03:21 PM, Os Tyler wrote:
> > Thanks in advance for any help you can provide.
> >
> > Not sure exactly what's relevant here, but I have not been able to
> > complete a full bin/crawl since I had a "No space left on device" error.
> >
> > Using nutch-1.6.
> > - bin/crawl had been running as expected for 20+ iterations
> > - On one run, the disk ran out of space and threw the "No space left on
> > device" error.
> > - The db.fetch.interval.default is set at 80,000 (less than 24 hours)
> > - To clear disk space I removed all segments
> > - Ever since, re-running bin/crawl fails at the fetch point with
> > multiple "Read time out" errors
> > bin/crawl exits when it attempts 'crawl parse' because the crawl_parse,
> > crawl_data, etc. directories do not exist.
> > (The only directory in segments/xxxxxx is crawl_generate)
> >
> > What might be the solution?
> >
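The incomplete-segment symptom described above (a segments/xxxxxx directory
containing only crawl_generate) can be detected with a small script. The
crawl/segments path is a placeholder; the subdirectory list is the usual
layout of a fully fetched and parsed Nutch 1.x segment:

```shell
#!/bin/sh
# Sketch: flag segments that are missing the subdirectories a completed
# fetch/parse cycle would have produced, so they can be deleted before
# indexing.

check_segment() {
  seg="$1"
  missing=""
  for d in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
    [ -d "$seg/$d" ] || missing="$missing $d"
  done
  if [ -n "$missing" ]; then
    echo "$seg: incomplete, missing:$missing"
    return 1
  fi
  echo "$seg: complete"
}

# Placeholder crawl dir -- adjust to your layout.
for seg in crawl/segments/*; do
  check_segment "$seg"
done
```

An incomplete segment found this way can simply be removed, since (as
Sebastian notes) its pages will be re-fetched from the crawl db anyway.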

