Hi Os,

Great! Looks like you found the cause.

> To fix, increase the /tmp partition, or,
> ... configure hadoop to write to another directory with: ...
If you run Nutch in local mode (not in a Hadoop cluster),
assigning each crawl job its own tmp dir has two advantages:
- you can run concurrent jobs
- it's easy to clean up the tmp dir after the job has run;
  failed jobs may leave a lot of data in the tmp dir
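A minimal sketch of that per-job setup (the parent path and crawl arguments are placeholder assumptions; in local mode bin/nutch passes NUTCH_OPTS to the JVM, so hadoop.tmp.dir can be overridden per job):

```shell
# Parent dir on a partition with enough space (example path).
TMP_PARENT="${TMP_PARENT:-$HOME/nutch-tmp}"
mkdir -p "$TMP_PARENT"

# Private tmp dir for this crawl job, so concurrent jobs don't collide.
JOB_TMP=$(mktemp -d "$TMP_PARENT/crawl-job.XXXXXX")

# Override hadoop.tmp.dir for this job only.
export NUTCH_OPTS="-Dhadoop.tmp.dir=$JOB_TMP"

# bin/crawl urls/ crawldir/ 2   # run the crawl as usual (placeholder args)

# Clean up afterwards; failed jobs may leave a lot of data here.
rm -rf "$JOB_TMP"
```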

Sebastian

2013/8/7 Os Tyler <[email protected]>

> I believe I got to the bottom of this one.
>
> I think it was a simple disk space issue in /tmp, where hadoop writes its
> data by default. It's a little hard to catch because once bin/crawl exits,
> hadoop cleans up its data, so when you look at disk usage afterwards, /tmp
> looks like it has plenty of free space.
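Since the temporary files are gone by the time the job exits, the shortage only shows up while the crawl is running. A quick one-shot check one could run during a fetch (not from the thread, just a sketch):

```shell
# Report how full the partition holding /tmp is, using portable df output.
# Column 5 of `df -P` is the capacity, e.g. "45%"; strip the percent sign.
USED=$(df -P /tmp | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
echo "/tmp is ${USED}% full"
```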
>
> The giveaway is in logs/hadoop.log and the error there is:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> valid local directory for output/file.out
>
> Afterwards, the segment directory has a crawl_generate directory, but no
> others.
>
> To fix, increase the /tmp partition, or (repeating info from a thread I
> saw earlier this year) configure hadoop to write to another directory
> with:
> <property>
> <name>hadoop.tmp.dir</name>
> <value>${path/to/hadoop/temp}</value>
> </property>
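In local mode that property would typically go into conf/nutch-site.xml; a sketch with a placeholder value (the path and description are assumptions, not from the thread):

```xml
<!-- conf/nutch-site.xml; the value is a placeholder path -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
  <description>Base directory for Hadoop's local temporary files.</description>
</property>
```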
>
> ________________________________________
> From: Os Tyler
> Sent: Tuesday, August 06, 2013 10:30 AM
> To: [email protected]
> Subject: RE: Fetch "Read time out" and crawl_parse "Input path does not
> exist"
>
> Thank you, Sebastian.
>
> If a segment is incomplete due to filling up the hard drive, should you
> delete that segment?
>
> The other segments I deleted had already been indexed. I have read in a
> few threads that once a segment is indexed it can be deleted, is that
> correct?
>
> Exact error message for the "Read time out":
> fetch of http://redacted.com/Talk:JRM_Equipment failed with:
> java.net.SocketTimeoutException: Read timed out
> -finishing thread FetcherThread, activeThreads=8
>
> And ... there's no 'content' directory in the segment directory after
> bin/crawl exits with error.
>
> ________________________________________
> From: Sebastian Nagel [[email protected]]
> Sent: Tuesday, August 06, 2013 10:00 AM
> To: [email protected]
> Subject: Re: Fetch "Read time out" and crawl_parse "Input path does not
> exist"
>
> Hi,
>
> > - To clear disk space I removed all segments
> And the content is already indexed by Solr?
> If not: why didn't you also remove the crawl db and link db?
> If the segments are removed you have to fetch all pages again,
> no matter whether you start from the seeds or re-fetch URLs
> from the existing crawl db.
>
> > - Ever since, re-running bin/crawl fails at the fetch point with
> multiple "Read time out" errors
> Can you send a concrete example (the exact message)?
>
> > (The only directory in segments/xxxxxx is crawl_generate)
> There should also be a content/ directory, which holds the raw page content.
>
> Sebastian
>
> On 08/06/2013 03:21 PM, Os Tyler wrote:
> > Thanks in advance for any help you can provide.
> >
> > Not sure exactly what's relevant here, but I have not been able to
> complete a full bin/crawl since I had a "No space left on device" error.
> >
> > Using nutch-1.6.
> > - bin/crawl had been running as expected for 20+ iterations
> > - On one run, the disk ran out of space and threw the "No space left on
> device" error.
> > - The db.fetch.interval.default is set at 80,000 (less than 24 hours)
> > - To clear disk space I removed all segments
> > - Ever since, re-running bin/crawl fails at the fetch point with
> multiple "Read time out" errors
> > bin/crawl exits when it attempts 'crawl parse' because the crawl_parse,
> crawl_data, etc. directories do not exist. (The only directory in
> segments/xxxxxx is crawl_generate)
> >
> > What might be the solution?
> >
>
>
