Hi,

> - To clear disk space I removed all segments
And the content is already indexed by Solr?
If not: why didn't you also remove the crawl db and link db?
If the segments are removed you have to fetch all pages again,
no matter whether you start from the seeds or re-fetch the URLs
from the existing crawl db.
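
For reference, re-fetching from the existing crawl db with the lower-level tools looks roughly like this (a sketch only; the crawl/crawldb and crawl/segments paths are assumptions, adjust them to your layout):

```sh
# Generate a new fetch list from the existing crawl db
# (directory names are assumptions; adjust to your setup)
bin/nutch generate crawl/crawldb crawl/segments -topN 1000

# Pick up the newly created segment directory
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)

# Fetch, parse, and update the crawl db with the results
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
```

After a successful fetch the segment should contain content/, crawl_fetch/ and, after parsing, crawl_parse/, parse_data/ and parse_text/ alongside crawl_generate/.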

> - Ever since, re-running bin/crawl fails at the fetch point with multiple 
> "Read time out" errors
Can you send a concrete example (the exact message)?

> (The only directory in segments/xxxxxx is crawl_generate)
There should also be a content/ directory which holds the raw page content.

Sebastian

On 08/06/2013 03:21 PM, Os Tyler wrote:
> Thanks in advance for any help you can provide.
> 
> Not sure exactly what's relevant here, but I have not been able to complete a 
> full bin/crawl since I had a "No space left on device" error.
> 
> Using nutch-1.6.
> - bin/crawl had been running as expected for 20+ iterations
> - On one run, the disk ran out of space and threw the "No space left on 
> device" error.
> - The db.fetch.interval.default is set at 80,000 (less than 24 hours)
> - To clear disk space I removed all segments
> - Ever since, re-running bin/crawl fails at the fetch point with multiple 
> "Read time out" errors
> bin/crawl exits when it attempts 'crawl parse' because the crawl_parse, 
> crawl_data, etc. directories do not exist. (The only directory in 
> segments/xxxxxx is crawl_generate)
> 
> What might be the solution?
> 
