hi,

On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "[email protected]" <[email protected]> wrote:
Hi
There are some failure scenarios in Nutch that I'm not sure how to handle.

1. I run Nutch on a huge number of URLs and some kind of OOM exception is thrown, or one of those "cannot allocate memory" errors. The result is that my segment is half complete.

A segment cannot be half complete; it either fails or succeeds entirely.

How can I recover from this? Do I have to recrawl all the URLs that were in the segment?
If so, how do I mark them for recrawl in the crawldb?

You can either generate a new segment, or remove all directories from the failed segment except crawl_generate and start fetching it again.
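A minimal sketch of the second option. The segment path is hypothetical, and the subdirectories created here just simulate what a partially fetched segment might contain; the point is that only crawl_generate survives, so the fetcher can start over on the same segment:

```shell
# Hypothetical segment path; simulate a half-fetched segment layout.
SEG=crawl/segments/20120409224344
mkdir -p "$SEG"/crawl_generate "$SEG"/crawl_fetch "$SEG"/content

# Keep only crawl_generate; remove everything else from the segment.
for d in "$SEG"/*; do
  [ "$(basename "$d")" != "crawl_generate" ] && rm -r "$d"
done
ls "$SEG"    # only crawl_generate remains

# Then rerun the fetch step on that same segment:
# bin/nutch fetch "$SEG"
```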


2. I run Nutch on a huge number of URLs and some URLs are not parsed successfully.
I get an index that has all the URLs that worked and doesn't have the ones that didn't.
How can I handle them without having to recrawl the whole thing?

I don't understand.





Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html
Sent from the Nutch - User mailing list archive at Nabble.com.
