On Tue, 10 Apr 2012 01:37:51 -0700 (PDT), "[email protected]" <[email protected]> wrote:
You can just generate new segments or remove all directories from the
segment except crawl_generate and start fetching again.
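For example (a minimal sketch assuming a local Nutch 1.x layout; the segment timestamp below is made up), you would drop the partial output and keep only the fetch list:

    # inside the failed segment, keep only crawl_generate;
    # remove whichever of these partial output dirs exist
    rm -r crawl/segments/20120410123456/{crawl_fetch,content,crawl_parse,parse_data,parse_text}

    # then fetch the same segment again
    bin/nutch fetch crawl/segments/20120410123456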

But in that case the URL is already present in the CrawlDB with "fetched" status, so if I run the crawl again on these URLs, won't they be considered "already fetched"?

No, because you didn't update the CrawlDB with the failed segment, so the state remains the same as before the failed segment was generated, assuming generate.update.crawldb is false.
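(For reference, a minimal override in conf/nutch-site.xml; the property is defined in nutch-default.xml and false is the default:)

    <!-- when false, generating a segment leaves the CrawlDB untouched,
         so a failed segment can simply be regenerated or re-fetched -->
    <property>
      <name>generate.update.crawldb</name>
      <value>false</value>
    </property>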


A segment cannot be half complete; it either fails or succeeds entirely.

Well, doesn't that encourage us to put a small number of URLs in each segment, so that if a segment fails because of a single URL, only a "reasonable" number of URLs has to be re-crawled?

Indeed, having fewer records per segment reduces the impact of a failed segment. But it would be best to avoid running out of memory in the first place. Parsing fetchers are prone to running out of memory, but with parsing disabled it's actually quite hard to run out of memory.
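To illustrate (a sketch; the segment path is hypothetical), fetching with parsing disabled and parsing as a separate step looks like this:

    <!-- conf/nutch-site.xml: don't parse while fetching -->
    <property>
      <name>fetcher.parse</name>
      <value>false</value>
    </property>

and then:

    bin/nutch fetch crawl/segments/20120410123456    # fetch only
    bin/nutch parse crawl/segments/20120410123456    # parse in its own job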


About the second question:
Assume I run Nutch on a huge number of files and one file's parsing fails, let's say because the file was corrupted. I fix the file and now I want to re-crawl it.
If I just run the crawl again, the file won't be fetched because of the fetch interval.
I can try running Nutch on this file separately and then merging, but that doesn't work as it should: something goes wrong while merging and I end up with a smaller index than I had before.
So the question is: how can I re-crawl a single URL while running a Nutch crawl on an existing input?

Ah, I see. Use the FreeGenerator tool to generate a segment from a plain-text input file.
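Something like this (a sketch; the file and segment names are made up, and the usual steps follow as with any other segment):

    # one URL per line in a plain text file
    mkdir -p refetch
    echo "http://www.example.com/fixed-file.pdf" > refetch/urls.txt

    # generate a segment directly from that file, bypassing the
    # CrawlDB and its fetch interval
    bin/nutch freegen refetch crawl/segments

    # then fetch, parse and update the CrawlDB as usual
    bin/nutch fetch crawl/segments/20120410134501
    bin/nutch parse crawl/segments/20120410134501
    bin/nutch updatedb crawl/crawldb crawl/segments/20120410134501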


Any other insights on these issues would be appreciated.



Markus Jelsma-2 wrote:

Hi,

On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:

Hi,
There are some scenarios of failure in Nutch which I'm not sure how to handle.

1. I run Nutch on a huge number of URLs and some kind of OOM exception is thrown, or one of those "cannot allocate memory" errors. The result is that my segment is half complete.

A segment cannot be half complete; it either fails or succeeds entirely.

How can I recover from this? Do I have to re-crawl all the URLs that were in the segment? If so, how do I mark them for re-crawl in the CrawlDB?

You can just generate new segments, or remove all directories from the segment except crawl_generate and start fetching again.

2. I run Nutch on a huge number of URLs and some URLs are not parsed successfully. I get an index which has all the URLs that worked and doesn't have the ones that didn't work. How can I handle them without having to re-crawl the whole thing?

I don't understand.

Thanks.

--
View this message in context:


http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html
Sent from the Nutch - User mailing list archive at Nabble.com.

hi,

 On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@"
 <nutch.buddy@> wrote:
Hi
There are some scenarios of failure in nutch which I'm not sure how
to
handle.

1. I run nutch on a huge amount of urls and some kind of OOM
exception if
thrown, or one of those "cannot allocate memory". The result is that
my
segment is half complete.

 A segment cannot be halve complete, it either fails or succeeds
 entirely.

How can I recover from this? Do I have to recrawl all the urls that
were in
the segment?
If so, how do I mark them for recrawl in crawldb?

You can just generate new segments or remove all directories from the
 segment except crawl_generate and start fetching again.


2. I run nutch on a huge amount of urls and some urls are not parsed
sucessfully.
I get an index which has all the urls that worked and doesnt have the
ones
that didnt work.
How can I handle them without having to recrawl the whole thing?

 I don't understand.





Thanks.

--
View this message in context:


http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html
Sent from the Nutch - User mailing list archive at Nabble.com.


hi,

 On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@"
 <nutch.buddy@> wrote:
Hi
There are some scenarios of failure in nutch which I'm not sure how
to
handle.

1. I run nutch on a huge amount of urls and some kind of OOM
exception if
thrown, or one of those "cannot allocate memory". The result is that
my
segment is half complete.

 A segment cannot be halve complete, it either fails or succeeds
 entirely.

How can I recover from this? Do I have to recrawl all the urls that
were in
the segment?
If so, how do I mark them for recrawl in crawldb?

You can just generate new segments or remove all directories from the
 segment except crawl_generate and start fetching again.


2. I run nutch on a huge amount of urls and some urls are not parsed
sucessfully.
I get an index which has all the urls that worked and doesnt have the
ones
that didnt work.
How can I handle them without having to recrawl the whole thing?

 I don't understand.





Thanks.

--
View this message in context:


http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html
Sent from the Nutch - User mailing list archive at Nabble.com.


hi,

 On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@"
 <nutch.buddy@> wrote:
Hi
There are some scenarios of failure in nutch which I'm not sure how
to
handle.

1. I run nutch on a huge amount of urls and some kind of OOM
exception if
thrown, or one of those "cannot allocate memory". The result is that
my
segment is half complete.

 A segment cannot be halve complete, it either fails or succeeds
 entirely.

How can I recover from this? Do I have to recrawl all the urls that
were in
the segment?
If so, how do I mark them for recrawl in crawldb?

You can just generate new segments or remove all directories from the
 segment except crawl_generate and start fetching again.


2. I run nutch on a huge amount of urls and some urls are not parsed
sucessfully.
I get an index which has all the urls that worked and doesnt have the
ones
that didnt work.
How can I handle them without having to recrawl the whole thing?

 I don't understand.





Thanks.

--
View this message in context:


http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3898768.html
Sent from the Nutch - User mailing list archive at Nabble.com.


--
View this message in context:

http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3899044.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--

Reply via email to