> You can just generate new segments or remove all directories from the
> segment except crawl_generate and start fetching again.
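If I understand the suggestion correctly, the recovery would look roughly
like this (the segment path is made up, and I'm assuming a local
filesystem; on HDFS it would be "hadoop fs -rmr" instead of rm):

  SEGMENT=crawl/segments/20120410123456

  # keep only the fetch list that generate produced, drop the partial output
  for dir in crawl_fetch content crawl_parse parse_data parse_text; do
    rm -rf "$SEGMENT/$dir"
  done

  # re-run fetch/parse against the surviving crawl_generate
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"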
But in that case the URL is already present in the crawldb with "fetched"
status. If I run the crawl on these URLs again, won't they be considered
"already fetched"?

> A segment cannot be half complete, it either fails or succeeds
> entirely.

Doesn't that encourage us to keep the number of URLs per segment small, so
that if a segment fails because of a single URL, only a "reasonable" number
of URLs has to be re-crawled?

About the second question: assume I run Nutch on a huge number of files and
one file's parsing fails, say because the file was corrupted. I fix the
file and now want to re-crawl it. If I simply run the crawl again, the file
won't be fetched because of the fetch interval. I can try running Nutch on
this file separately and then merging, but that doesn't work as it should:
something goes wrong during the merge and I end up with a smaller index
than I had before.

So the question is: how can I recrawl a single URL while running a Nutch
crawl on an existing input? My current best guess is sketched below.

Any other insights on these issues would be appreciated.
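For the record, here is that guess, pieced together from the tools I can
see in the distribution; the URL, paths and day count are invented, and I'm
not sure either route is the intended one:

  # check what the crawldb currently thinks about the URL
  bin/nutch readdb crawl/crawldb -url http://example.com/fixed-file.pdf

  # option 1: build a one-off fetch list with freegen, which (as far as I
  # can tell) ignores crawldb status entirely
  mkdir -p urls-recrawl
  echo "http://example.com/fixed-file.pdf" > urls-recrawl/seed.txt
  bin/nutch freegen urls-recrawl crawl/segments

  SEGMENT=$(ls -d crawl/segments/* | tail -1)   # the segment freegen created
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"

  # option 2: shift the generator's clock forward so it re-selects URLs
  # whose fetch interval has not yet elapsed -- but this affects every due
  # URL, not just the one I fixed
  bin/nutch generate crawl/crawldb crawl/segments -adddays 30

freegen looks attractive because it bypasses the crawldb, but I don't know
how updatedb and a later indexing run will treat the result, which is why
I'm asking.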
Markus Jelsma-2 wrote:
> hi,
>
> On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@>
> wrote:
>> Hi
>> There are some scenarios of failure in Nutch which I'm not sure how to
>> handle.
>>
>> 1. I run Nutch on a huge number of URLs and some kind of OOM exception
>> is thrown, or one of those "cannot allocate memory" errors. The result
>> is that my segment is half complete.
>
> A segment cannot be half complete, it either fails or succeeds
> entirely.
>
>> How can I recover from this? Do I have to recrawl all the URLs that
>> were in the segment?
>> If so, how do I mark them for recrawl in the crawldb?
>
> You can just generate new segments or remove all directories from the
> segment except crawl_generate and start fetching again.
>
>> 2. I run Nutch on a huge number of URLs and some URLs are not parsed
>> successfully.
>> I get an index which has all the URLs that worked and doesn't have the
>> ones that didn't.
>> How can I handle them without having to recrawl the whole thing?
>
> I don't understand.
>
>> Thanks.

--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3899044.html
Sent from the Nutch - User mailing list archive at Nabble.com.

