So if I understand you correctly, I end up having to merge the crawldb, the segments and the indexes... right?
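(If merging does turn out to be necessary, the stock tools are CrawlDbMerger and SegmentMerger. The following is only a sketch with made-up paths (crawl/, urgent/); as the quoted reply below suggests, simply updating the bigger CrawlDB with the extra segment later may be all that is needed, and when everything is indexed into a single Solr instance there is normally nothing to merge on the index side.)

  # merge a temporary CrawlDB into the main one (the output path comes first)
  bin/nutch mergedb crawl/crawldb_merged crawl/crawldb urgent/crawldb

  # optionally merge small segments into a single output segment
  bin/nutch mergesegs crawl/segments_merged -dir crawl/segments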
Markus Jelsma-2 wrote:
> The quickest method now is to use the FreeGenerator tool, fetch the
> small segment, update a new temporary CrawlDB with that segment and
> run the indexer pointing to that temporary CrawlDB and segment. You
> can clear the temporary CrawlDB and update the bigger CrawlDB later
> with other segments in one go.
>
> On Tue, 10 Apr 2012 02:51:55 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>> Ok, now what about a scenario in which I want to add a new url?
>> Then I must run it through the whole nutch process, because it
>> should be inserted into the crawldb.
>> But how can I avoid the whole existing crawldb being reprocessed?
>>
>> Assume that I want to add a new url to the index and deploy it in
>> the search right away, without having to wait for all the urls in
>> the crawldb that need to be refetched?
>>
>> Markus Jelsma-2 wrote:
>>> On Tue, 10 Apr 2012 01:37:51 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>> You can just generate new segments or remove all directories
>>>>> from the segment except crawl_generate and start fetching again.
>>>>
>>>> But in such a case, the url is already present in the crawldb in
>>>> fetched status. So if I run crawling on this url, won't the file
>>>> be considered as "already fetched"?
>>>
>>> No, because you didn't update the CrawlDB with the failed segment,
>>> so the state remains the same as before generating the failing
>>> segment, assuming generate.update.crawldb is false.
>>>
>>>>> A segment cannot be half complete, it either fails or succeeds
>>>>> entirely.
>>>>
>>>> Well, doesn't that encourage us to put a small number of urls in
>>>> a segment, and thus make sure that if the segment fails because
>>>> of a single url, I'll have to re-crawl only a "reasonable" amount
>>>> of urls?
>>>
>>> Indeed, having fewer records per segment reduces the problem of a
>>> failed segment. But it would be best to avoid running out of
>>> memory in the first place. Parsing fetchers are prone to running
>>> out of memory, but with parsing disabled it's actually quite hard
>>> to run out of memory.
>>>
>>>> About the second question:
>>>> Assume that I run nutch on a huge amount of files. One file's
>>>> parsing fails, let's say because the file was corrupted. I fix
>>>> the file and now I want to re-crawl it.
>>>> If I just run crawling again, the file won't be fetched because
>>>> of the fetch interval.
>>>> I can try to run nutch on this file separately and then merge,
>>>> but that doesn't work as it should: while merging something goes
>>>> wrong and I find myself with a smaller index than I had before.
>>>> So the question is - how can I recrawl a single url while running
>>>> nutch crawl on an existing input?
>>>
>>> Ah, I see. Use the FreeGenerator tool to generate a segment from a
>>> plain text input file.
>>>
>>>> Any other insights on these issues will be appreciated.
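(For reference, the "quickest method" at the top of the quote above would look roughly like this on a Nutch 1.x install. This is only a sketch: the urgent/ and crawl/ paths, the url and the Solr URL are made up, and the exact solrindex arguments differ between Nutch versions.)

  # one url per line in a plain text file
  mkdir -p urgent/urls
  echo "http://www.example.com/fixed.html" > urgent/urls/seed.txt

  # generate a segment straight from the file, bypassing the CrawlDB
  bin/nutch freegen urgent/urls urgent/segments
  SEG=$(ls -d urgent/segments/* | tail -1)

  # fetch and parse the small segment
  bin/nutch fetch $SEG
  bin/nutch parse $SEG

  # update a *temporary* CrawlDB with just this segment and index it
  bin/nutch updatedb urgent/crawldb $SEG
  bin/nutch invertlinks urgent/linkdb $SEG
  # newer Nutch versions take -linkdb <dir> instead of a positional linkdb
  bin/nutch solrindex http://localhost:8983/solr/ urgent/crawldb urgent/linkdb $SEG

  # later, drop urgent/crawldb and fold the segment into the main
  # CrawlDB together with other segments in one go
  bin/nutch updatedb crawl/crawldb $SEG crawl/segments/2012*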
>>>> Markus Jelsma-2 wrote:
>>>>> hi,
>>>>>
>>>>> On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>>> Hi,
>>>>>> There are some scenarios of failure in nutch which I'm not sure
>>>>>> how to handle.
>>>>>>
>>>>>> 1. I run nutch on a huge amount of urls and some kind of OOM
>>>>>> exception is thrown, or one of those "cannot allocate memory"
>>>>>> errors. The result is that my segment is half complete.
>>>>>
>>>>> A segment cannot be half complete, it either fails or succeeds
>>>>> entirely.
>>>>>
>>>>>> How can I recover from this? Do I have to recrawl all the urls
>>>>>> that were in the segment?
>>>>>> If so, how do I mark them for recrawl in the crawldb?
>>>>>
>>>>> You can just generate new segments or remove all directories
>>>>> from the segment except crawl_generate and start fetching again.
>>>>>
>>>>>> 2. I run nutch on a huge amount of urls and some urls are not
>>>>>> parsed successfully.
>>>>>> I get an index which has all the urls that worked and doesn't
>>>>>> have the ones that didn't work.
>>>>>> How can I handle them without having to recrawl the whole thing?
>>>>>
>>>>> I don't understand.
>>>>>
>>>>>> Thanks.
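(And a sketch of the recovery suggested above for a segment whose fetch died half-way, e.g. with an OOM: keep only crawl_generate and fetch the same segment again. The segment name is made up, and this assumes generate.update.crawldb was left at its default of false, so the CrawlDB is still in its pre-fetch state.)

  SEG=crawl/segments/20120410123456    # the failed segment (hypothetical name)

  # drop whatever the failed run produced, keep only the fetch list
  rm -rf $SEG/crawl_fetch $SEG/content $SEG/crawl_parse $SEG/parse_data $SEG/parse_text

  # fetch and parse again, and only now update the CrawlDB
  bin/nutch fetch $SEG
  bin/nutch parse $SEG
  bin/nutch updatedb crawl/crawldb $SEG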

