And what about the index? After updating the crawldb, do I then run the indexer on both segments and the updated crawldb?
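For reference, the whole single-URL sequence being discussed might look roughly like this on Nutch 1.x. This is a sketch only: the new_urls/ directory, the crawl/ paths and the Solr URL are placeholders, and the exact indexer arguments differ between Nutch versions.

    # Sketch only: new_urls/, crawl/ and the Solr URL are placeholders.
    # 1. Put the new or fixed URL(s) in a plain text file.
    mkdir -p new_urls
    echo "http://www.example.com/fixed-page.html" > new_urls/seed.txt

    # 2. Generate a segment directly from that file, bypassing the CrawlDB.
    bin/nutch freegen new_urls crawl/segments

    # 3. Pick up the segment freegen just created (the newest directory).
    SEGMENT=$(ls -d crawl/segments/* | sort | tail -1)

    # 4. Fetch and parse only that segment.
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT

    # 5. Fold the results into the CrawlDB and (optionally) the LinkDb.
    bin/nutch updatedb crawl/crawldb $SEGMENT
    bin/nutch invertlinks crawl/linkdb $SEGMENT

    # 6. Index the updated CrawlDB together with just the new segment;
    #    previously indexed segments do not need to be re-indexed.
    #    (Older 1.x releases take the linkdb positionally; newer ones use -linkdb.)
    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb $SEGMENT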
remi tassing wrote:
> I don't think so!
>
> freegen will generate a new segment and you don't need to merge it with the others.
> Then you can (fetch and) parse the content from that new segment.
> Finally you just need to update your crawldb (with updatedb).
>
> Remi
>
> On Tue, Apr 10, 2012 at 6:01 PM, nutch.buddy@ <nutch.buddy@> wrote:
>> So if I understand you correctly, I end up having to merge the crawldb, the segments and the indexes, right?
>>
>> Markus Jelsma-2 wrote:
>>> The quickest method now is to use the FreeGenerator tool, fetch the small segment, update a new temporary CrawlDB with that segment and run the indexer pointing to that temporary CrawlDB and segment. You can clear the temporary CrawlDB and update the bigger CrawlDB later with other segments in one go.
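A rough sketch of that temporary-CrawlDB shortcut, assuming the new segment has already been generated, fetched and parsed as in the earlier sketch. Again, crawl/crawldb_tmp, crawl/linkdb and the Solr URL are placeholders, and the indexer arguments depend on the Nutch version.

    # Sketch only: assumes $SEGMENT points at the freshly fetched and parsed segment.

    # Update a *temporary* CrawlDB with just this segment; updatedb creates it
    # if it does not exist yet, so the big CrawlDB is left untouched.
    bin/nutch updatedb crawl/crawldb_tmp $SEGMENT

    # Index immediately against the temporary CrawlDB and the new segment.
    # (Older 1.x releases take the linkdb positionally; newer ones use -linkdb.)
    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb_tmp crawl/linkdb $SEGMENT

    # Later, drop the temporary CrawlDB and update the main one with all
    # pending segments in a single pass.
    rm -r crawl/crawldb_tmp
    bin/nutch updatedb crawl/crawldb -dir crawl/segments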
>>> On Tue, 10 Apr 2012 02:51:55 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>> Ok, now what about a scenario in which I want to add a new url?
>>>> Then I must run it through the whole nutch process, because it has to be inserted into the crawldb.
>>>> But how can I avoid reprocessing the whole existing crawldb?
>>>>
>>>> Assume that I want to add a new url to the index and have it available in search right away, without having to wait for all the urls in the crawldb that are due to be refetched.
>>>>
>>>> Markus Jelsma-2 wrote:
>>>>> On Tue, 10 Apr 2012 01:37:51 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>>>> You can just generate new segments or remove all directories from the segment except crawl_generate and start fetching again.
>>>>>>
>>>>>> But in such a case, the url is already present in the crawldb in fetched status. So if I run crawling on these urls, won't the file be considered as "already fetched"?
>>>>>
>>>>> No, because you didn't update the CrawlDB with the failed segment, so the state remains the same as before generating the failing segment, assuming generate.update.crawldb is false.
>>>>>
>>>>>>> A segment cannot be half complete, it either fails or succeeds entirely.
>>>>>>
>>>>>> Well, doesn't that encourage us to put a small number of urls in a segment, and thus make sure that if the segment fails because of a single url, I'll have to re-crawl only a "reasonable" amount of urls?
>>>>>
>>>>> Indeed, having fewer records per segment reduces the problem of a failed segment. But it would be best to avoid running out of memory in the first place. Parsing fetchers are prone to running out of memory, but with parsing disabled it's actually quite hard to run out of memory.
>>>>>
>>>>>> About the second question: assume that I run nutch on a huge number of files and one file's parsing fails, let's say because the file was corrupted. I fix the file and now I want to re-crawl it. If I just run crawling again, the file won't be fetched because of the fetch interval. I can try running nutch on this file separately and then merging, but that doesn't work as it should: while merging something goes wrong and I find myself with a smaller index than I had before. So the question is: how can I recrawl a single url while running nutch crawl on an existing input?
>>>>>
>>>>> Ah, I see. Use the FreeGenerator tool to generate a segment from a plain text input file.
>>>>>
>>>>>> Any other insights on these issues will be appreciated.
>>>>>>
>>>>>> Markus Jelsma-2 wrote:
>>>>>>> hi,
>>>>>>>
>>>>>>> On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>>>>> Hi,
>>>>>>>> There are some scenarios of failure in nutch which I'm not sure how to handle.
>>>>>>>>
>>>>>>>> 1. I run nutch on a huge number of urls and some kind of OOM exception is thrown, or one of those "cannot allocate memory" errors. The result is that my segment is half complete.
>>>>>>>
>>>>>>> A segment cannot be half complete, it either fails or succeeds entirely.
>>>>>>>
>>>>>>>> How can I recover from this? Do I have to recrawl all the urls that were in the segment? If so, how do I mark them for recrawl in the crawldb?
>>>>>>>
>>>>>>> You can just generate new segments or remove all directories from the segment except crawl_generate and start fetching again.
>>>>>>>
>>>>>>>> 2. I run nutch on a huge number of urls and some urls are not parsed successfully. I get an index which has all the urls that worked and doesn't have the ones that didn't work. How can I handle them without having to recrawl the whole thing?
>>>>>>>
>>>>>>> I don't understand.
>>>>>>>
>>>>>>>> Thanks.
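The recovery Markus suggests above for a segment whose fetch died part-way (keep crawl_generate, throw away the partial output, fetch again) might look roughly like this; the segment path below is a placeholder for the failed segment.

    # Sketch only: the segment path is a placeholder.
    SEGMENT=crawl/segments/20120410103000

    # Drop the partial output; keep only crawl_generate.
    rm -rf $SEGMENT/crawl_fetch $SEGMENT/content \
           $SEGMENT/crawl_parse $SEGMENT/parse_data $SEGMENT/parse_text

    # Fetch and parse the same segment again. Because the CrawlDB was never
    # updated with the failed segment (and generate.update.crawldb defaults
    # to false), the URLs are still in their previous state.
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT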

