I keep failing at the indexing stage.
After running freegenerator, fetch, parse and updatedb, I run linkdb on
both segments - the old one and the new one.
Then I run index on all the updated folders, including both segment
folders, and I get an index that is smaller than the original index I had.

I then try running merge on this newly created smaller index and the
original one, and the result of the merge is only about the size of the
smaller one.

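In case it helps, the exact commands I run are roughly the following
(legacy Lucene indexing; all paths are placeholders and the arguments may
differ between Nutch versions):

  bin/nutch freegen urls/ crawl/segments
  bin/nutch fetch crawl/segments/<new-segment>
  bin/nutch parse crawl/segments/<new-segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<new-segment>
  bin/nutch invertlinks crawl/linkdb crawl/segments/<old-segment> \
    crawl/segments/<new-segment>
  bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb \
    crawl/segments/<old-segment> crawl/segments/<new-segment>
  bin/nutch merge crawl/index-merged crawl/index-old crawl/indexes-new
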
Any ideas?
Thanks.



[email protected] wrote
> 
> And what about the index?
> After updating the crawldb, do I then run the indexer on both segments
> and the updated crawldb?
> 
> 
> remi tassing wrote
>> 
>> I don't think so!
>> 
>> freegen will generate a new segment and you don't need to merge it
>> with the others.
>> 
>> Then you can (fetch and) parse the content from that new segment.
>> 
>> Finally you just need to update your crawldb (with updatedb).
>> 
>> Remi
>> 
>> On Tue, Apr 10, 2012 at 6:01 PM, nutch.buddy@ <nutch.buddy@> wrote:
>> 
>>> So if I understand you correctly, I end up having to merge the
>>> crawldb, the segments and the indexes...
>>> right?
>>>
>>>
>>>
>>> Markus Jelsma-2 wrote
>>> >
>>> > The quickest method now is to use the FreeGenerator tool, fetch the
>>> >  small segment, update a new temporary CrawlDB with that segment and
>>> >  run the indexer pointing to that temporary CrawlDB and segment. You
>>> >  can clear the temporary CrawlDB and update the bigger CrawlDB later
>>> >  with other segments in one go.
>>> >
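>>> >  Roughly, with placeholder paths (the exact arguments vary between
>>> >  Nutch versions): after freegen/fetch/parse of the small segment,
>>> >
>>> >    bin/nutch updatedb crawl/tmp_crawldb crawl/segments/<new-segment>
>>> >    bin/nutch index crawl/indexes-new crawl/tmp_crawldb crawl/linkdb \
>>> >      crawl/segments/<new-segment>
>>> >
>>> >  and later, with other segments in one go:
>>> >
>>> >    bin/nutch updatedb crawl/crawldb -dir crawl/segments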
>>> >
>>> >  On Tue, 10 Apr 2012 02:51:55 -0700 (PDT), "nutch.buddy@"
>>> >  <nutch.buddy@> wrote:
>>> >> Ok, now what about a scenario in which I want to add a new url?
>>> >> Then I must run it through the whole nutch process, because it
>>> >> should be inserted into the crawldb.
>>> >> But how can I avoid having the whole existing crawldb reprocessed?
>>> >>
>>> >> Assume that I want to add a new url to the index and make it
>>> >> available in the search right away, without having to wait for all
>>> >> the urls in the crawldb that are due to be refetched.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Markus Jelsma-2 wrote
>>> >>>
>>> >>> On Tue, 10 Apr 2012 01:37:51 -0700 (PDT), "nutch.buddy@"
>>> >>>  <nutch.buddy@> wrote:
>>> >>>>> You can just generate new segments or remove all directories from
>>> >>>>> the
>>> >>>>> segment except crawl_generate and start fetching again.
>>> >>>>
>>> >>>> But in that case the url is already present in the crawldb with
>>> >>>> fetched status, so if I run crawling on this url, won't the file
>>> >>>> be considered "already fetched"?
>>> >>>
>>> >>>  No, because you didn't update the CrawlDB with the failed
>>> >>>  segment, so the state remains the same as it was before the failed
>>> >>>  segment was generated, assuming generate.update.crawldb is false.
>>> >>>
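>>> >>>  On a local filesystem that re-fetch looks roughly like this (the
>>> >>>  segment name is a placeholder; on HDFS use hadoop fs -rmr instead
>>> >>>  of rm -rf), leaving only crawl_generate behind:
>>> >>>
>>> >>>    rm -rf crawl/segments/<failed-segment>/content
>>> >>>    rm -rf crawl/segments/<failed-segment>/crawl_fetch
>>> >>>    rm -rf crawl/segments/<failed-segment>/crawl_parse
>>> >>>    rm -rf crawl/segments/<failed-segment>/parse_data
>>> >>>    rm -rf crawl/segments/<failed-segment>/parse_text
>>> >>>    bin/nutch fetch crawl/segments/<failed-segment>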
>>> >>>>
>>> >>>>> A segment cannot be half complete; it either fails or succeeds
>>> >>>>> entirely.
>>> >>>>
>>> >>>> Well, doesn't that encourage us to put a small number of urls in
>>> >>>> a segment and thus make sure that if the segment fails because of
>>> >>>> a single url, I'll have to re-crawl only a "reasonable" number of
>>> >>>> urls?
>>> >>>
>>> >>>  Indeed, having fewer records per segment reduces the problem of a
>>> >>>  failed segment. But it would be best to avoid running out of
>>> >>>  memory in the first place. Parsing fetchers are prone to running
>>> >>>  out of memory, but with parsing disabled it's actually quite hard
>>> >>>  to run out of memory.
>>> >>>
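>>> >>>  For example, segment size can be capped at generate time and
>>> >>>  parsing run as a separate step afterwards (a sketch; the -topN
>>> >>>  value is arbitrary and this assumes fetcher.parse is false):
>>> >>>
>>> >>>    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
>>> >>>    bin/nutch fetch crawl/segments/<segment>
>>> >>>    bin/nutch parse crawl/segments/<segment>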
>>> >>>>
>>> >>>> About the second question:
>>> >>>> Assume that I run nutch on a huge amount of files and one file's
>>> >>>> parsing fails, let's say because the file was corrupted. I fix the
>>> >>>> file and now I want to re-crawl it.
>>> >>>> If I just run crawling again, the file won't be fetched because of
>>> >>>> the fetch interval.
>>> >>>> I can try running nutch on this file separately and then merging,
>>> >>>> but that doesn't work as it should: while merging something goes
>>> >>>> wrong and I find myself with a smaller index than I had before.
>>> >>>> So the question is: how can I recrawl a single url while running a
>>> >>>> nutch crawl on an existing input?
>>> >>>
>>> >>>  Ah, I see. Use the FreeGenerator tool to generate a segment from
>>> >>>  a plain text input file.
>>> >>>
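>>> >>>  For a single fixed url that could look roughly like this (file
>>> >>>  and directory names are placeholders):
>>> >>>
>>> >>>    mkdir -p recrawl-urls
>>> >>>    echo "http://www.example.com/fixed-file.pdf" > recrawl-urls/seed.txt
>>> >>>    bin/nutch freegen recrawl-urls crawl/segments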
>>> >>>>
>>> >>>> Any other insights on these issues would be appreciated.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> Markus Jelsma-2 wrote
>>> >>>>>
>>> >>>>> hi,
>>> >>>>>
>>> >>>>>  On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@"
>>> >>>>>  <nutch.buddy@> wrote:
>>> >>>>>> Hi,
>>> >>>>>> There are some failure scenarios in nutch which I'm not sure how
>>> >>>>>> to handle.
>>> >>>>>>
>>> >>>>>> 1. I run nutch on a huge amount of urls and some kind of OOM
>>> >>>>>> exception is thrown, or one of those "cannot allocate memory"
>>> >>>>>> errors. The result is that my segment is half complete.
>>> >>>>>
>>> >>>>>  A segment cannot be half complete; it either fails or succeeds
>>> >>>>>  entirely.
>>> >>>>>
>>> >>>>>> How can I recover from this? Do I have to recrawl all the urls
>>> >>>>>> that
>>> >>>>>> were in
>>> >>>>>> the segment?
>>> >>>>>> If so, how do I mark them for recrawl in crawldb?
>>> >>>>>
>>> >>>>>  You can just generate new segments or remove all directories from
>>> >>>>> the
>>> >>>>>  segment except crawl_generate and start fetching again.
>>> >>>>>
>>> >>>>>>
>>> >>>>>> 2. I run nutch on a huge amount of urls and some urls are not
>>> >>>>>> parsed successfully.
>>> >>>>>> I get an index which has all the urls that worked and doesn't
>>> >>>>>> have the ones that didn't work.
>>> >>>>>> How can I handle those without having to recrawl the whole
>>> >>>>>> thing?
>>> >>>>>
>>> >>>>>  I don't understand.
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Thanks.
>>> >>>
>>> >>> --
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>> 
> 

