So if I understand you correctly, I end up having to merge the crawldb, the segments and the indexes... right?
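(If merging does turn out to be necessary, the stock tools are CrawlDbMerger and SegmentMerger. The following is only a sketch with made-up paths (crawl/, urgent/); as the quoted reply below suggests, simply updating the bigger CrawlDB with the extra segment later may be all that is needed, and when everything is indexed into a single Solr instance there is normally nothing to merge on the index side.)

  # merge a temporary CrawlDB into the main one (the output path comes first)
  bin/nutch mergedb crawl/crawldb_merged crawl/crawldb urgent/crawldb

  # optionally merge small segments into a single output segment
  bin/nutch mergesegs crawl/segments_merged -dir crawl/segments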
Markus Jelsma-2 wrote:
> The quickest method now is to use the FreeGenerator tool, fetch the
> small segment, update a new temporary CrawlDB with that segment and
> run the indexer pointing to that temporary CrawlDB and segment. You
> can clear the temporary CrawlDB and update the bigger CrawlDB later
> with other segments in one go.
>
> On Tue, 10 Apr 2012 02:51:55 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>> Ok, now what about a scenario in which I want to add a new url?
>> Then I must run it through the whole nutch process, because it
>> should be inserted into the crawldb.
>> But how can I avoid the whole existing crawldb being reprocessed?
>>
>> Assume that I want to add a new url to the index and deploy it in
>> the search right away, without having to wait for all the urls in
>> the crawldb that need to be refetched?
>>
>> Markus Jelsma-2 wrote:
>>> On Tue, 10 Apr 2012 01:37:51 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>> You can just generate new segments or remove all directories
>>>>> from the segment except crawl_generate and start fetching again.
>>>>
>>>> But in such a case, the url is already present in the crawldb in
>>>> fetched status. So if I run crawling on this url, won't the file
>>>> be considered as "already fetched"?
>>>
>>> No, because you didn't update the CrawlDB with the failed segment,
>>> so the state remains the same as before generating the failing
>>> segment, assuming generate.update.crawldb is false.
>>>
>>>>> A segment cannot be half complete, it either fails or succeeds
>>>>> entirely.
>>>>
>>>> Well, doesn't that encourage us to put a small number of urls in
>>>> a segment, and thus make sure that if the segment fails because
>>>> of a single url, I'll have to re-crawl only a "reasonable" amount
>>>> of urls?
>>>
>>> Indeed, having fewer records per segment reduces the problem of a
>>> failed segment. But it would be best to avoid running out of
>>> memory in the first place. Parsing fetchers are prone to running
>>> out of memory, but with parsing disabled it's actually quite hard
>>> to run out of memory.
>>>
>>>> About the second question:
>>>> Assume that I run nutch on a huge amount of files. One file's
>>>> parsing fails, let's say because the file was corrupted. I fix
>>>> the file and now I want to re-crawl it.
>>>> If I just run crawling again, the file won't be fetched because
>>>> of the fetch interval.
>>>> I can try to run nutch on this file separately and then merge,
>>>> but that doesn't work as it should: while merging something goes
>>>> wrong and I find myself with a smaller index than I had before.
>>>> So the question is - how can I recrawl a single url while running
>>>> nutch crawl on an existing input?
>>>
>>> Ah, I see. Use the FreeGenerator tool to generate a segment from a
>>> plain text input file.
>>>
>>>> Any other insights on these issues will be appreciated.
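(For reference, the "quickest method" at the top of the quote above would look roughly like this on a Nutch 1.x install. This is only a sketch: the urgent/ and crawl/ paths, the url and the Solr URL are made up, and the exact solrindex arguments differ between Nutch versions.)

  # one url per line in a plain text file
  mkdir -p urgent/urls
  echo "http://www.example.com/fixed.html" > urgent/urls/seed.txt

  # generate a segment straight from the file, bypassing the CrawlDB
  bin/nutch freegen urgent/urls urgent/segments
  SEG=$(ls -d urgent/segments/* | tail -1)

  # fetch and parse the small segment
  bin/nutch fetch $SEG
  bin/nutch parse $SEG

  # update a *temporary* CrawlDB with just this segment and index it
  bin/nutch updatedb urgent/crawldb $SEG
  bin/nutch invertlinks urgent/linkdb $SEG
  # newer Nutch versions take -linkdb <dir> instead of a positional linkdb
  bin/nutch solrindex http://localhost:8983/solr/ urgent/crawldb urgent/linkdb $SEG

  # later, drop urgent/crawldb and fold the segment into the main
  # CrawlDB together with other segments in one go
  bin/nutch updatedb crawl/crawldb $SEG crawl/segments/2012*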
>>>> Markus Jelsma-2 wrote:
>>>>> hi,
>>>>>
>>>>> On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>>> Hi,
>>>>>> There are some scenarios of failure in nutch which I'm not sure
>>>>>> how to handle.
>>>>>>
>>>>>> 1. I run nutch on a huge amount of urls and some kind of OOM
>>>>>> exception is thrown, or one of those "cannot allocate memory"
>>>>>> errors. The result is that my segment is half complete.
>>>>>
>>>>> A segment cannot be half complete, it either fails or succeeds
>>>>> entirely.
>>>>>
>>>>>> How can I recover from this? Do I have to recrawl all the urls
>>>>>> that were in the segment?
>>>>>> If so, how do I mark them for recrawl in the crawldb?
>>>>>
>>>>> You can just generate new segments or remove all directories
>>>>> from the segment except crawl_generate and start fetching again.
>>>>>
>>>>>> 2. I run nutch on a huge amount of urls and some urls are not
>>>>>> parsed successfully.
>>>>>> I get an index which has all the urls that worked and doesn't
>>>>>> have the ones that didn't work.
>>>>>> How can I handle them without having to recrawl the whole thing?
>>>>>
>>>>> I don't understand.
>>>>>
>>>>>> Thanks.
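(And a sketch of the recovery suggested above for a segment whose fetch died half-way, e.g. with an OOM: keep only crawl_generate and fetch the same segment again. The segment name is made up, and this assumes generate.update.crawldb was left at its default of false, so the CrawlDB is still in its pre-fetch state.)

  SEG=crawl/segments/20120410123456    # the failed segment (hypothetical name)

  # drop whatever the failed run produced, keep only the fetch list
  rm -rf $SEG/crawl_fetch $SEG/content $SEG/crawl_parse $SEG/parse_data $SEG/parse_text

  # fetch and parse again, and only now update the CrawlDB
  bin/nutch fetch $SEG
  bin/nutch parse $SEG
  bin/nutch updatedb crawl/crawldb $SEG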

