And what about the index? After updating the crawldb, do I then run the indexer on both segments and the updated crawldb?
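For reference, the whole single-URL sequence being discussed might look roughly like this on Nutch 1.x. This is a sketch only: the new_urls/ directory, the crawl/ paths and the Solr URL are placeholders, and the exact indexer arguments differ between Nutch versions.

    # Sketch only: new_urls/, crawl/ and the Solr URL are placeholders.
    # 1. Put the new or fixed URL(s) in a plain text file.
    mkdir -p new_urls
    echo "http://www.example.com/fixed-page.html" > new_urls/seed.txt

    # 2. Generate a segment directly from that file, bypassing the CrawlDB.
    bin/nutch freegen new_urls crawl/segments

    # 3. Pick up the segment freegen just created (the newest directory).
    SEGMENT=$(ls -d crawl/segments/* | sort | tail -1)

    # 4. Fetch and parse only that segment.
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT

    # 5. Fold the results into the CrawlDB and (optionally) the LinkDb.
    bin/nutch updatedb crawl/crawldb $SEGMENT
    bin/nutch invertlinks crawl/linkdb $SEGMENT

    # 6. Index the updated CrawlDB together with just the new segment;
    #    previously indexed segments do not need to be re-indexed.
    #    (Older 1.x releases take the linkdb positionally; newer ones use -linkdb.)
    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb $SEGMENT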
remi tassing wrote:
> I don't think so!
>
> freegen will generate a new segment and you don't need to merge it with the others.
> Then you can (fetch and) parse the content from that new segment.
> Finally you just need to update your crawldb (with updatedb).
>
> Remi
>
> On Tue, Apr 10, 2012 at 6:01 PM, nutch.buddy@ <nutch.buddy@> wrote:
>> So if I understand you correctly, I end up having to merge the crawldb, the segments and the indexes, right?
>>
>> Markus Jelsma-2 wrote:
>>> The quickest method now is to use the FreeGenerator tool, fetch the small segment, update a new temporary CrawlDB with that segment and run the indexer pointing to that temporary CrawlDB and segment. You can clear the temporary CrawlDB and update the bigger CrawlDB later with other segments in one go.
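A rough sketch of that temporary-CrawlDB shortcut, assuming the new segment has already been generated, fetched and parsed as in the earlier sketch. Again, crawl/crawldb_tmp, crawl/linkdb and the Solr URL are placeholders, and the indexer arguments depend on the Nutch version.

    # Sketch only: assumes $SEGMENT points at the freshly fetched and parsed segment.

    # Update a *temporary* CrawlDB with just this segment; updatedb creates it
    # if it does not exist yet, so the big CrawlDB is left untouched.
    bin/nutch updatedb crawl/crawldb_tmp $SEGMENT

    # Index immediately against the temporary CrawlDB and the new segment.
    # (Older 1.x releases take the linkdb positionally; newer ones use -linkdb.)
    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb_tmp crawl/linkdb $SEGMENT

    # Later, drop the temporary CrawlDB and update the main one with all
    # pending segments in a single pass.
    rm -r crawl/crawldb_tmp
    bin/nutch updatedb crawl/crawldb -dir crawl/segments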
>>> On Tue, 10 Apr 2012 02:51:55 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>> Ok, now what about a scenario in which I want to add a new url?
>>>> Then I must run it through the whole nutch process, because it has to be inserted into the crawldb.
>>>> But how can I avoid reprocessing the whole existing crawldb?
>>>>
>>>> Assume that I want to add a new url to the index and have it available in search right away, without having to wait for all the urls in the crawldb that are due to be refetched.
>>>>
>>>> Markus Jelsma-2 wrote:
>>>>> On Tue, 10 Apr 2012 01:37:51 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>>>> You can just generate new segments or remove all directories from the segment except crawl_generate and start fetching again.
>>>>>>
>>>>>> But in such a case, the url is already present in the crawldb in fetched status. So if I run crawling on these urls, won't the file be considered as "already fetched"?
>>>>>
>>>>> No, because you didn't update the CrawlDB with the failed segment, so the state remains the same as before generating the failing segment, assuming generate.update.crawldb is false.
>>>>>
>>>>>>> A segment cannot be half complete, it either fails or succeeds entirely.
>>>>>>
>>>>>> Well, doesn't that encourage us to put a small number of urls in a segment, and thus make sure that if the segment fails because of a single url, I'll have to re-crawl only a "reasonable" amount of urls?
>>>>>
>>>>> Indeed, having fewer records per segment reduces the problem of a failed segment. But it would be best to avoid running out of memory in the first place. Parsing fetchers are prone to running out of memory, but with parsing disabled it's actually quite hard to run out of memory.
>>>>>
>>>>>> About the second question: assume that I run nutch on a huge number of files and one file's parsing fails, let's say because the file was corrupted. I fix the file and now I want to re-crawl it. If I just run crawling again, the file won't be fetched because of the fetch interval. I can try running nutch on this file separately and then merging, but that doesn't work as it should: while merging something goes wrong and I find myself with a smaller index than I had before. So the question is: how can I recrawl a single url while running nutch crawl on an existing input?
>>>>>
>>>>> Ah, I see. Use the FreeGenerator tool to generate a segment from a plain text input file.
>>>>>
>>>>>> Any other insights on these issues will be appreciated.
>>>>>>
>>>>>> Markus Jelsma-2 wrote:
>>>>>>> hi,
>>>>>>>
>>>>>>> On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>>>>>> Hi,
>>>>>>>> There are some scenarios of failure in nutch which I'm not sure how to handle.
>>>>>>>>
>>>>>>>> 1. I run nutch on a huge number of urls and some kind of OOM exception is thrown, or one of those "cannot allocate memory" errors. The result is that my segment is half complete.
>>>>>>>
>>>>>>> A segment cannot be half complete, it either fails or succeeds entirely.
>>>>>>>
>>>>>>>> How can I recover from this? Do I have to recrawl all the urls that were in the segment? If so, how do I mark them for recrawl in the crawldb?
>>>>>>>
>>>>>>> You can just generate new segments or remove all directories from the segment except crawl_generate and start fetching again.
>>>>>>>
>>>>>>>> 2. I run nutch on a huge number of urls and some urls are not parsed successfully. I get an index which has all the urls that worked and doesn't have the ones that didn't work. How can I handle them without having to recrawl the whole thing?
>>>>>>>
>>>>>>> I don't understand.
>>>>>>>
>>>>>>>> Thanks.
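The recovery Markus suggests above for a segment whose fetch died part-way (keep crawl_generate, throw away the partial output, fetch again) might look roughly like this; the segment path below is a placeholder for the failed segment.

    # Sketch only: the segment path is a placeholder.
    SEGMENT=crawl/segments/20120410103000

    # Drop the partial output; keep only crawl_generate.
    rm -rf $SEGMENT/crawl_fetch $SEGMENT/content \
           $SEGMENT/crawl_parse $SEGMENT/parse_data $SEGMENT/parse_text

    # Fetch and parse the same segment again. Because the CrawlDB was never
    # updated with the failed segment (and generate.update.crawldb defaults
    # to false), the URLs are still in their previous state.
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT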

