> You can just generate new segments or remove all directories from the
> segment except crawl_generate and start fetching again.
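If I understand the suggestion correctly, the recovery would look roughly
like this (the segment path is made up, and I'm assuming a local
filesystem; on HDFS it would be "hadoop fs -rmr" instead of rm):

  SEGMENT=crawl/segments/20120410123456

  # keep only the fetch list that generate produced, drop the partial output
  for dir in crawl_fetch content crawl_parse parse_data parse_text; do
    rm -rf "$SEGMENT/$dir"
  done

  # re-run fetch/parse against the surviving crawl_generate
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"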
But in that case the URL is already present in the crawldb with "fetched"
status. If I run the crawl on these URLs again, won't they be considered
"already fetched"?

> A segment cannot be half complete, it either fails or succeeds
> entirely.

Doesn't that encourage us to keep the number of URLs per segment small, so
that if a segment fails because of a single URL, only a "reasonable" number
of URLs has to be re-crawled?

About the second question: assume I run Nutch on a huge number of files and
one file's parsing fails, say because the file was corrupted. I fix the
file and now want to re-crawl it. If I simply run the crawl again, the file
won't be fetched because of the fetch interval. I can try running Nutch on
this file separately and then merging, but that doesn't work as it should:
something goes wrong during the merge and I end up with a smaller index
than I had before.

So the question is: how can I recrawl a single URL while running a Nutch
crawl on an existing input? My current best guess is sketched below.

Any other insights on these issues would be appreciated.
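For the record, here is that guess, pieced together from the tools I can
see in the distribution; the URL, paths and day count are invented, and I'm
not sure either route is the intended one:

  # check what the crawldb currently thinks about the URL
  bin/nutch readdb crawl/crawldb -url http://example.com/fixed-file.pdf

  # option 1: build a one-off fetch list with freegen, which (as far as I
  # can tell) ignores crawldb status entirely
  mkdir -p urls-recrawl
  echo "http://example.com/fixed-file.pdf" > urls-recrawl/seed.txt
  bin/nutch freegen urls-recrawl crawl/segments

  SEGMENT=$(ls -d crawl/segments/* | tail -1)   # the segment freegen created
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"

  # option 2: shift the generator's clock forward so it re-selects URLs
  # whose fetch interval has not yet elapsed -- but this affects every due
  # URL, not just the one I fixed
  bin/nutch generate crawl/crawldb crawl/segments -adddays 30

freegen looks attractive because it bypasses the crawldb, but I don't know
how updatedb and a later indexing run will treat the result, which is why
I'm asking.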
Markus Jelsma-2 wrote:
> hi,
>
> On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), "nutch.buddy@" <nutch.buddy@>
> wrote:
>> Hi
>> There are some scenarios of failure in Nutch which I'm not sure how to
>> handle.
>>
>> 1. I run Nutch on a huge number of URLs and some kind of OOM exception
>> is thrown, or one of those "cannot allocate memory" errors. The result
>> is that my segment is half complete.
>
> A segment cannot be half complete, it either fails or succeeds
> entirely.
>
>> How can I recover from this? Do I have to recrawl all the URLs that
>> were in the segment?
>> If so, how do I mark them for recrawl in the crawldb?
>
> You can just generate new segments or remove all directories from the
> segment except crawl_generate and start fetching again.
>
>> 2. I run Nutch on a huge number of URLs and some URLs are not parsed
>> successfully.
>> I get an index which has all the URLs that worked and doesn't have the
>> ones that didn't.
>> How can I handle them without having to recrawl the whole thing?
>
> I don't understand.
>
>> Thanks.

--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-handle-failures-in-nutch-tp3898768p3899044.html
Sent from the Nutch - User mailing list archive at Nabble.com.

