Does maxSegments control the number of segments per level? Can I be sure that if I have 1 million pages in a certain level, I'm not setting the topN parameter (so it stays at its default, MAX LONG), and I set maxSegments to 4, then for that level I'll have 4 segments of 250K pages each?
Or have I misunderstood your answer?


Markus Jelsma-2 wrote
>
> On Mon, 7 May 2012 22:52:52 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>> Yeah I've meant an unexpected failure that crashed the job, like OOM.
>>
>> Regarding topN - Nutch tutorial says:
>> "-topN N determines the maximum number of pages that will be retrieved at
>> each level up to the depth."
>>
>> Does it mean that when the limit is reached, no more urls on this level will
>> be added to the fetch list, or in other words - does is mean that nutch will
>> not fetch all the urls?
>
> Yes. Only N records are generated for this fetch/update cycle.
> Therefore, use maxSegments to control how many segments of ~N records
> are generated.
>
>> Markus Jelsma-2 wrote
>>>
>>> On Mon, 7 May 2012 22:31:43 -0700 (PDT), "nutch.buddy@" <nutch.buddy@> wrote:
>>>> In a previous discussion about handling of failures in nutch, it was
>>>> mentioned that a broken segment cannot be fixed and it's urls should be
>>>> re-crawled.
>>>> Thus, it seems that there should be a way to control segment size, so that
>>>> one can limit the risk of having to re-crawl a huge amount of urls if only
>>>> one of them fails.
>>>
>>> If one what fails? It's not as if one URL's fails, the whole segment
>>> has failed. A segment is failed when the fetcher unexpectedly dies and
>>> is not successfully retried by Hadoop.
>>>
>>>> Any existing way in nutch to do this?
>>>
>>> Sure, the -topN parameter of the generator tool.
>>>
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/Is-it-possible-to-control-the-segment-size-tp3970452.html
>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
>>> http://www.linkedin.com/in/markus17
>>> 050-8536600 / 06-50258350
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Is-it-possible-to-control-the-segment-size-tp3970452p3970478.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-control-the-segment-size-tp3970452p3970500.html
Sent from the Nutch - User mailing list archive at Nabble.com.
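
One way to read the quoted reply is that each segment holds roughly topN records and maxSegments caps how many such segments a single generate run produces. The sketch below models only that reading. It is not Nutch's generator code: `plan_segments`, `top_n` and `max_segments` are hypothetical names chosen for illustration, and the real generator has more moving parts that are ignored here. If the reading is right, leaving topN at its default would put all 1 million pages into one segment, and getting 4 segments of 250K would require setting topN to 250000 as well.

```python
# Simplified model of the reading above, NOT Nutch's actual generator code.
# plan_segments, top_n and max_segments are hypothetical names for illustration.

MAX_LONG = 2**63 - 1  # stands in for the default topN ("MAX LONG" in the question)


def plan_segments(eligible_urls, top_n=MAX_LONG, max_segments=1):
    """Return the segment sizes for one generate run under this model."""
    sizes = []
    remaining = eligible_urls
    # Keep opening segments until the URLs run out or the segment cap is hit.
    while remaining > 0 and len(sizes) < max_segments:
        size = min(top_n, remaining)  # a segment holds at most top_n records
        sizes.append(size)
        remaining -= size
    return sizes


if __name__ == "__main__":
    # Default topN: nothing forces a split, so one segment takes everything.
    print(plan_segments(1_000_000, max_segments=4))
    # topN of 250K with up to 4 segments: four equal segments.
    print(plan_segments(1_000_000, top_n=250_000, max_segments=4))
    # Smaller topN: 4 segments are filled, the rest waits for a later cycle.
    print(plan_segments(1_000_000, top_n=100_000, max_segments=4))
```

In this model the URLs that do not fit into the allowed segments are simply left for a later generate/fetch/update cycle, which matches the "Only N records are generated for this fetch/update cycle" part of the reply.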

