Does maxSegments control the number of segments per level?
Can I be sure that if I have 1 million pages at a certain level, and I'm
not setting the topN parameter (so it stays at its default, MAX LONG), and
I set maxSegments to 4, then for that level I'll get 4 segments of 250K
pages each?

Or have I misunderstood your answer?
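
To make the question concrete, here is a minimal Python sketch of how the
reply quoted below could be read: each segment holds roughly topN records,
and at most maxSegments segments are generated per cycle. This is only a
model of that reading, not Nutch's actual generator code; the function name
planned_segment_sizes and its parameters are hypothetical.

```python
MAX_LONG = 2**63 - 1  # Java's Long.MAX_VALUE, the default topN mentioned above


def planned_segment_sizes(eligible_urls, top_n=MAX_LONG, max_segments=1):
    """Hypothetical model, not Nutch code.

    Assumes each segment is filled with up to top_n URLs and that
    generation stops once max_segments segments have been created.
    """
    sizes = []
    remaining = eligible_urls
    while remaining > 0 and len(sizes) < max_segments:
        size = min(top_n, remaining)  # a segment never exceeds top_n
        sizes.append(size)
        remaining -= size
    return sizes
```

Under this reading, planned_segment_sizes(1_000_000, max_segments=4) gives a
single segment of 1,000,000 pages, because the default topN is never reached.
Four segments of 250K would need topN set explicitly, as in
planned_segment_sizes(1_000_000, top_n=250_000, max_segments=4). Whether the
real generator behaves this way is exactly what I am asking.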



Markus Jelsma-2 wrote
> 
> On Mon, 7 May 2012 22:52:52 -0700 (PDT), "nutch.buddy@" 
>  <nutch.buddy@> wrote:
>> Yeah, I meant an unexpected failure that crashed the job, like an OOM.
>>
>> Regarding topN - the Nutch tutorial says:
>> "-topN N determines the maximum number of pages that will be retrieved at
>> each level up to the depth."
>>
>> Does it mean that when the limit is reached, no more URLs on this level
>> will be added to the fetch list? In other words, does it mean that Nutch
>> will not fetch all the URLs?
> 
>  Yes. Only N records are generated for this fetch/update cycle. 
>  Therefore, use maxSegments to control how many segments of ~N records 
>  are generated.
> 
>>
>>
>>
>> Markus Jelsma-2 wrote
>>>
>>> On Mon, 7 May 2012 22:31:43 -0700 (PDT), "nutch.buddy@"
>>>  <nutch.buddy@> wrote:
>>>> In a previous discussion about handling failures in Nutch, it was
>>>> mentioned that a broken segment cannot be fixed and its URLs should be
>>>> re-crawled.
>>>> Thus, it seems that there should be a way to control segment size, so
>>>> that one can limit the risk of having to re-crawl a huge number of URLs
>>>> if only one of them fails.
>>>
>>>  If one of what fails? It's not as if a single URL fails; the whole
>>>  segment fails. A segment has failed when the fetcher unexpectedly dies
>>>  and is not successfully retried by Hadoop.
>>>
>>>>
>>>> Is there an existing way in Nutch to do this?
>>>
>>>  Sure, the -topN parameter of the generator tool.
>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>>
>>>> 
>>>> http://lucene.472066.n3.nabble.com/Is-it-possible-to-control-the-segment-size-tp3970452.html
>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>> --
>>>  Markus Jelsma - CTO - Openindex
>>>  http://www.linkedin.com/in/markus17
>>>  050-8536600 / 06-50258350
>>>
>>
> 


