On Mon, 7 May 2012 22:31:43 -0700 (PDT), "[email protected]" <[email protected]> wrote:
In a previous discussion about handling of failures in Nutch, it was
mentioned that a broken segment cannot be fixed and its URLs should be
re-crawled.
Thus, it seems there should be a way to control segment size, so that one can
limit the risk of having to re-crawl a huge number of URLs if only
one of them fails.

If one what fails? It's not that the whole segment fails because a single URL fails. A segment fails when the fetcher task unexpectedly dies and is not successfully retried by Hadoop.


Any existing way in nutch to do this?

Sure, the -topN parameter of the generator tool.
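To illustrate, a minimal sketch of capping segment size at generate time (the crawldb and segments paths are placeholders for your own crawl directory layout):

```
# Generate a fetch list of at most 50,000 URLs, so each segment
# stays small and a failed fetch cycle costs at most that many URLs.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
```

Each generate/fetch/parse/updatedb cycle then produces one bounded segment; you simply run more cycles to cover the full crawldb.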





--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
