On Mon, 7 May 2012 22:31:43 -0700 (PDT), "[email protected]"
<[email protected]> wrote:
> In a previous discussion about handling of failures in Nutch, it was
> mentioned that a broken segment cannot be fixed and its URLs should
> be re-crawled.
> Thus, it seems there should be a way to control segment size, so that
> one can limit the risk of having to re-crawl a huge number of URLs if
> only one of them fails.
If one what fails? It's not that if a single URL fails, the whole
segment has failed. A segment fails when the fetcher unexpectedly dies
and the task is not successfully retried by Hadoop.
> Any existing way in Nutch to do this?
Sure, the -topN parameter of the generator tool.
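For example, assuming a typical crawl directory layout of crawl/crawldb and crawl/segments (the paths here are illustrative), a generate run can be capped at 50,000 URLs per segment like this:

```shell
# Generate a fetch list containing at most the 50,000 top-scoring URLs.
# This bounds the segment size, so if a fetch cycle dies and the segment
# is lost, only that bounded set of URLs needs to be re-crawled.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
```

With smaller segments you simply run the generate/fetch/updatedb cycle more often, trading a little overhead for a smaller blast radius when a segment breaks.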
--
View this message in context:
http://lucene.472066.n3.nabble.com/Is-it-possible-to-control-the-segment-size-tp3970452.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350