Is there any documentation about the limits of a single Nutch crawler,
running with just the built-in Hadoop?

I started off the crawler using the runbot.sh script, set topN to
1000, and let it fly.  I set up a cron job so that it kicks off every few
hours.  It went pretty well for a few days, then I noticed I was
getting some IOExceptions in the fetcher.  I increased the heap size for
Hadoop, and it ran a bit better.  I started a crawl on Tuesday and that
same job was still running today.  The LinkDb map-reduce jobs were taking
over 3 hours to run.  I'm up to 147,813 URLs in Solr.
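For reference, the relevant pieces of my setup look roughly like this (the schedule, paths, and heap value below are illustrative, not my exact config):

```shell
# crontab entry: kick off runbot.sh every few hours
# (illustrative path and schedule)
0 */4 * * * /opt/nutch/runbot.sh >> /opt/nutch/logs/runbot.cron.log 2>&1

# heap bump for local-mode runs, picked up by bin/nutch
# (value in MB; illustrative -- tune to your machine)
export NUTCH_HEAPSIZE=2000
```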

In order to get more content, I've created a fresh crawl directory
(keeping the old one) and a new logs directory (keeping the old one), and
kicked that off.  I'm getting some new content, but since the new LinkDb
is empty I'm definitely re-fetching URLs that are already in Solr.
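One option I'm considering to cut down the duplicates is merging the old crawl's databases into the new crawl, so the new crawl knows what has already been fetched. A sketch of what I have in mind (directory names are illustrative; assumes the mergedb/mergelinkdb commands in Nutch 1.x):

```shell
# Merge the old crawldb into the new one so already-fetched URLs are
# tracked (paths are illustrative)
bin/nutch mergedb crawl-merged/crawldb crawl-old/crawldb crawl-new/crawldb

# The linkdbs can be merged the same way
bin/nutch mergelinkdb crawl-merged/linkdb crawl-old/linkdb crawl-new/linkdb
```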

I'm just trying to see how far I can push a single instance of Nutch
without having to set up an external Hadoop.  I'm running on the use case
of a person/group needing a small to medium sized index of local content,
so it doesn't need to be "web scale."

Thanks!

-- Chris
