Is there any documentation about the limits of a single Nutch crawler, running with just the built-in Hadoop?
I started the crawler using the runbot.sh script, set topN to 1000, and let it fly, with a cron job kicking it off every few hours. It went pretty well for a few days, then I noticed I was getting some IOExceptions in the fetcher. I tweaked the heap size for Hadoop, and it ran a bit better.

I started a crawl on Tuesday, and that same job was still running today; the LinkDb mapreduce phases were taking over 3 hours to run. I'm up to 147,813 URLs in Solr. To get more content, I created a fresh crawl directory (keeping the old one) and a new logs directory (keeping the old one), and kicked that off. I'm getting some new content, but since the new LinkDb is empty, I'm definitely re-fetching URLs that are already in Solr.

I'm just trying to see how far I can push a single instance of Nutch without having to set up an external Hadoop cluster. My use case is a person or group needing a small-to-medium index of local content, so it doesn't need to be "web scale." Thanks!

-- Chris
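For reference, here is roughly what the setup looks like (the install path and log location below are placeholders, not my actual paths; NUTCH_HEAPSIZE is the variable bin/nutch reads for the JVM heap, in MB):

```shell
# Illustrative crontab entry: kick off runbot.sh every 4 hours,
# appending its output to a log file (paths are placeholders).
#   0 */4 * * * /opt/nutch/runbot.sh >> /opt/nutch/logs/runbot.log 2>&1

# The heap tweak: bin/nutch reads NUTCH_HEAPSIZE (in MB, default 1000)
# and passes it to the JVM running the local Hadoop jobs.
export NUTCH_HEAPSIZE=2000
```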

