Hi Chris,

> I started off the crawler, using the runbot.sh script

Which Nutch version and which script are you using?
> I'm up to 147813 URLs in Solr.

Because there are also redirects, robots=noindex, and other URLs that are
fetched but not indexed, the crawled content is somewhat larger. But it
should be possible to crawl this amount on a single node.

> The LinkDB mapreduce calls were taking over 3 hours to run.

That's probably because filtering and normalization are on: in this case
all existing links are normalized and filtered. If outlinks are normalized
and filtered during parse, this can be avoided and inverting links should
get faster. Hadoop (but not in local mode) will also speed up the job:
normalization and filtering are done in the mapper and are thus ideal for
parallelization.

> but since the LinkDb is empty I'm definitely
> getting URLs that are already in Solr.

LinkDb holds the incoming links for each document together with the anchor
texts. URLs and status information (unfetched, fetched, gone, etc.) are
contained in CrawlDb. Everything is crawled anew from scratch/seeds if
CrawlDb is removed.

Sebastian

On 07/24/2014 05:59 PM, Christopher Gross wrote:
> Is there any documentation about the limits of a single Nutch crawler,
> running with just the built-in Hadoop?
>
> I started off the crawler, using the runbot.sh script, and set the topN to
> 1000, and let it fly. I set up a cron job so that it kicks off every few
> hours. It went pretty well for a few days, then I noticed that I was
> getting some IO Exceptions on the fetcher. I tweaked the heap size for
> Hadoop, and it ran a bit better. I started a crawl on Tuesday and it was
> still running that same job today. The LinkDB mapreduce calls were taking
> over 3 hours to run. I'm up to 147813 URLs in Solr.
>
> In order to get more content, I've created a fresh crawl directory (keeping
> the old one), and a new logs (keeping the old one), and kicked that off.
> I'm getting some new content, but since the LinkDb is empty I'm definitely
> getting URLs that are already in Solr.
>
> I'm just trying to see how far I can push a single instance of Nutch
> without having to set up an external Hadoop. I'm running on the use case
> of a person/group needing a small to medium sized index of local content,
> so it doesn't need to be "web scale."
>
> Thanks!
>
> -- Chris
>
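For the slow LinkDb step discussed above, a sketch of the relevant knobs.
This assumes a Nutch 1.x installation; the crawl/linkdb and crawl/segments
paths are illustrative, not taken from Chris's setup:

```shell
# Invert links without re-applying URL normalizers and filters to every
# existing link; -noNormalize and -noFilter are flags of the Nutch 1.x
# invertlinks command.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter

# Outlinks can instead be normalized/filtered once at parse time via the
# parse.normalize.urls and parse.filter.urls properties (both default to
# true in nutch-default.xml), which makes skipping them here safe.
```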
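Since URL status lives in CrawlDb rather than LinkDb, it can be inspected
with the readdb tool. Again a Nutch 1.x sketch with an illustrative
crawl/crawldb path and example URL:

```shell
# Summarize the CrawlDb: total URL count plus a breakdown by status
# (db_unfetched, db_fetched, db_gone, db_redir_temp, ...).
bin/nutch readdb crawl/crawldb -stats

# Dump the CrawlDb entry for a single URL, including status and fetch time.
bin/nutch readdb crawl/crawldb -url http://example.com/
```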

