Hi Chris,

> I started off the crawler, using the runbot.sh script
Which Nutch version and which script are you using?

> I'm up to 147813 URLs in Solr.
Because there are also redirects, robots=noindex pages, and
other URLs that are fetched but not indexed, the amount of
crawled content is somewhat larger than the Solr document count.
But it should be possible to crawl this amount on a single node.

> The LinkDB mapreduce calls were taking over 3 hours to run.
That's probably because filtering and normalization are enabled:
in this case all existing links are normalized and filtered again.
If outlinks are already normalized and filtered during parsing,
this can be avoided and inverting links should get faster.
Running on a Hadoop cluster (not in local mode) will also speed
up the job: normalization and filtering are done in the mapper
and are therefore ideal for parallelization.
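
For Nutch 1.x this would look roughly like (paths are just
examples, adjust to your crawl directory layout):

  bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter

and, to have outlinks filtered/normalized at parse time instead,
in nutch-site.xml:

  <property>
    <name>parse.filter.urls</name>
    <value>true</value>
  </property>
  <property>
    <name>parse.normalize.urls</name>
    <value>true</value>
  </property>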

> but since the LinkDb is empty I'm definitely
> getting URLs that are already in Solr.
The LinkDb holds the incoming links for each document
together with their anchor texts.
URLs and status information (unfetched, fetched, gone, etc.)
are kept in the CrawlDb. If the CrawlDb is removed, everything
is crawled anew from scratch, i.e. from the seeds.
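
You can inspect both databases without starting a new crawl,
e.g. (Nutch 1.x commands, paths again are examples):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump

The -stats output lists the number of URLs per status
(db_unfetched, db_fetched, db_gone, ...), which should also
explain the gap between crawled and indexed documents.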

Sebastian

On 07/24/2014 05:59 PM, Christopher Gross wrote:
> Is there any documentation about the limits of a single Nutch crawler,
> running with just the built-in Hadoop?
> 
> I started off the crawler, using the runbot.sh script, and set the topN to
> 1000, and let it fly.  I set up a cron job so that it kicks off every few
> hours.  It went pretty well for a few days, then I noticed that I was
> getting some IO Exceptions on the fetcher.  I tweaked the heap size for
> Hadoop, and it ran a bit better.  I started a crawl on Tuesday and it was
> still running that same job today.  The LinkDB mapreduce calls were taking
> over 3 hours to run.  I'm up to 147813 URLs in Solr.
> 
> In order to get more content, I've created a fresh crawl directory (keeping
> the old one), and a new logs (keeping the old one), and kicked that off.
> I'm getting some new content, but since the LinkDb is empty I'm definitely
> getting URLs that are already in Solr.
> 
> I'm just trying to see how far I can push a single instance of Nutch
> without having to set up an external Hadoop.  I'm running on the use case
> of a person/group needing a small to medium sized index of local content,
> so it doesn't need to be "web scale."
> 
> Thanks!
> 
> -- Chris
> 
