On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel <[email protected]> wrote:
> Hi Chris,
>
>> I started off the crawler, using the runbot.sh script
>
> Which Nutch version and what script is used?

Nutch 1.6. Sorry, it's the newer "crawl" script (I just have a runbot.sh that calls it and writes the output to a file).

>> I'm up to 147813 URLs in Solr.
>
> Because there are also redirects, robots=noindex, and
> other URLs fetched but not indexed, the crawled content
> is somewhat larger. But it should be possible to get
> this amount crawled on a single node.

OK.

>> The LinkDB mapreduce calls were taking over 3 hours to run.
>
> That's probably because filtering and normalization is on:
> in this case all existing Links are normalized and filtered.
> If Outlinks are normalized and filtered during parse,
> this can be avoided and inverting links should get faster.
> Hadoop (but not in local mode) will also speed up the
> job: normalization and filtering is done in the mapper
> and as such is ideal for parallelization.

How do I normalize & filter outlinks? Do I just need to add some arguments to the crawl script, or are they configuration settings?

Thanks!
--
Chris
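[For the archives: a sketch of what parse-time normalization/filtering looks like in Nutch 1.x. The property names below are taken from nutch-default.xml as I understand it; check them against the defaults shipped with your Nutch version, since I haven't verified them against 1.6 specifically. Paths and directory names are placeholders.]

```xml
<!-- nutch-site.xml (sketch): normalize and filter outlink URLs at
     parse time, so the LinkDb inversion does not have to repeat the
     work on every existing link. Verify these property names against
     your version's nutch-default.xml before relying on them. -->
<property>
  <name>parse.normalize.urls</name>
  <value>true</value>
</property>
<property>
  <name>parse.filter.urls</name>
  <value>true</value>
</property>
```

With outlinks already cleaned at parse time, the LinkDb step can then skip normalization and filtering, e.g. `bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter` (the `-noNormalize`/`-noFilter` flags belong to the `invertlinks` command; whether your copy of the `crawl` script already passes them, or needs a small edit to do so, depends on the script version).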

