Sebastian, can you provide more information?
I'd be fine with using command line arguments, unless you have a case
against it. Are there any of the filter files that I'd need to work with?
Is there a page in the wiki that covers this a bit more?

Thanks.

-- Chris

On Mon, Jul 28, 2014 at 5:29 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Chris,
>
> > How do I normalize & filter outlinks? Do I just need to add some
> > arguments to the crawl script, or are they configuration settings?
>
> Well, for historical reasons there are 3 possible ways to enable/disable
> URL normalizers and filters:
> * configuration properties
> * command-line arguments for tools (updatedb, index, etc.)
> * inverse command-line arguments (-noFilter/-noNormalize)
>   for other tools (parse, invertlinks)
> Command-line arguments always override configuration properties.
>
> Sebastian
>
> On 07/25/2014 01:44 PM, Christopher Gross wrote:
> > On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel
> > <[email protected]> wrote:
> >
> >> Hi Chris,
> >>
> >>> I started off the crawler, using the runbot.sh script
> >> Which Nutch version and what script is used?
> >>
> > Nutch 1.6
> > Sorry, it's the newer "crawl" script (I just have a runbot.sh that
> > calls it and writes the output to a file).
> >
> >>> I'm up to 147813 URLs in Solr.
> >> Because there are also redirects, robots=noindex, and
> >> other URLs fetched but not indexed, the crawled content
> >> is somewhat larger. But it should be possible to get
> >> this amount crawled on a single node.
> >>
> > OK.
> >
> >>> The LinkDB mapreduce calls were taking over 3 hours to run.
> >> That's probably because filtering and normalization is on:
> >> in this case all existing links are normalized and filtered.
> >> If outlinks are normalized and filtered during parse
> >> this can be avoided and inverting links should get faster.
> >> Hadoop (but not in local mode) will also speed up the
> >> job: normalization and filtering is done in the mapper
> >> and as such is ideal for parallelization.
> >>
> > How do I normalize & filter outlinks? Do I just need to add some
> > arguments to the crawl script, or are they configuration settings?
> >
> > Thanks!
> >
> > -- Chris
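[Editor's note: as a minimal sketch of the "configuration properties" route Sebastian describes, parse-time normalization and filtering can be enabled via overrides in conf/nutch-site.xml. The property names parse.normalize.urls and parse.filter.urls are taken from the Nutch 1.x nutch-default.xml and should be verified against the nutch-default.xml shipped with your release.]

```xml
<!-- conf/nutch-site.xml (sketch only; confirm property names against
     the nutch-default.xml of your Nutch release) -->
<configuration>
  <!-- Normalize outlink URLs during parsing, so the LinkDb
       inversion job does not have to repeat the work. -->
  <property>
    <name>parse.normalize.urls</name>
    <value>true</value>
  </property>
  <!-- Apply the URL filters (e.g. regex-urlfilter.txt) to
       outlinks during parsing as well. -->
  <property>
    <name>parse.filter.urls</name>
    <value>true</value>
  </property>
</configuration>
```

[With outlinks already normalized and filtered at parse time, invertlinks could then be run with the -noNormalize/-noFilter flags Sebastian mentions, so the LinkDb job skips the duplicate pass.]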

