Hi Chris,

> How do I normalize & filter outlinks? Do I just need to add some arguments
> to the crawl script, or are they configuration settings?
Well, for historical reasons there are three possible ways to
enable/disable URL normalizers and filters:

* configuration properties
* command-line arguments for tools (updatedb, index, etc.)
* inverse command-line arguments (-noFilter/-noNormalize) for
  other tools (parse, invertlinks)

Command-line arguments always override configuration properties.

Sebastian

On 07/25/2014 01:44 PM, Christopher Gross wrote:
> On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel <[email protected]>
> wrote:
>
>> Hi Chris,
>>
>>> I started off the crawler, using the runbot.sh script
>> Which Nutch version and what script is used?
>>
> Nutch 1.6
> Sorry, it's the newer "crawl" script (I just have a runbot.sh that calls it
> and writes the output to a file).
>
>>> I'm up to 147813 URLs in Solr.
>> Because there are also redirects, robots=noindex, and
>> other URLs fetched but not indexed, the crawled content
>> is somewhat larger. But it should be possible to get
>> this amount crawled on a single node.
>>
> OK.
>
>>> The LinkDB mapreduce calls were taking over 3 hours to run.
>> That's probably because filtering and normalization are on:
>> in this case all existing links are normalized and filtered.
>> If outlinks are normalized and filtered during parse,
>> this can be avoided and inverting links gets faster.
>> Hadoop (but not in local mode) will also speed up the
>> job: normalization and filtering are done in the mapper
>> and thus are ideal for parallelization.
>>
> How do I normalize & filter outlinks? Do I just need to add some arguments
> to the crawl script, or are they configuration settings?
>
> Thanks!
>
> -- Chris
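As a minimal sketch of the first mechanism (configuration properties), parse-time filtering and normalization of outlinks can be switched on in conf/nutch-site.xml. The property names parse.filter.urls and parse.normalize.urls are an assumption here and should be checked against the nutch-default.xml shipped with your Nutch version:

```xml
<!-- conf/nutch-site.xml (sketch): filter and normalize outlinks at
     parse time, so that invertlinks/updatedb can skip that work later.
     Verify these property names exist in your version's nutch-default.xml. -->
<property>
  <name>parse.filter.urls</name>
  <value>true</value>
</property>
<property>
  <name>parse.normalize.urls</name>
  <value>true</value>
</property>
```

With outlinks already cleaned at parse time, the inverse command-line flags can then be used to skip the work in link inversion, e.g. `bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter` (paths are illustrative), since command-line arguments always override the configuration properties.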

