Re: config help

KRIS MUSSHORN Tue, 13 Dec 2016 05:31:18 -0800

Sebastian, I am triggering Nutch with a bash script that fires crawl. 
How would I set it up to use the index filtering?

Kris 

----- Original Message -----

From: "Sebastian Nagel" <[email protected]> 
To: [email protected] 
Sent: Tuesday, December 13, 2016 6:11:52 AM 
Subject: Re: config help 

Hi Kris, 

The indexer can also filter by URL. It's possible to create an extra 
configuration file used only for indexing and to set it only for the indexing 
job, in combination with the option -filter to enable URL filtering (off by 
default): 

bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter 

Make sure that the extra file is properly placed / packed so that it is found. 
Since most undesired URLs are already filtered (.jpeg, etc.), for better 
performance the file should contain only those rules required to keep the 
index clean. Also note that the -D... arguments must precede all other 
arguments. 
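
As a rough sketch of what such an index-only filter file might look like for 
the seed page from your question: the rules below are an assumption for 
illustration, not a tested configuration. They use the usual 
regex-urlfilter format (a leading "-" rejects, a leading "+" accepts, the 
first matching rule wins), and XXXX is the placeholder host from your mail. 

```shell
#!/bin/sh
# Write a hypothetical index-only URL filter file.
# The quoted 'EOF' keeps the shell from expanding "$" inside the rules.
cat > regex-urlfilter-index.txt <<'EOF'
# do not index the seed page itself (assumed rule, adjust to the real URL)
-^https://XXXX/inside/default\.cfm$
# index everything else that was crawled
+.
EOF

# Show the result so the rules can be checked before indexing.
cat regex-urlfilter-index.txt
```

The file would then have to sit wherever Nutch picks up its configuration 
(typically the conf/ directory, or packed into the job file) before it is 
passed to the index command shown above via -Durlfilter.regex.file together 
with -filter. 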

Best, 
Sebastian 

On 12/12/2016 08:54 PM, KRIS MUSSHORN wrote: 
> I'm using nutch 1.12 and Solr 5.4.1. 
> 
> Crawling a website and indexing into nutch. 
> 
> AFAIK the regex-urlfilter.txt file will cause content to not be crawled.. 
> 
> what if I have 
> https://XXXX/inside/default.cfm as my seed url... 
> I want the links on this page to be crawled and indexed but I do not want 
> this page to be indexed into SOLR. 
> How would I set this up? 
> 
> I'm thinking that the regex-urlfilter.txt file is NOT the right place. 
>
