Hi Kris,

The indexer can also filter by URL. You can create an extra configuration
file used only for indexing and pass it to the indexing job alone, together
with the option -filter to enable URL filtering (off by default):

  bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter

Make sure that the extra file is properly placed / packed so that it is found
on the classpath (in conf/ for local mode, or packed into the job file for
distributed mode). Since most undesired URLs (.jpeg, etc.) are already
filtered during the crawl, for better performance the file should contain only
those rules required to keep the index clean. Also note that the -D...
arguments must precede all other arguments.
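
For the seed page in your example, regex-urlfilter-index.txt could look roughly
like the sketch below. It is untested against your setup, and the host is left
as the XXXX placeholder from your mail. Rules are checked top to bottom, the
first matching rule decides, and a URL that matches no rule is rejected, hence
the final catch-all:

  # skip the seed page when indexing (it is still crawled for its links)
  -^https://XXXX/inside/default\.cfm$

  # index everything else
  +.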

Best,
Sebastian

On 12/12/2016 08:54 PM, KRIS MUSSHORN wrote:
> I'm using nutch 1.12 and Solr 5.4.1.  
>    
> Crawling a website and indexing into nutch.  
>   
> AFAIK the regex-urlfilter.txt file will cause content to not be crawled..  
>    
> what if I have  
> https://XXXX/inside/default.cfm  as my seed url...  
> I want the links on this page to be crawled and indexed but I do not want 
> this page to be indexed into SOLR.  
> How would I set this up?  
>    
> I'm thinking that the regex-urlfilter.txt file is NOT the right place. 
> 
