Sebastian,

I am triggering Nutch with a bash script that fires crawl. How would I set it up to use the index filtering?
Kris

----- Original Message -----
From: "Sebastian Nagel" <[email protected]>
To: [email protected]
Sent: Tuesday, December 13, 2016 6:11:52 AM
Subject: Re: config help

Hi Kris,

also the indexer can filter by URL. It's possible to create an extra configuration file used only for indexing and to set it only for the indexing job, in combination with the option -filter, which enables URL filtering (off by default):

  bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter

Make sure that the extra file is properly placed / packed so that it is found. Since most undesired URLs are already filtered (.jpeg, etc.), for better performance the file should contain only those rules required to keep the index clean.

Also note that the -D... arguments must precede all other arguments.

Best,
Sebastian

On 12/12/2016 08:54 PM, KRIS MUSSHORN wrote:
> I'm using nutch 1.12 and Solr 5.4.1.
>
> Crawling a website and indexing into nutch.
>
> AFAIK the regex-urlfilter.txt file will cause content to not be crawled..
>
> what if I have
> https://XXXX/inside/default.cfm as my seed url...
> I want the links on this page to be crawled and indexed but I do not want
> this page to be indexed into SOLR.
> How would I set this up?
>
> I'm thinking that the regex-urlfilter.txt file is NOT the right place.
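
For illustration, a minimal sketch of what the index-only filter file described above might look like for the seed page in the original question. It assumes a local-mode setup where the file sits in the Nutch conf directory (taken from NUTCH_CONF_DIR, falling back to ./conf, which is an assumption made here), and it keeps the XXXX host placeholder exactly as given. Rules use the same syntax as regex-urlfilter.txt: a leading - rejects, a leading + accepts, and the first matching rule wins.

```shell
#!/usr/bin/env bash
# Sketch only: writes an index-only URL filter file.
# Assumption: conf directory is $NUTCH_CONF_DIR, or ./conf if unset.
set -eu

CONF_DIR="${NUTCH_CONF_DIR:-./conf}"
FILTER_FILE="$CONF_DIR/regex-urlfilter-index.txt"

mkdir -p "$CONF_DIR"

# Quoted heredoc delimiter so backslashes and $ are written literally.
cat > "$FILTER_FILE" <<'EOF'
# Keep the seed page out of the index (it is still crawled for links).
-^https://XXXX/inside/default\.cfm$
# Accept every other URL; crawl-time filtering has already been applied.
+.
EOF

echo "wrote $FILTER_FILE"

# The indexing job would then be pointed at this file, with the -D
# argument first and -filter enabled, as in the message above:
#   bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt ... -filter
```

The crawl-time regex-urlfilter.txt is left untouched, so the seed page is still fetched and its outlinks followed; only the indexing step sees the extra rule.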

