Don't forget to re-filter the crawl database after making changes to the URL
filters, or the URLs already stored there will be regenerated and fetched again.
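
As a sketch, assuming Nutch 1.x and a crawl directory at `crawl/` (adjust the
paths to your own setup), one way to re-apply the current URL filters to an
existing crawldb is the `mergedb` command with its `-filter` flag:

```shell
# Re-run the configured URL filters (including regex-urlfilter.txt) over the
# existing crawldb; entries rejected by a filter are dropped from the output.
# "crawl/crawldb" and "crawl/crawldb_filtered" are placeholder paths.
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

# Swap the filtered db into place before the next generate/fetch cycle.
mv crawl/crawldb crawl/crawldb_old
mv crawl/crawldb_filtered crawl/crawldb
```

Note also that the regex URL filter evaluates its rules top to bottom and the
first match wins, so the `-.*/private/.*` exclusion needs to appear *before*
the broad `+^http://www.oursite.com/...` accept rule in regex-urlfilter.txt.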
-----Original message-----
> From:dogrdon <[email protected]>
> Sent: Friday 19th July 2013 18:44
> To: [email protected]
> Subject: Why aren't my path exclusions getting excluded in the Nutch index to
> Solr?
>
> So in crawling and indexing our site to Solr via Nutch, we need to be able to
> exclude any content that falls under a certain path.
>
> So say we have our site: http://oursite.com/ and we have a path that we
> don't want to index at http://oursite.com/private/
>
> I have http://oursite.com/ in the seed.txt file and
> +^http://www.oursite.com/([a-z0-9\-A-Z]*\/)* in the regex-urlfilter.txt file
>
> I thought that putting: -.*/private/.* also in the regex-urlfilter.txt file
> would exclude that path and anything under it, but the crawler is still
> fetching and indexing content under the /private/ path.
>
> Is there some kind of restart I need to do on the server, like Solr? Or is
> my regex not actually the right way to do this?
>
> thanks
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Why-aren-t-my-path-exclusions-getting-excluded-in-the-Nutch-index-to-Solr-tp4079172.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>