When crawling our site with Nutch and indexing it into Solr, we need to
exclude any content that falls under a certain path.

Say our site is http://oursite.com/ and the path we don't want to index is
http://oursite.com/private/

I have http://oursite.com/ in the seed.txt file and this line in the
regex-urlfilter.txt file:

+^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*

I thought that also adding this line to the regex-urlfilter.txt file:

-.*/private/.*

would exclude that path and anything under it, but the crawler is still
fetching and indexing content under the /private/ path.
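
One thing that may matter here is the order of the two rules. Below is a
minimal sketch in Python, not Nutch code. It assumes the regex URL filter
reads rules top to bottom, lets the first matching rule decide, and rejects
URLs that match no rule (as the comments in the default regex-urlfilter.txt
describe), and that patterns are applied as a search rather than a full
match. The helper first_match_filter and the two page URLs are hypothetical;
only the two patterns and the seed URL come from the configuration above.

```python
import re

def first_match_filter(rules, url):
    """Return True if url is accepted under first-match-wins semantics.

    rules is an ordered list of (sign, pattern) pairs, where sign is
    "+" (accept) or "-" (reject). A URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# The two patterns quoted in this post.
ACCEPT_SITE = r"^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*"
REJECT_PRIVATE = r".*/private/.*"

# Same rules, two different orders.
accept_first = [("+", ACCEPT_SITE), ("-", REJECT_PRIVATE)]
reject_first = [("-", REJECT_PRIVATE), ("+", ACCEPT_SITE)]

# Hypothetical page URLs used only for illustration.
private_url = "http://www.oursite.com/private/page.html"
public_url = "http://www.oursite.com/docs/page.html"
# The seed URL as written in seed.txt (no "www.").
seed_url = "http://oursite.com/"

print(first_match_filter(accept_first, private_url))  # True: "+" rule matches first
print(first_match_filter(reject_first, private_url))  # False: "-" rule matches first
print(first_match_filter(reject_first, public_url))   # True
print(first_match_filter(reject_first, seed_url))     # False: "+" pattern requires "www."
```

If those assumptions hold, the exclusion only takes effect when the "-" line
appears above the "+" line. The last check also shows that the "+" pattern as
quoted does not match the seed URL without "www.", which may or may not
matter depending on how the site redirects.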

Do I need to restart something on the server, such as Solr, for the change
to take effect? Or is my regex not actually the right way to do this?

Thanks
