Don't forget to re-filter the crawl database after making changes to the URL
filters, or the URLs already stored there will be regenerated and fetched again.
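
As a sketch, assuming Nutch 1.x and a crawl directory at `crawl/` (adjust the
paths to your own setup), one way to re-apply the current URL filters to an
existing crawldb is the `mergedb` command with its `-filter` flag:

```shell
# Re-run the configured URL filters (including regex-urlfilter.txt) over the
# existing crawldb; entries rejected by a filter are dropped from the output.
# "crawl/crawldb" and "crawl/crawldb_filtered" are placeholder paths.
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

# Swap the filtered db into place before the next generate/fetch cycle.
mv crawl/crawldb crawl/crawldb_old
mv crawl/crawldb_filtered crawl/crawldb
```

Note also that the regex URL filter evaluates its rules top to bottom and the
first match wins, so the `-.*/private/.*` exclusion needs to appear *before*
the broad `+^http://www.oursite.com/...` accept rule in regex-urlfilter.txt.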
-----Original message-----
> From:dogrdon <[email protected]>
> Sent: Friday 19th July 2013 18:44
> To: [email protected]
> Subject: Why aren't my path exclusions getting excluded in the Nutch index to
> Solr?
>
> So in crawling and indexing our site to Solr via Nutch, we need to be able to
> exclude any content that falls under a certain path.
>
> So say we have our site: http://oursite.com/ and we have a path that we
> don't want to index at http://oursite.com/private/
>
> I have http://oursite.com/ in the seed.txt file and
> +^http://www.oursite.com/([a-z0-9\-A-Z]*\/)* in the regex-urlfilter.txt file
>
> I thought that putting: -.*/private/.* also in the regex-urlfilter.txt file
> would exclude that path and anything under it, but the crawler is still
> fetching and indexing content under the /private/ path.
>
> Is there some kind of restart I need to do on the server, like Solr? Or is
> my regex not actually the right way to do this?
>
> thanks
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Why-aren-t-my-path-exclusions-getting-excluded-in-the-Nutch-index-to-Solr-tp4079172.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>