Sorry, I misunderstood completely. You can enable filtering (and normalizing) for the solr-indexer job in trunk:
http://wiki.apache.org/nutch/bin/nutch%20solrindex

This will enable you to crawl everything but restrict what gets sent down to the index from your crawl. hth

Lewis

On Friday, July 19, 2013, dogrdon <[email protected]> wrote:
> Hi Lewis, thanks for the quick reply, but I actually don't understand this:
>
> As far as I can tell,
>
> +^http://www.oursite.com/([a-z0-9\-A-Z]*\/)* in regex-urlfilter.txt
> means that it will crawl all pages under that main domain, which is what I
> want.
>
> If I set it to -^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*, it crawls
> nothing and says there are no URLs to fetch.
>
> How is it that I *can* crawl my whole site, with the exception of skipping
> over a few paths?
>
> Sorry if my confusion is confusing :)
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Why-aren-t-my-path-exclusions-getting-excluded-in-the-Nutch-index-to-Solr-tp4079172p4079205.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
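[Editor's note: the exclusion behavior the poster is asking about depends on rule ordering. Nutch's regex-urlfilter evaluates rules top-down and the first matching rule wins, so path exclusions must appear *before* the catch-all include. A minimal sketch of regex-urlfilter.txt — the `/private/` and `/calendar/` paths are made-up examples, not from the thread:]

```
# regex-urlfilter.txt -- sketch only; /private/ and /calendar/ are
# hypothetical paths. Rules are applied top-down and the first match
# wins, so exclusions must come BEFORE the catch-all include.

# skip these paths
-^http://www.oursite.com/private/
-^http://www.oursite.com/calendar/

# then accept everything else under the main domain
+^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*

# reject anything that did not match above
-.
```

If the exclusions are listed after the `+` rule, the include matches first and the `-` rules are never reached, which would produce exactly the symptom described in the thread.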
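[Editor's note: for Lewis's suggestion of filtering at index time, the solr-indexer job in trunk accepts `-filter` and `-normalize` flags, per the wiki page linked above. A hedged sketch of the invocation — the Solr URL and crawl-directory paths are placeholders; check the usage output of `bin/nutch solrindex` for your exact version:]

```
# sketch only -- URL and paths are placeholders
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
  -linkdb crawl/linkdb crawl/segments/* \
  -filter -normalize
```

With `-filter`, URLs rejected by the URL filters (e.g. regex-urlfilter.txt) are dropped at indexing time, so you can crawl the whole site but keep excluded paths out of the Solr index.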

