Sorry I misunderstood completely.
You can enable filtering (and normalizing) for the solr-indexer job in trunk

http://wiki.apache.org/nutch/bin/nutch%20solrindex

This will enable you to crawl everything but restrict what gets sent
from your crawldb to the index.
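For example, the solrindex job in trunk accepts -filter and -normalize
switches. A rough sketch of the invocation (the Solr URL, crawldb, linkdb,
and segment paths below are placeholders for your own setup):

```shell
# Index into Solr, applying the configured URL filters (-filter)
# and normalizers (-normalize) at index time rather than fetch time.
# All paths and the Solr URL are placeholders for your own layout.
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
    -linkdb crawl/linkdb crawl/segments/* -filter -normalize
```

Note that if the indexer reads the same regex-urlfilter.txt as the fetch
cycle, any exclusion rules would also block crawling, so you would
typically point the index job at a stricter rule set than the one used
for fetching (for instance via the urlfilter.regex.file property, if your
version supports overriding it per job).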

hth
Lewis

On Friday, July 19, 2013, dogrdon <[email protected]> wrote:
> Hi Lewis, thanks for a quick reply, but I actually don't understand this:
>
> as far as I can tell,
>
> +^http://www.oursite.com/([a-z0-9\-A-Z]*\/)* in the regex-urlfilter.txt
> means that it will crawl all pages under that main domain, which is what I
> want.
>
> If I set it to -^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*, it crawls
> nothing and says there are no URLs to fetch.
>
> How is it that I *can* crawl my whole site while still skipping over a
> few paths?
>
> sorry if my confusion is confusing :)
>
>
>
> --
> View this message in context:
http://lucene.472066.n3.nabble.com/Why-aren-t-my-path-exclusions-getting-excluded-in-the-Nutch-index-to-Solr-tp4079172p4079205.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
*Lewis*
