You can use these regex rules to fit your requirements:

# skip the parent dir, including files in the parent dir
-^http://my.domain.name/dir/.*/$
# accept subdirectories
+^http://my.domain.name/dir/.*/.*

You can use this command to test your settings:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

You can also use this site to debug your regex:
http://www.debuggex.com/

On Thu, Aug 8, 2013 at 9:09 PM, stone2dbone <[email protected]> wrote:

> Unfortunately,
>
> -^http://my.domain.name/dir/$
>
> didn't work for me. I need to skip just the documents in the directory, but
> this skips all the subdirectories as well. Is there another solution, or
> possibly some way to go back and remove all the parent directories after
> the crawl?
>
> Thanks for your help.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4083287.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)
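P.S. If you want to check the two filter rules from this reply without a Nutch install, here is a rough Python sketch. It is not Nutch code. It assumes the regex URL filter reads rules top to bottom, lets the first matching rule decide, and rejects a URL that matches no rule. The helper names (parse_rules, accepts) and the sample URLs are made up for illustration, and a real regex-urlfilter.txt normally contains more rules than these two.

```python
import re

# The two rules from this reply, in file order.
RULES = """
# skip parent dir include files in parent dir
-^http://my.domain.name/dir/.*/$
# accept subdirectories
+^http://my.domain.name/dir/.*/.*
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    # Hypothetical sample URLs, one per case of interest.
    samples = [
        "http://my.domain.name/dir/",
        "http://my.domain.name/dir/file.html",
        "http://my.domain.name/dir/sub/",
        "http://my.domain.name/dir/sub/page.html",
    ]
    for url in samples:
        print("+" if accepts(rules, url) else "-", url)
```

Under these assumptions only the last sample is accepted. The subdirectory URL that ends in a slash is rejected by the first rule, so if your subdirectory listings are served at such URLs it is worth confirming the result with the bin/nutch command above before crawling.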

