Re: Prevent crawl of parent URL

feng lu Thu, 08 Aug 2013 09:06:35 -0700

You can use these regex rules to fit your requirements:

 # skip the parent dir, including files in the parent dir
-^http://my.domain.name/dir/.*/$


 # accept subdirectories
+^http://my.domain.name/dir/.*/.*
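
To see what these two rules do before touching a crawl, here is a rough sketch in Python that mimics the filter. It rests on some assumptions: the first rule whose pattern is found in the URL decides, a URL that matches no rule is rejected (as I understand the Nutch regex filter to behave), and the file has no other rules such as a trailing "+." catch-all. Python's re module stands in for Java's regex engine, and the sample URLs and the accepts() helper are made up for illustration.

```python
import re

# The two rules from above, in file order, as (sign, compiled pattern).
# Assumption: no other rules (for example a trailing "+.") are in the file.
RULES = [
    ("-", re.compile(r"^http://my.domain.name/dir/.*/$")),
    ("+", re.compile(r"^http://my.domain.name/dir/.*/.*")),
]


def accepts(url):
    """Return True if the URL would pass the rules above.

    Assumed semantics: the first rule whose pattern is found in the URL
    decides the outcome; a URL that matches no rule is rejected.
    """
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


# Hypothetical sample URLs, for illustration only.
SAMPLE_URLS = [
    "http://my.domain.name/dir/",
    "http://my.domain.name/dir/file.html",
    "http://my.domain.name/dir/sub/",
    "http://my.domain.name/dir/sub/page.html",
]

for url in SAMPLE_URLS:
    print(("+" if accepts(url) else "-") + url)
# -http://my.domain.name/dir/
# -http://my.domain.name/dir/file.html
# -http://my.domain.name/dir/sub/
# +http://my.domain.name/dir/sub/page.html
```

Under these assumptions the parent dir and the files directly in it are dropped because neither rule matches them, while any URL under dir that ends in "/" is dropped by the first rule. If your subdirectory pages are only reachable through URLs ending in "/", you may need to adjust the first rule.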

You can use this command to test your settings:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

You can also use this site to debug your regex:

http://www.debuggex.com/


On Thu, Aug 8, 2013 at 9:09 PM, stone2dbone <[email protected]> wrote:

> Unfortunately,
>
> -^http://my.domain.name/dir/$
>
> didn't work for me. I need to skip just the documents in the directory, but
> this skips all the subdirectories as well. Is there another solution, or
> possibly some way to go back and remove all the parent directories after
> the crawl?
>
> Thanks for your help.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4083287.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Don't Grow Old, Grow Up... :-)
