RE: Prevent crawl of parent URL

Markus Jelsma Wed, 24 Jul 2013 06:07:49 -0700

Hi

-----Original message-----
> From:stone2dbone <[email protected]>
> Sent: Wednesday 24th July 2013 14:56
> To: [email protected]
> Subject: Prevent crawl of parent URL
> 
> I would like to crawl everything in
> 
> http://my.domain.name/dir/subdir
> 
> but nothing in its parent
> 
> http://my.domain.name/dir/
> 
> In regex-urlfilter.txt I have the following:
> 
> # skip URLs
> -^http://my.domain.name/dir/


This rule also skips every URL below that directory, including everything 
under /dir/subdir. To skip only the parent itself, mark the end of the URL:
-^http://my.domain.name/dir/$
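
A minimal sketch of why the unanchored rule wins. It assumes the filter 
applies the first rule whose pattern is found in the URL and ignores a URL 
that matches no rule. Python's re.search stands in for the Java regex 
matching here, and accepts() and the two rule lists are hypothetical names 
used only for illustration, not Nutch code.

```python
import re

def accepts(url, rules):
    """Apply the first rule whose pattern is found in the URL."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False

# Rules as posted: the skip rule has no end-of-URL anchor.
original = [
    ("-", r"^http://my.domain.name/dir/"),
    ("+", r"^http://my.domain.name/dir/subdir/*"),
]

# Same rules with the skip rule anchored by "$".
anchored = [
    ("-", r"^http://my.domain.name/dir/$"),
    ("+", r"^http://my.domain.name/dir/subdir/*"),
]

parent = "http://my.domain.name/dir/"
child = "http://my.domain.name/dir/subdir/page.html"  # made-up example page

print(accepts(child, original))   # skip rule matches first
print(accepts(child, anchored))   # skip rule no longer matches, accept rule does
print(accepts(parent, anchored))  # parent is still skipped
```

Under these assumptions the anchored skip rule only ever matches the parent 
URL itself, so URLs under /dir/subdir fall through to the accept rule. Other 
pages directly under /dir/ are no longer caught by the skip rule, so whether 
they are crawled depends on the remaining rules in the file.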

> 
> # accept URLs
> +^http://my.domain.name/dir/subdir/*
> 
> but Nutch still crawls the skip URLs. Any suggestions how to correct this
> behavior?

Did you refilter the crawldb? Modifying the URL filters does not update the 
DB on its own, so URLs that were already stored there stay until the DB is 
filtered again.

Cheers

> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
