Hi,

-----Original message-----
> From:stone2dbone <[email protected]>
> Sent: Wednesday 24th July 2013 14:56
> To: [email protected]
> Subject: Prevent crawl of parent URL
>
> I would like to crawl everything in
>
> http://my.domain.name/dir/subdir
>
> but nothing in its parent
>
> http://my.domain.name/dir/
>
> In regex-urlfilter.txt I have the following:
>
> # skip URLs
> -^http://my.domain.name/dir/

This rule will also skip all URLs below this depth, because the pattern matches any URL that starts with it. You need to mark the end of the URL:

-^http://my.domain.name/dir/$

>
> # accept URLs
> +^http://my.domain.name/dir/subdir/*
>
> but Nutch still crawls the skip URLs. Any suggestions how to correct this
> behavior?

Did you refilter the crawldb? Modifying the URL filters does not magically update the DB.

Cheers

>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
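
To illustrate why the end-of-URL anchor matters, here is a minimal Python sketch of the filtering logic. It assumes that rules are tried from top to bottom, that the first pattern found in the URL decides, and that a URL matching no rule is dropped, which is how I understand the regex URL filter to behave. The `url_filter` helper and the `page.html` URL are hypothetical and only serve the illustration; this is not Nutch code.

```python
import re

def url_filter(rules, url):
    """Return True if the URL is accepted, False if it is rejected.

    Assumption: the first rule whose pattern is found in the URL decides,
    and a URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# Rules as posted: the skip rule is an unanchored prefix match.
original = [
    ("-", r"^http://my.domain.name/dir/"),
    ("+", r"^http://my.domain.name/dir/subdir/*"),
]

# Rules with the suggested fix: the skip rule only matches the parent itself.
anchored = [
    ("-", r"^http://my.domain.name/dir/$"),
    ("+", r"^http://my.domain.name/dir/subdir/*"),
]

urls = [
    "http://my.domain.name/dir/",
    "http://my.domain.name/dir/subdir/page.html",  # hypothetical page
]

for url in urls:
    print(url)
    print("  original rules:", "accept" if url_filter(original, url) else "skip")
    print("  anchored rules:", "accept" if url_filter(anchored, url) else "skip")
```

With the original rules the skip pattern matches first for both URLs, so the subdir page never reaches the accept rule. With the `$` added, the skip rule only matches the parent URL and the subdir page falls through to the accept rule.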

