Thanks, I was aware of these precedence rules but strayed a bit from them as I tweaked to try and get the results I wanted.
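For the record, the shape of ruleset that satisfies my original requirements
looks roughly like the sketch below. It is untested as written (host and
paths taken from my original mail, quoted further down), and it assumes URLs
reach the filter in absolute form, which is exactly what the checker
mentioned below is good for verifying. The key point is that the first
matching pattern wins, so the exception must come before the exclusions, and
the broad accept comes last:

    # Allow the exception URL located under root
    +^http://my\.site\.com/exception$

    # Exclude URLs under this top-level path
    -^http://my\.site\.com/example/

    # Exclude any other page located immediately under root
    -^http://my\.site\.com/[^/]+/?$

    # Accept everything else on the site
    +^http://my\.site\.com/

    # Reject anything else
    -.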
What really helped was realising that URLs are not resolved into absolute
links before they are tested, so patterns need to match URLs exactly as they
appear in the parsed content (a link that appears as /about is tested in
that form, not as http://my.site.com/about). The hadoop.log file only
displays absolute URLs, which can be misleading.

Second, this command-line test for URL filtering saves a load of time and
effort when tuning rules:

    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Enter a test URL and hit Enter. Stdout shows whether the URL passes or fails
the current checks by printing it prefixed with a plus or a minus.

> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
>
>> Hi,
>>
>> I'm crawling a single web site and am going round in circles specifying
>> the correct type and order of regex expressions in regex-urlfilter.txt to
>> produce the following results:
>>
>> * Crawl no single-level paths on the site other than the exceptions
>>   specified
>> * Crawl two-or-more-level paths other than those under the top-level
>>   paths I've excluded
>>
>> I have the following directives in regex-urlfilter.txt:
>>
>> # Accept anything
>> +.
>>
>> # Exclude URLs under these top level paths
>> -.*/example/.*
>>
>> # Exclude pages located immediately under root
>> -^(http://)([^/]+/)([a-z]+)$
>>
>> # Allow exception URL located under root
>> +http://my.site.com/exception
>>
>> I can't get it to work. Variations are either too restrictive or ignore
>> the first-level exclusion. I've tested the expressions elsewhere and they
>> match as required. Can anyone point me in the right direction here,
>> please?
>>
>> Thanks,
>> Matt
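To make the ordering problem concrete: in the rules quoted above, +. is the
first pattern, so every URL matches it and is accepted before any of the
exclusions are consulted. A checker session against those rules would look
something like this (output format from memory, so treat it as approximate;
the checker echoes each URL prefixed with + or -):

    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    http://my.site.com/example/deep/page
    +http://my.site.com/example/deep/page
    http://my.site.com/onelevel
    +http://my.site.com/onelevel

Both URLs pass even though later rules are meant to exclude them, which is
why moving +. to the end (or replacing it with a site-specific accept, as in
the sketch above) fixes the behaviour.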

