> Thanks, I was aware of these precedence rules but strayed a bit from them
> as I tweaked to try and get the results I wanted.
>
> What really helped was realising that URLs are not resolved into absolute
> links before they are tested so patterns need to match however they appear
> in parsed content. The hadoop.log file only displays absolute URLs which
> can be misleading.

As far as I know, all URLs are resolved long before they are ever passed to
any filter. The parser is responsible for resolving relative URLs to absolute
ones.

> Second, this command line test for URL filtering saves a load of time and
> effort when tuning rules.
>
> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
>
> Now enter a test URL and hit Enter. StdOut will show whether the URL passes
> or fails current checks by displaying a plus or minus.
>
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'. The first matching pattern in the file
> > # determines whether a URL is included or ignored. If no pattern
> > # matches, the URL is ignored.
> >
> > http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
> >
> >> Hi,
> >>
> >> I'm crawling a single web site and am going round in circles specifying
> >> the correct type and order of regex expressions in regex-urlfilter.txt
> >> to produce the following results:
> >>
> >> * Crawl no single level paths on the site other than the exceptions
> >>   specified
> >> * Crawl two or more level paths other than those under top
> >>   level paths I've excluded
> >>
> >> I have the following directives in regex-urlfilter.txt:
> >>
> >> # Accept anything
> >> +.
> >>
> >> # Exclude URLs under these top level paths
> >> -.*/example/.*
> >>
> >> # Exclude pages located immediately under root
> >> -^(http://)([^/]+/)([a-z]+)$
> >>
> >> #Allow exception URL located under root
> >> +http://my.site.com/exception
> >>
> >> I can't get it to work. Variations are either too restrictive or ignore
> >> the first level exclusion. I've tested the expressions elsewhere and
> >> they match as required. Can anyone point me in the right direction here
> >> please.
> >>
> >> Thanks,
> >> Matt
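
For anyone tuning rules offline, here is a minimal Python sketch of the "first matching pattern wins" rule quoted from the template above. It is not Nutch code: `parse_rules` and `accepts` are hypothetical helpers, and the sketch assumes each pattern is applied as an unanchored search (Python's `re.search`), which may differ in detail from what the Nutch regex filter does. The `REORDERED` rule set is only one possible ordering and has not been checked against a real crawl; `URLFilterChecker` remains the authoritative test.

```python
import re

# The directives from the original post, in their original order.
ORIGINAL = """
# Accept anything
+.
# Exclude URLs under these top level paths
-.*/example/.*
# Exclude pages located immediately under root
-^(http://)([^/]+/)([a-z]+)$
# Allow exception URL located under root
+http://my.site.com/exception
"""

# Assumed reordering: exception first, exclusions next, catch-all last.
REORDERED = """
+http://my.site.com/exception
-.*/example/.*
-^(http://)([^/]+/)([a-z]+)$
+.
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches; reject if none match."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


if __name__ == "__main__":
    urls = [
        "http://my.site.com/exception",
        "http://my.site.com/about",
        "http://my.site.com/example/page",
        "http://my.site.com/docs/guide",
    ]
    for name, text in (("original", ORIGINAL), ("reordered", REORDERED)):
        rules = parse_rules(text)
        for url in urls:
            print(name, "+" if accepts(rules, url) else "-", url)
```

Under these assumptions the original ordering accepts every URL, because `+.` comes first and matches any non-empty string, so the later `-` rules are never reached. Moving the catch-all to the end lets the exception and the exclusions take effect before it.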

