> Thanks, I was aware of these precedence rules but strayed a bit from them
> as I tweaked to try and get the results I wanted.
> 
> What really helped was realising that URLs are not resolved into absolute
> links before they are tested so patterns need to match however they appear
> in parsed content. The hadoop.log file only displays absolute URLs which
> can be misleading.

As far as I know, all URLs are resolved long before they are ever passed to
any filter. The parser is responsible for resolving relative URLs to absolute
ones.
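
To illustrate what I mean by resolving, here is a minimal Python sketch. It is
not Nutch's parser code, just the standard library doing the same kind of
relative-to-absolute resolution; the page URL and the hrefs are made up for
the example.

```python
from urllib.parse import urljoin

def resolve(base_url, href):
    """Resolve a link as found in a page against that page's URL."""
    return urljoin(base_url, href)

# Hypothetical page and outlinks, for illustration only.
page = "http://my.site.com/section/page.html"
for href in ("other", "../exception", "/exception"):
    print(href, "->", resolve(page, href))
```

If filters only ever see the resolved form, a pattern has to be written
against the absolute URL, which is what the log shows.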

> 
> Second, this command line test for URL filtering saves a load of time and
> effort when tuning rules.
> 
> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> 
> Now enter a test URL and hit Enter. StdOut will show whether the URL passes
> or fails current checks by displaying a plus or minus.
> 
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'. The first matching pattern in the file
> > # determines whether a URL is included or ignored. If no pattern
> > # matches, the URL is ignored.
> > 
> > 
> > 
> > http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
> > 
> >> Hi,
> >> 
> >> I'm crawling a single web site and am going round in circles specifying
> >> the correct type and order of regex expressions in regex-urlfilter.txt
> >> to produce the following results:
> >> 
> >> * Crawl no single level paths on the site other than the exceptions
> >>   specified
> >> * Crawl two or more level paths other than those under top level paths
> >>   I've excluded
> >> 
> >> 
> >> I have the following directives in regex-urlfilter.txt:
> >> 
> >> 
> >> # Accept anything
> >> +.
> >> 
> >> # Exclude URLs under these top level paths
> >> -.*/example/.*
> >> 
> >> # Exclude pages located immediately under root
> >> -^(http://)([^/]+/)([a-z]+)$
> >> 
> >> #Allow exception URL located under root
> >> +http://my.site.com/exception
> >> 
> >> 
> >> I can't get it to work. Variations are either too restrictive or ignore
> >> the first level exclusion. I've tested the expressions elsewhere and
> >> they match as required. Can anyone point me in the right direction here
> >> please.
> >> 
> >> Thanks,
> >> Matt
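
For anyone finding this thread later, the first-match rule quoted from the
template can be sketched in a few lines of Python. This is a rough model and
not Nutch's implementation: the function names are mine, and it assumes a
pattern may match anywhere in the URL rather than having to match all of it.
It only shows why a leading "+." makes every later rule unreachable.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Return the verdict of the first matching rule; ignore if none match."""
    for include, pattern in rules:
        if pattern.search(url):  # assumption: partial match is enough
            return include
    return False

# Order as posted: "+." comes first and matches every URL.
AS_POSTED = """
# Accept anything
+.
# Exclude URLs under these top level paths
-.*/example/.*
"""

# Same two rules with the catch-all moved to the end.
CATCH_ALL_LAST = """
-.*/example/.*
+.
"""

url = "http://my.site.com/example/page"
print(accepts(load_rules(AS_POSTED), url))       # True: exclusion never reached
print(accepts(load_rules(CATCH_ALL_LAST), url))  # False: exclusion matches first
```

The same reasoning applies to the other exclusions and the exception in the
original post: whichever rule should win for a given URL has to appear before
any broader rule that also matches it.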
