>> As far as I know all URLs are long resolved before ever being passed to any
>> filter. The parser is responsible for resolving relative to absolute.

Well, my rules with explicit pattern matches for absolute URLs, including the
protocol and domain, failed until I made the protocol and domain optional.

Doesn't work...

    -^(http://[^/]+)/([\w\-]+)

Works...

    -^(http://[^/]+)?/([\w\-]+)


On 21/12/2011, at 8:04 AM, Markus Jelsma wrote:

>
>> Thanks, I was aware of these precedence rules but strayed a bit from them
>> as I tweaked to try and get the results I wanted.
>>
>> What really helped was realising that URLs are not resolved into absolute
>> links before they are tested so patterns need to match however they appear
>> in parsed content. The hadoop.log file only displays absolute URLs which
>> can be misleading.
>
> As far as I know all URLs are long resolved before ever being passed to any
> filter. The parser is responsible for resolving relative to absolute.
>
>>
>> Second, this command line test for URL filtering saves a load of time and
>> effort when tuning rules.
>>
>> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
>>
>> Now enter a test URL and hit Enter. StdOut will show whether the URL passes
>> or fails current checks by displaying a plus or minus.
>>
>>> # Each non-comment, non-blank line contains a regular expression
>>> # prefixed by '+' or '-'. The first matching pattern in the file
>>> # determines whether a URL is included or ignored. If no pattern
>>> # matches, the URL is ignored.
>>>
>>> http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
>>>
>>>> Hi,
>>>>
>>>> I'm crawling a single web site and am going round in circles specifying
>>>> the correct type and order of regex expressions in regex-urlfilter.txt
>>>> to produce the following results:
>>>>
>>>> * Crawl no single level paths on the site other than the exceptions
>>>>   specified
>>>> * Crawl two or more level paths other than those under top level paths
>>>>   I've excluded
>>>>
>>>> I have the following directives in regex-urlfilter.txt:
>>>>
>>>> # Accept anything
>>>> +.
>>>>
>>>> # Exclude URLs under these top level paths
>>>> -.*/example/.*
>>>>
>>>> # Exclude pages located immediately under root
>>>> -^(http://)([^/]+/)([a-z]+)$
>>>>
>>>> #Allow exception URL located under root
>>>> +http://my.site.com/exception
>>>>
>>>> I can't get it to work. Variations are either too restrictive or ignore
>>>> the first level exclusion. I've tested the expressions elsewhere and
>>>> they match as required. Can anyone point me in the right direction here
>>>> please.
>>>>
>>>> Thanks,
>>>> Matt

.headfirst
WEB DEVELOPERS
.ENGAGING .USEFUL .WORKS
web:www.headfirst.co.nz
email:[email protected]
phone:(04) 498 5737
mobile:022 384 3874
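For anyone tuning similar rules, here is a rough sketch of the "first matching pattern wins" behaviour described in the template comments, applied to the directives from the original question. It is a Python approximation only, not Nutch code: the helper names (parse_rules, accepts) and the reordered rule set are my own illustration, and I am assuming partial-match semantics like Python's re.search. It also compares the two patterns from the top of this message against an absolute and a relative input.

```python
import re

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """First matching rule decides; no match at all means the URL is ignored."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# Directives as posted: '+.' comes first, so it matches every URL
# and the later rules are never consulted.
ORIGINAL = r"""
# Accept anything
+.
# Exclude URLs under these top level paths
-.*/example/.*
# Exclude pages located immediately under root
-^(http://)([^/]+/)([a-z]+)$
#Allow exception URL located under root
+http://my.site.com/exception
"""

# Hypothetical reordering: exception first, exclusions next, catch-all last.
REORDERED = r"""
+^http://my\.site\.com/exception$
-.*/example/.*
-^(http://)([^/]+/)([a-z]+)$
+.
"""

# The two patterns from the top of this message, without the leading '-'.
STRICT = re.compile(r"^(http://[^/]+)/([\w\-]+)")
OPTIONAL = re.compile(r"^(http://[^/]+)?/([\w\-]+)")

if __name__ == "__main__":
    for url in ("http://my.site.com/about",
                "http://my.site.com/exception",
                "http://my.site.com/example/page",
                "http://my.site.com/news/item"):
        print(url,
              "original:", "+" if accepts(parse_rules(ORIGINAL), url) else "-",
              "reordered:", "+" if accepts(parse_rules(REORDERED), url) else "-")
    for candidate in ("http://my.site.com/page", "/page"):
        print(candidate,
              "strict:", bool(STRICT.search(candidate)),
              "optional:", bool(OPTIONAL.search(candidate)))
```

In this sketch the strict pattern only matches input that starts with the protocol and domain, while the optional form also matches a bare path. That does not settle what the filter actually receives in Nutch; the URLFilterChecker command quoted above is the way to check real behaviour.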

