Then there is something odd with the regex. The ParserChecker's output, together with the code, shows that URLs are returned absolute by the parser.
Good luck!

> >> As far as I know all URLs are resolved long before ever being passed to
> >> any filter. The parser is responsible for resolving relative to
> >> absolute.
>
> Well, my rules with explicit pattern matches for absolute URLs, including
> the protocol and domain, failed until I made the protocol and domain
> optional.
>
> Doesn't work...
> -^(http://[^/]+)/([\w\-]+)
>
> Works...
> -^(http://[^/]+)?/([\w\-]+)
>
> On 21/12/2011, at 8:04 AM, Markus Jelsma wrote:
> >> Thanks, I was aware of these precedence rules but strayed a bit from
> >> them as I tweaked to try and get the results I wanted.
> >>
> >> What really helped was realising that URLs are not resolved into
> >> absolute links before they are tested, so patterns need to match however
> >> they appear in parsed content. The hadoop.log file only displays
> >> absolute URLs, which can be misleading.
> >
> > As far as I know all URLs are resolved long before ever being passed to
> > any filter. The parser is responsible for resolving relative to
> > absolute.
> >
> >> Second, this command-line test for URL filtering saves a load of time
> >> and effort when tuning rules:
> >>
> >> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> >>
> >> Now enter a test URL and hit Enter. Stdout will show whether the URL
> >> passes or fails the current checks by displaying a plus or minus.
> >>
> >>> # Each non-comment, non-blank line contains a regular expression
> >>> # prefixed by '+' or '-'. The first matching pattern in the file
> >>> # determines whether a URL is included or ignored. If no pattern
> >>> # matches, the URL is ignored.
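[Editor's note: the difference between the two patterns above can be checked outside Nutch with plain java.util.regex. This is a minimal sketch, not Nutch code; the sample URLs are hypothetical, and the point is only that the version with the optional `(http://[^/]+)?` group matches both relative and absolute forms, while the strict version rejects relative paths.]

```java
import java.util.regex.Pattern;

public class RegexUrlDemo {
    // Mimics the anchored, first-hit matching style of the filter rules:
    // returns true if the pattern matches starting at the URL's beginning.
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String strict  = "^(http://[^/]+)/([\\w\\-]+)";   // protocol+host required
        String lenient = "^(http://[^/]+)?/([\\w\\-]+)";  // protocol+host optional

        System.out.println(accepts(strict,  "/news"));                   // false
        System.out.println(accepts(lenient, "/news"));                   // true
        System.out.println(accepts(strict,  "http://my.site.com/news")); // true
        System.out.println(accepts(lenient, "http://my.site.com/news")); // true
    }
}
```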
> >>>
> >>> http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm crawling a single web site and am going round in circles
> >>>> specifying the correct type and order of regex expressions in
> >>>> regex-urlfilter.txt to produce the following results:
> >>>>
> >>>> * Crawl no single-level paths on the site other than the exceptions specified
> >>>> * Crawl two-or-more-level paths other than those under top-level paths I've excluded
> >>>>
> >>>> I have the following directives in regex-urlfilter.txt:
> >>>>
> >>>> # Accept anything
> >>>> +.
> >>>>
> >>>> # Exclude URLs under these top level paths
> >>>> -.*/example/.*
> >>>>
> >>>> # Exclude pages located immediately under root
> >>>> -^(http://)([^/]+/)([a-z]+)$
> >>>>
> >>>> # Allow exception URL located under root
> >>>> +http://my.site.com/exception
> >>>>
> >>>> I can't get it to work. Variations are either too restrictive or
> >>>> ignore the first-level exclusion. I've tested the expressions
> >>>> elsewhere and they match as required. Can anyone point me in the
> >>>> right direction here, please?
> >>>>
> >>>> Thanks,
> >>>> Matt

--
.headfirst
WEB DEVELOPERS .ENGAGING .USEFUL .WORKS
web: www.headfirst.co.nz
email: [email protected]
phone: (04) 498 5737
mobile: 022 384 3874
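[Editor's note: per the template comment quoted in the thread, "the first matching pattern in the file determines whether a URL is included or ignored." With `+.` at the top, every URL is accepted before any exclusion is ever consulted. A reordered sketch of the same rules, with the exception promoted to the front, might look like the following; the host and paths are the hypothetical ones from the original post, with dots escaped and anchors tightened, and this is a suggestion rather than a tested configuration:]

```
# Allow the exception URL first: the first matching pattern wins
+^http://my\.site\.com/exception

# Exclude URLs under these top-level paths
-.*/example/.*

# Exclude pages located immediately under root
-^http://[^/]+/[a-z]+$

# Accept anything else
+.
```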

