Then there is something odd with the regex. The ParserChecker's output, together with the code, proves that URLs are returned absolute by the parser.

Good luck!

> >> As far as I know, all URLs are resolved long before ever being passed to
> >> any filter. The parser is responsible for resolving relative to
> >> absolute.
> 
> Well, my rules with explicit pattern matches for absolute URLs including
> the protocol and domain failed until I made the protocol and domain
> optional.
> 
> Doesn't work...
> -^(http://[^/]+)/([\w\-]+)
> 
> Works...
> -^(http://[^/]+)?/([\w\-]+)
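A quick way to see the difference between the two patterns is a standalone java.util.regex check (this is not Nutch itself, and the test URL below is hypothetical):

```java
import java.util.regex.Pattern;

public class FilterRegexDemo {
    public static void main(String[] args) {
        // The pattern that reportedly fails in the filter
        Pattern strict = Pattern.compile("^(http://[^/]+)/([\\w\\-]+)");
        // The pattern that reportedly works: protocol and host made optional
        Pattern lenient = Pattern.compile("^(http://[^/]+)?/([\\w\\-]+)");

        String absolute = "http://my.site.com/page"; // hypothetical URL
        String relative = "/page";

        System.out.println(strict.matcher(absolute).find());  // true
        System.out.println(strict.matcher(relative).find());  // false
        System.out.println(lenient.matcher(absolute).find()); // true
        System.out.println(lenient.matcher(relative).find()); // true
    }
}
```

The strict pattern only matches when the string actually starts with a protocol and host, so if the filter ever saw a path-only string, only the lenient pattern would match — consistent with the behaviour described above.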
> 
> On 21/12/2011, at 8:04 AM, Markus Jelsma wrote:
> >> Thanks, I was aware of these precedence rules but strayed a bit from
> >> them as I tweaked to try and get the results I wanted.
> >> 
> >> What really helped was realising that URLs are not resolved into
> >> absolute links before they are tested so patterns need to match however
> >> they appear in parsed content. The hadoop.log file only displays
> >> absolute URLs which can be misleading.
> > 
> > As far as I know, all URLs are resolved long before ever being passed to
> > any filter. The parser is responsible for resolving relative to
> > absolute.
> > 
> >> Second, this command line test for URL filtering saves a load of time
> >> and effort when tuning rules.
> >> 
> >> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> >> 
> >> Now enter a test URL and hit Enter. StdOut will show whether the URL
> >> passes or fails current checks by displaying a plus or minus.
> >> 
> >>> # Each non-comment, non-blank line contains a regular expression
> >>> # prefixed by '+' or '-'. The first matching pattern in the file
> >>> # determines whether a URL is included or ignored. If no pattern
> >>> # matches, the URL is ignored.
> >>> 
> >>> http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
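The precedence rule quoted above — the first matching pattern in the file decides — can be sketched roughly like this (illustrative only, not Nutch's actual RegexURLFilter implementation; rule syntax and site are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FirstMatchFilter {
    // Rules in file order: pattern -> accept (true for '+', false for '-')
    private final Map<Pattern, Boolean> rules = new LinkedHashMap<>();

    void addRule(char sign, String regex) {
        rules.put(Pattern.compile(regex), sign == '+');
    }

    // Returns the URL if accepted, null if rejected
    String filter(String url) {
        for (Map.Entry<Pattern, Boolean> rule : rules.entrySet()) {
            if (rule.getKey().matcher(url).find()) {
                return rule.getValue() ? url : null; // first match decides
            }
        }
        return null; // no pattern matched: URL is ignored
    }

    public static void main(String[] args) {
        FirstMatchFilter f = new FirstMatchFilter();
        f.addRule('-', "/example/");
        f.addRule('+', ".");
        System.out.println(f.filter("http://my.site.com/example/a")); // rejected
        System.out.println(f.filter("http://my.site.com/other/a"));   // accepted
    }
}
```

Note how the `-/example/` rule only takes effect because it comes before the catch-all `+.`; with the order reversed, everything would be accepted.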
> >>> 
> >>>> Hi,
> >>>> 
> >>>> I'm crawling a single web site and am going round in circles
> >>>> specifying the correct type and order of regex expressions in
> >>>> regex-urlfilter.txt to produce the following results:
> >>>> 
> >>>> * Crawl no single-level paths on the site other than the exceptions
> >>>> specified
> >>>> * Crawl two or more level paths other than those under top level
> >>>> paths I've excluded
> >>>> 
> >>>> 
> >>>> I have the following directives in regex-urlfilter.txt:
> >>>> 
> >>>> 
> >>>> # Accept anything
> >>>> +.
> >>>> 
> >>>> # Exclude URLs under these top level paths
> >>>> -.*/example/.*
> >>>> 
> >>>> # Exclude pages located immediately under root
> >>>> -^(http://)([^/]+/)([a-z]+)$
> >>>> 
> >>>> #Allow exception URL located under root
> >>>> +http://my.site.com/exception
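Given the first-match rule quoted earlier in the thread, one plausible reason these directives fail is that `+.` appears first and accepts every URL before any exclusion is consulted. A reordered sketch (untested, same hypothetical site, with dots escaped in the literal URL):

```
# Allow the exception URL located under root (before the exclusions)
+^http://my\.site\.com/exception

# Exclude URLs under these top level paths
-/example/

# Exclude pages located immediately under root
-^http://[^/]+/[a-z]+$

# Accept anything else
+.
```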
> >>>> 
> >>>> 
> >>>> I can't get it to work. Variations are either too restrictive or
> >>>> ignore the first level exclusion. I've tested the expressions
> >>>> elsewhere and they match as required. Can anyone point me in the
> >>>> right direction here please.
> >>>> 
> >>>> Thanks,
> >>>> Matt
> 
> .headfirst
> WEB DEVELOPERS .ENGAGING .USEFUL .WORKS
> web:www.headfirst.co.nz
> email:[email protected]
> phone:(04) 498 5737
> mobile:022 384 3874
