>> As far as I know, all URLs are resolved long before ever being passed to any 
>> filter. The parser is responsible for resolving relative to absolute.

Well, my rules with explicit patterns for absolute URLs, including the 
protocol and domain, failed until I made the protocol and domain optional.

Doesn't work...
-^(http://[^/]+)/([\w\-]+)

Works...
-^(http://[^/]+)?/([\w\-]+)
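
For what it's worth, the difference between the two patterns is easy to reproduce outside Nutch. Nutch's regex filter uses java.util.regex, but the match semantics here are the same in any similar engine; below is a quick Python sketch (my.site.com/page is a made-up URL, not from any real config):

```python
import re

# The pattern that failed: the protocol/host group is mandatory.
strict = re.compile(r"^(http://[^/]+)/([\w\-]+)")

# The pattern that worked: the protocol/host group is optional.
lenient = re.compile(r"^(http://[^/]+)?/([\w\-]+)")

absolute = "http://my.site.com/page"   # hypothetical absolute URL
relative = "/page"                     # same path, server-relative

# Both patterns match the absolute form...
print(bool(strict.match(absolute)), bool(lenient.match(absolute)))  # True True

# ...but only the optional-group pattern also matches the bare path.
print(bool(strict.match(relative)), bool(lenient.match(relative)))  # False True
```

In other words, the optional group makes the rule match both absolute URLs and server-relative paths, whichever form the filter actually sees.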




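Also, since rules are evaluated first-match-wins, the ordering in the original post can't work: the `+.` catch-all comes first, so every URL is accepted before the exclusions are ever reached. A sketch of a reordering that matches the stated goals, reusing the example patterns from the thread (with the optional host group applied):

```
# Allow this exception URL located under root
+^http://my.site.com/exception

# Exclude URLs under these top-level paths
-.*/example/.*

# Exclude pages located immediately under root
-^(http://[^/]+)?/[a-z]+$

# Accept anything else
+.
```
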
On 21/12/2011, at 8:04 AM, Markus Jelsma wrote:

> 
>> Thanks, I was aware of these precedence rules but strayed a bit from them
>> as I tweaked to try and get the results I wanted.
>> 
>> What really helped was realising that URLs are not resolved into absolute
>> links before they are tested, so patterns need to match however they appear
>> in parsed content. The hadoop.log file only displays absolute URLs, which
>> can be misleading.
> 
> As far as I know, all URLs are resolved long before ever being passed to any 
> filter. The parser is responsible for resolving relative to absolute.
> 
>> 
>> Second, this command line test for URL filtering saves a load of time and
>> effort when tuning rules.
>> 
>> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
>> 
>> Now enter a test URL and hit Enter. StdOut will show whether the URL passes
>> or fails current checks by displaying a plus or minus.
>> 
>>> # Each non-comment, non-blank line contains a regular expression
>>> # prefixed by '+' or '-'. The first matching pattern in the file
>>> # determines whether a URL is included or ignored. If no pattern
>>> # matches, the URL is ignored.
>>> 
>>> 
>>> 
>>> http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
>>> 
>>>> Hi,
>>>> 
>>>> I'm crawling a single web site and am going round in circles specifying
>>>> the correct type and order of regex expressions in regex-urlfilter.txt
>>>> to produce the following results:
>>>> 
>>>> * Crawl no single level paths on the site other than the exceptions
>>>> specified * Crawl two or more level paths other than those under top
>>>> level paths I've excluded
>>>> 
>>>> 
>>>> I have the following directives in regex-urlfilter.txt:
>>>> 
>>>> 
>>>> # Accept anything
>>>> +.
>>>> 
>>>> # Exclude URLs under these top level paths
>>>> -.*/example/.*
>>>> 
>>>> # Exclude pages located immediately under root
>>>> -^(http://)([^/]+/)([a-z]+)$
>>>> 
>>>> #Allow exception URL located under root
>>>> +http://my.site.com/exception
>>>> 
>>>> 
>>>> I can't get it to work. Variations are either too restrictive or ignore
>>>> the first-level exclusion. I've tested the expressions elsewhere and
>>>> they match as required. Can anyone point me in the right direction here,
>>>> please?
>>>> 
>>>> Thanks,
>>>> Matt


.headfirst
WEB DEVELOPERS .ENGAGING .USEFUL .WORKS
web:www.headfirst.co.nz
email:[email protected]
phone:(04) 498 5737
mobile:022 384 3874
