Thanks, I was aware of these precedence rules but strayed a bit from them as I tweaked to try and get the results I wanted.
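For the record, the shape of ruleset that satisfies my original requirements
looks roughly like the sketch below. It is untested as written (host and
paths taken from my original mail, quoted further down), and it assumes URLs
reach the filter in absolute form, which is exactly what the checker
mentioned below is good for verifying. The key point is that the first
matching pattern wins, so the exception must come before the exclusions, and
the broad accept comes last:

    # Allow the exception URL located under root
    +^http://my\.site\.com/exception$

    # Exclude URLs under this top-level path
    -^http://my\.site\.com/example/

    # Exclude any other page located immediately under root
    -^http://my\.site\.com/[^/]+/?$

    # Accept everything else on the site
    +^http://my\.site\.com/

    # Reject anything else
    -.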
What really helped was realising that URLs are not resolved into absolute
links before they are tested, so patterns need to match URLs exactly as they
appear in the parsed content (a link that appears as /about is tested in
that form, not as http://my.site.com/about). The hadoop.log file only
displays absolute URLs, which can be misleading.

Second, this command-line test for URL filtering saves a load of time and
effort when tuning rules:

    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Enter a test URL and hit Enter. Stdout shows whether the URL passes or fails
the current checks by printing it prefixed with a plus or a minus.

> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
>
>> Hi,
>>
>> I'm crawling a single web site and am going round in circles specifying
>> the correct type and order of regex expressions in regex-urlfilter.txt to
>> produce the following results:
>>
>> * Crawl no single-level paths on the site other than the exceptions
>>   specified
>> * Crawl two-or-more-level paths other than those under the top-level
>>   paths I've excluded
>>
>> I have the following directives in regex-urlfilter.txt:
>>
>> # Accept anything
>> +.
>>
>> # Exclude URLs under these top level paths
>> -.*/example/.*
>>
>> # Exclude pages located immediately under root
>> -^(http://)([^/]+/)([a-z]+)$
>>
>> # Allow exception URL located under root
>> +http://my.site.com/exception
>>
>> I can't get it to work. Variations are either too restrictive or ignore
>> the first-level exclusion. I've tested the expressions elsewhere and they
>> match as required. Can anyone point me in the right direction here,
>> please?
>>
>> Thanks,
>> Matt
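To make the ordering problem concrete: in the rules quoted above, +. is the
first pattern, so every URL matches it and is accepted before any of the
exclusions are consulted. A checker session against those rules would look
something like this (output format from memory, so treat it as approximate;
the checker echoes each URL prefixed with + or -):

    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    http://my.site.com/example/deep/page
    +http://my.site.com/example/deep/page
    http://my.site.com/onelevel
    +http://my.site.com/onelevel

Both URLs pass even though later rules are meant to exclude them, which is
why moving +. to the end (or replacing it with a site-specific accept, as in
the sketch above) fixes the behaviour.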

