Re: Correct syntax for regex-urlfilter.txt - trying to exclude single path results

Markus Jelsma Mon, 19 Dec 2011 23:09:46 -0800


# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.



http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
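
To see what the first-match rule means for the directives quoted below, here is a rough Python sketch (not Nutch code). The helpers parse_rules and accepts are hypothetical names made up for this illustration, and the sketch assumes patterns are applied unanchored (find-style) unless they carry their own ^ or $. It suggests that a leading "+." accepts every URL before any "-" rule is consulted, and that moving the exception first and the catch-all last gives the intended result.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first matching rule decides; if nothing matches, the URL is ignored."""
    for include, pattern in rules:
        if pattern.search(url):  # assumption: unanchored, find-style matching
            return include
    return False

# The directives in the order given in the quoted message.
ORIGINAL = """
# Accept anything
+.
# Exclude URLs under these top level paths
-.*/example/.*
# Exclude pages located immediately under root
-^(http://)([^/]+/)([a-z]+)$
# Allow exception URL located under root
+http://my.site.com/exception
"""

# Same directives, reordered: exception first, catch-all last.
REORDERED = """
# Allow exception URL located under root
+http://my.site.com/exception
# Exclude URLs under these top level paths
-.*/example/.*
# Exclude pages located immediately under root
-^(http://)([^/]+/)([a-z]+)$
# Accept anything else
+.
"""

if __name__ == "__main__":
    urls = [
        "http://my.site.com/exception",     # exception under root
        "http://my.site.com/about",         # single level path
        "http://my.site.com/example/page",  # under an excluded top level path
        "http://my.site.com/docs/intro",    # two level path
    ]
    for name, text in (("original", ORIGINAL), ("reordered", REORDERED)):
        rules = parse_rules(text)
        for url in urls:
            print(name, url, "accepted" if accepts(rules, url) else "ignored")
```

Note that the exception pattern is itself unanchored in this sketch, so it would also match longer URLs that contain it; adding ^ and $ would tighten it if that matters for your site.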

> Hi,
> 
> I'm crawling a single web site and am going round in circles specifying the
> correct type and order of regex expressions in regex-urlfilter.txt to
> produce the following results:
> 
>  * Crawl no single level paths on the site other than the exceptions
>    specified
>  * Crawl two or more level paths other than those under top level
>    paths I've excluded
> 
> 
> I have the following directives in regex-urlfilter.txt:
> 
> 
> # Accept anything
> +.
> 
> # Exclude URLs under these top level paths
> -.*/example/.*
> 
> # Exclude pages located immediately under root
> -^(http://)([^/]+/)([a-z]+)$
> 
> #Allow exception URL located under root
> +http://my.site.com/exception
> 
> 
> I can't get it to work. Variations are either too restrictive or ignore the
> first level exclusion. I've tested the expressions elsewhere and they
> match as required. Can anyone point me in the right direction here please.
> 
> Thanks,
> Matt
