# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
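
To illustrate the first-match rule described in that template comment, here is a rough Python sketch, not Nutch code. It assumes each pattern is tested with a partial (search-style) match, and the helper name first_match_decision and the sample URLs are made up for the example. With the directives in the order quoted below, "+." matches every URL first, so the later exclusions are never reached. The second list is one possible reordering that I have not run against an actual crawl.

```python
import re

def first_match_decision(rules, url):
    """Return True if the first matching rule starts with '+', else False.

    Hypothetical helper: mimics the "first matching pattern wins" behaviour
    described in the template comment, assuming partial (search) matching.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched: the URL is ignored

# Directives in the order from the original message.
original_order = [
    "+.",
    "-.*/example/.*",
    "-^(http://)([^/]+/)([a-z]+)$",
    "+http://my.site.com/exception",
]

# One possible reordering: exception first, catch-all accept last.
reordered = [
    "+http://my.site.com/exception",
    "-.*/example/.*",
    "-^(http://)([^/]+/)([a-z]+)$",
    "+.",
]

# Sample URLs invented for illustration.
urls = [
    "http://my.site.com/exception",
    "http://my.site.com/about",
    "http://my.site.com/example/page",
    "http://my.site.com/docs/page",
]

for url in urls:
    print(url,
          first_match_decision(original_order, url),
          first_match_decision(reordered, url))
```

In this sketch the original order accepts all four URLs, while the reordered list accepts only the exception and the two-level path outside /example/.
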
> Hi,
>
> I'm crawling a single web site and am going round in circles specifying the
> correct type and order of regex expressions in regex-urlfilter.txt to
> produce the following results:
>
> * Crawl no single level paths on the site other than the exceptions
>   specified
> * Crawl two or more level paths other than those under top level
>   paths I've excluded
>
>
> I have the following directives in regex-urlfilter.txt:
>
>
> # Accept anything
> +.
>
> # Exclude URLs under these top level paths
> -.*/example/.*
>
> # Exclude pages located immediately under root
> -^(http://)([^/]+/)([a-z]+)$
>
> #Allow exception URL located under root
> +http://my.site.com/exception
>
>
> I can't get it to work. Variations are either too restrictive or ignore the
> first level exclusion. I've tested the expressions elsewhere and they
> match as required. Can anyone point me in the right direction here please.
>
> Thanks,
> Matt