Re: regex-urlfilter.txt is ignored

Andrzej Bialecki Tue, 02 Nov 2010 03:27:44 -0700

On 2010-11-02 10:11, Erlend Garåsen wrote:
> On 27.10.10 11.21, Markus Jelsma wrote:
>> That depends on your urlfilter.regex.file configuration setting. It
>> defaults to
>> regex-urlfilter.txt in shipped releases.
> 
> Since it defaults to regex-urlfilter.txt, I removed "urlfilter-regex"
> from "plugin.includes", so now it is just:
> <value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
> 
> 
> But same problem. My settings in regex-urlfilter.txt are ignored. Since
> I have the following line in regex-urlfilter.txt (and in
> crawl-urlfilter.txt as well, just to be sure that this file is not read
> instead):
> -^http://www.arena.uio.no/events/*


This rule probably does not do what you intend. In a regex, the * char
repeats the element that precedes it, so "/*" means "zero or more
slashes". To match anything under events/, use ".*" instead:

-^http://www.arena.uio.no/events/.*
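
To check what a quantifier such as * actually applies to, it helps to
try the pattern outside Nutch. A minimal sketch in Python, used here
only as a stand-in for Java regex (the two behave the same for patterns
this simple); the test URL and variable names are made up for
illustration:

```python
import re

# "/*" repeats the preceding "/" zero or more times.
slashes = re.compile(r"^http://www.arena.uio.no/events/*")
# ".*" repeats "." (any character) zero or more times.
anything = re.compile(r"^http://www.arena.uio.no/events/.*")

# Hypothetical URL with extra slashes, chosen to make the difference visible.
url = "http://www.arena.uio.no/events///2010/seminar.html"

# group(0) is the part of the URL that the pattern consumed.
matched_by_slashes = slashes.search(url).group(0)
matched_by_anything = anything.search(url).group(0)

print(matched_by_slashes)
print(matched_by_anything)
```

The first pattern stops after the run of slashes, the second consumes
the rest of the URL.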

> 
> Why does Nutch crawl the following URL?
> http://www.arena.uio.no/events/

Likely because of the above. Also, rules are processed sequentially,
in file order; they are NOT combined with AND or OR. The first rule
that matches an input URL decides the action (accept or reject), and
all rules that follow it are ignored.
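
The sequential, first-match-wins behaviour can be sketched as follows.
This is a rough illustration in Python, not the Nutch code: the names
load_rules and accept are hypothetical, and it assumes that each rule
line is "+regex" or "-regex", that the pattern may match anywhere in
the URL, and that a URL matching no rule is rejected.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accept(url, rules):
    """Return the action of the first matching rule (assumed: reject if none)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow  # later rules are never consulted
    return False

# Same two rules, different order.
REJECT_FIRST = """
-^http://www.arena.uio.no/events/.*
+.
"""
ACCEPT_FIRST = """
+.
-^http://www.arena.uio.no/events/.*
"""

url = "http://www.arena.uio.no/events/"
print(accept(url, load_rules(REJECT_FIRST)))
print(accept(url, load_rules(ACCEPT_FIRST)))
```

With the reject rule first the URL is filtered out; with "+." first it
is accepted, because the reject rule below it is never reached.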

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
