On 2010-11-02 10:11, Erlend Garåsen wrote:
> On 27.10.10 11.21, Markus Jelsma wrote:
>> That depends on your urlfilter.regex.file configuration setting. It
>> defaults to regex-urlfilter.txt in shipped releases.
>
> Since it defaults to regex-urlfilter.txt, I removed "urlfilter-regex"
> from "plugin.includes", so now it is just:
> <value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
>
> But same problem. My settings in regex-urlfilter.txt are ignored. Since
> I have the following line in regex-urlfilter.txt (and in
> crawl-urlfilter.txt as well, just to be sure that this file is not read
> instead):
> -^http://www.arena.uio.no/events/*

This is not a valid regex rule. The * char should be preceded by a
sequence of chars to be repeated. For example, this is a valid regex rule:

-^http://www.arena.uio.no/events/.*

> Why does Nutch crawl the following URL?
> http://www.arena.uio.no/events/

Likely because of the above. Also, rules are processed sequentially; they
do NOT form an AND or OR. If a rule matches an input URL, the action is
performed (accept or reject) and all rules that follow it are ignored.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
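
P.S. To illustrate the point about sequential rule processing, here is a rough sketch in Python. It is not Nutch's actual code (Nutch is written in Java); parse_rules and accepts are hypothetical names made up for this example. It assumes each rule line is "+regex" or "-regex", that a rule applies when the regex is found anywhere in the URL, and that a URL matching no rule at all is rejected.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the action of the FIRST matching rule; later rules are ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    # Assumption for this sketch: a URL that matches no rule is rejected.
    return False


# The reject rule comes first, so it wins for URLs under /events/.
reject_first = parse_rules("-^http://www.arena.uio.no/events/.*\n+.")

# Same rules in the opposite order: the catch-all '+.' matches first,
# so the reject rule below it is never reached.
accept_first = parse_rules("+.\n-^http://www.arena.uio.no/events/.*")
```

With reject_first, accepts() rejects http://www.arena.uio.no/events/ and accepts other pages on the site; with accept_first, the same URL is accepted. So the order of the lines in the filter file matters as much as the patterns themselves.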

