You're right about the incorrect rule; I just copied a line from an existing Ultraseek filter setting.

It still crawls the URL, though. I have tried the following without any luck:
-^http://www.arena.uio.no/events/*.
-^http://www.arena.uio.no/events/.*

I think the latter is correct; at least it should be a proper Perl regexp rule.
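
To check what each variant of the rule actually matches, here is a minimal sketch in Python. It assumes that the filter applies each rule as a "find" from the start of the URL (which Python's re.search with a leading "^" approximates) and that Python and Java treat these simple patterns the same way. The third test URL is made up for illustration.

```python
import re

# Rule bodies discussed in this thread, without the leading "-".
PATTERNS = {
    "original": r"^http://www.arena.uio.no/events/*",
    "star-dot": r"^http://www.arena.uio.no/events/*.",
    "dot-star": r"^http://www.arena.uio.no/events/.*",
}

URLS = [
    "http://www.arena.uio.no/events",
    "http://www.arena.uio.no/events/",
    # Hypothetical deeper URL, not taken from the real site.
    "http://www.arena.uio.no/events/2010/seminar.html",
]


def matches(pattern, url):
    """Return True if the pattern is found in the url ("^" anchors the start)."""
    return re.search(pattern, url) is not None


def match_table():
    """Map each pattern name to a list of booleans, one per entry in URLS."""
    return {name: [matches(p, u) for u in URLS] for name, p in PATTERNS.items()}


if __name__ == "__main__":
    for name, row in match_table().items():
        print(name, row)
```

Under these assumptions all three patterns compile, and all three match http://www.arena.uio.no/events/ with the trailing slash. Only the original "/*" form also matches the URL without the trailing slash, because "/*" means "zero or more slashes".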

Something strange is going on: I have commented out everything in my filter file except the exclusion rule for the URL above, yet all kinds of URLs still end up in my index.

The following command says that the URL should be skipped:
bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
http://www.arena.uio.no/events

Which returns:
-http://www.arena.uio.no/events

The same is true for all the other URLs, since I have commented out the inclusion lines. So why is the URL above indexed, along with the other URLs, when every inclusion line is commented out? Example:
# +^http://www.uio.no/om/finn-fram/parkering/.*

BTW, I'm running the following command:
bin/nutch crawl urls -dir crawl -depth 2 -topN 10

And I always delete the crawl folder first:
rm -fr crawl/

Erlend


On 02.11.10 11.26, Andrzej Bialecki wrote:
On 2010-11-02 10:11, Erlend Garåsen wrote:
On 27.10.10 11.21, Markus Jelsma wrote:
That depends on your urlfilter.regex.file configuration setting. It
defaults to regex-urlfilter.txt in shipped releases.

Since it defaults to regex-urlfilter.txt, I removed "urlfilter-regex"
from "plugin.includes", so now it is just:
<value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>


But same problem. My settings in regex-urlfilter.txt are ignored. Since
I have the following line in regex-urlfilter.txt (and in
crawl-urlfilter.txt as well, just to be sure that this file is not read
instead):
-^http://www.arena.uio.no/events/*

This is not a valid regex rule. The * char should be followed by a
sequence of chars to be repeated. For example this is a valid regex rule:

-^http://www.arena.uio.no/events/*.


Why does Nutch crawl the following URL?
http://www.arena.uio.no/events/

Likely because of the above. Also, rules are processed sequentially,
they do NOT form an AND or OR. If a rule matches an input url, the
action is performed (accept or reject) and all other rules that follow
after it are ignored.
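
The sequential, first-match-wins behaviour described above can be sketched in Python as follows. This is only an illustration, not Nutch's actual code: the function names are made up, the "+^http://www.arena.uio.no/" rule is hypothetical, and rejecting a URL that matches no rule is an assumption (it is consistent with the "-" output of the filter checker earlier in this mail).

```python
import re


def parse_rules(text):
    """Parse lines such as '+regex' or '-regex'; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(rules, url):
    """Apply rules in order; the first rule that matches decides the outcome."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    # Assumption: a URL that matches no rule at all is rejected.
    return False


RULES = parse_rules("""
# +^http://www.uio.no/om/finn-fram/parkering/.*
-^http://www.arena.uio.no/events/.*
+^http://www.arena.uio.no/
""")

# Same two active rules in the opposite order (hypothetical).
REVERSED = parse_rules("""
+^http://www.arena.uio.no/
-^http://www.arena.uio.no/events/.*
""")

if __name__ == "__main__":
    url = "http://www.arena.uio.no/events/"
    print(accept(RULES, url), accept(REVERSED, url))
```

With the exclusion rule first, the events URL is rejected; with the broader inclusion rule first, the same URL is accepted and the exclusion rule is never reached. This is why the order of lines in the filter file matters.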



--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
