You're right about the incorrect rule; I just copied a line from an
existing Ultraseek filter setting.
Nutch still crawls the URL, though. I have tried the following without
any luck:
-^http://www.arena.uio.no/events/*.
-^http://www.arena.uio.no/events/.*
I think the latter is correct; at least it should be a proper Perl
regexp rule.
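For what it's worth, the two patterns can be compared outside Nutch with
a small script. This is only a sketch in Python, assuming the filter
uses unanchored "find" semantics (as re.search does); the helper name
matching_patterns and the third sample URL are made up for illustration.

```python
import re

# The two candidate patterns, without the leading "-" action character.
PATTERNS = {
    "events/*.": r"^http://www.arena.uio.no/events/*.",
    "events/.*": r"^http://www.arena.uio.no/events/.*",
}

def matching_patterns(url):
    """Return the labels of the patterns that match url (search semantics)."""
    return [label for label, rx in PATTERNS.items() if re.search(rx, url)]

# The last URL is a hypothetical page below /events/.
for url in ("http://www.arena.uio.no/events",
            "http://www.arena.uio.no/events/",
            "http://www.arena.uio.no/events/seminar.html"):
    print(url, matching_patterns(url))
```

In this sketch both patterns match the two URLs that contain "events/",
and neither matches the bare http://www.arena.uio.no/events, since each
pattern needs at least one more character after "events".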
Well, something strange is going on: I have commented out everything in
my filter file except the exclusion of the URL above, yet all kinds of
URLs are still included in my index.
The following command says that the url should be skipped:
bin/nutch plugin urlfilter-regex
org.apache.nutch.urlfilter.regex.RegexURLFilter
http://www.arena.uio.no/events
Which returns:
-http://www.arena.uio.no/events
The same is true for all the other URLs, since I have commented out
the inclusion lines. So why is the URL above still included, along with
all the other URLs? Example of a commented-out inclusion line:
# +^http://www.uio.no/om/finn-fram/parkering/.*
BTW, I'm running the following command:
bin/nutch crawl urls -dir crawl -depth 2 -topN 10
And I always delete the crawl folder first:
rm -fr crawl/
Erlend
On 02.11.10 11.26, Andrzej Bialecki wrote:
On 2010-11-02 10:11, Erlend Garåsen wrote:
On 27.10.10 11.21, Markus Jelsma wrote:
That depends on your urlfilter.regex.file configuration setting. It
defaults to
regex-urlfilter.txt in shipped releases.
Since it defaults to regex-urlfilter.txt, I removed "urlfilter-regex"
from "plugin.includes", so now it is just:
<value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
But same problem. My settings in regex-urlfilter.txt are ignored. Since
I have the following line in regex-urlfilter.txt (and in
crawl-urlfilter.txt as well, just to be sure that this file is not read
instead):
-^http://www.arena.uio.no/events/*
This is not a valid regex rule. The * char should be followed by a
sequence of chars to be repeated. For example this is a valid regex rule:
-^http://www.arena.uio.no/events/*.
Why does Nutch crawl the following URL?
http://www.arena.uio.no/events/
Likely because of the above. Also, rules are processed sequentially;
they do NOT form an AND or OR. If a rule matches an input url, the
action is performed (accept or reject) and all other rules that follow
it are ignored.
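The first-match-wins behaviour described above can be sketched roughly
as follows. This is a Python approximation, not Nutch's actual code: the
names load_rules and accept are hypothetical, the final "+." catch-all
line is an assumption added for illustration, and rejecting a URL that
no rule matches is assumed from the plugin output quoted earlier.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accept(url, rules):
    """The first rule whose pattern matches decides; later rules are ignored."""
    for allow, rx in rules:
        if rx.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is rejected

RULES = load_rules("""
# +^http://www.uio.no/om/finn-fram/parkering/.*
-^http://www.arena.uio.no/events/.*
+.
""")

print(accept("http://www.arena.uio.no/events/", RULES))
print(accept("http://www.uio.no/", RULES))
```

Because the exclusion line comes before the catch-all, the events URL is
rejected here even though "+." would also match it; swapping the two
lines would let it through.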
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050