Re: regex-urlfilter.txt is ignored

Erlend Garåsen Tue, 02 Nov 2010 07:44:07 -0700

You're right about the incorrect rule, I just copied a line from an existing Ultraseek filter setting.


Still it crawls the URL. I have tried the following without any luck:
-^http://www.arena.uio.no/events/*.
-^http://www.arena.uio.no/events/.*

I think the latter is correct; at least it should be a proper Perl regexp rule.
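
To make the difference between the patterns in this thread concrete, here is a minimal sketch that tests each one against a few URLs. It uses Python's re module rather than Nutch's Java regex engine (these simple patterns should behave the same in both), and it assumes the filter looks for a match anywhere in the URL, which is why re.search is used. The seminar.html URL is a hypothetical example, not a page taken from the crawl.

```python
import re

# The three patterns tried in this thread, without the leading "-" action marker.
PATTERNS = [
    r"^http://www.arena.uio.no/events/*",
    r"^http://www.arena.uio.no/events/*.",
    r"^http://www.arena.uio.no/events/.*",
]

# Sample URLs; the last one is a hypothetical page below /events/.
URLS = [
    "http://www.arena.uio.no/events",
    "http://www.arena.uio.no/events/",
    "http://www.arena.uio.no/events/seminar.html",
]


def matches(pattern, url):
    """Return True if the pattern is found in the URL (find semantics, assumed)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for pattern in PATTERNS:
        for url in URLS:
            print(f"{pattern!r:45} {url:45} {matches(pattern, url)}")
```

Under these assumptions, all three patterns match http://www.arena.uio.no/events/ and anything below it, but only the first one matches the URL without the trailing slash. Note also that the unescaped dots in the host name match any character, not just a literal dot.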

Well, something strange is going on: I have commented out everything in my filter file except the exclusion of the URL above, and all kinds of URLs are still included in my index.


The following command says that the url should be skipped:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

http://www.arena.uio.no/events

Which returns:
-http://www.arena.uio.no/events

And the same is true for all the other URLs, since I have commented out the inclusion lines. So why is the URL above included, along with the other URLs, when all the inclusion lines are commented out? Example:

# +^http://www.uio.no/om/finn-fram/parkering/.*

BTW, I'm running the following command:
bin/nutch crawl urls -dir crawl -depth 2 -topN 10

And I always delete the crawl folder first:
rm -fr crawl/

Erlend


On 02.11.10 11.26, Andrzej Bialecki wrote:

On 2010-11-02 10:11, Erlend Garåsen wrote:

On 27.10.10 11.21, Markus Jelsma wrote:

That depends on your urlfilter.regex.file configuration setting. It
defaults to regex-urlfilter.txt in shipped releases.


Since it defaults to regex-urlfilter.txt, I removed "urlfilter-regex"
from "plugin.includes", so now it is just:
<value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>


But same problem. My settings in regex-urlfilter.txt are ignored. Since
I have the following line in regex-urlfilter.txt (and in
crawl-urlfilter.txt as well, just to be sure that this file is not read
instead):
-^http://www.arena.uio.no/events/*


This is not a valid regex rule. The * char should be followed by a
sequence of chars to be repeated. For example this is a valid regex rule:

-^http://www.arena.uio.no/events/*.


Why does Nutch crawl the following URL?
http://www.arena.uio.no/events/


Likely because of the above. Also, rules are processed sequentially,
they do NOT form an AND or OR. If a rule matches an input url, the
action is performed (accept or reject) and all other rules that follow
after it are ignored.
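
The sequential, first-match-wins processing described above can be sketched as follows. This is an illustration only, not Nutch's actual code: the function names parse_rules and accepts are hypothetical, the patterns are evaluated with Python's re module instead of Java's, and the sketch assumes that a URL matching no rule at all is rejected.

```python
import re


def parse_rules(text):
    """Parse filter lines of the form '+regex' or '-regex' into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply rules in order; the first rule that matches decides the outcome."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


RULES = parse_rules("""
# exclusion first, then a general inclusion
-^http://www.arena.uio.no/events/.*
+^http://www.uio.no/
""")

if __name__ == "__main__":
    for url in ("http://www.arena.uio.no/events/",
                "http://www.uio.no/om/finn-fram/parkering/"):
        print(url, accepts(RULES, url))
```

Because only the first matching rule counts, an exclusion has to appear before any broader inclusion rule that would also match the same URL; placed after a catch-all rule such as +. it would never be reached.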



--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
