regex-urlfilter.txt is ignored

Erlend Garåsen Mon, 25 Oct 2010 08:43:07 -0700


Hello list,


I'm using Nutch 1.2 on OS X.

Before I start crawling all of the university's web pages (about 1 million), I want to be sure that my settings are correct. I have just figured out that my lines in the regex-urlfilter.txt file are ignored.

The following setting in my nutch-site.xml file should tell Nutch to use regex-urlfilter.txt:


<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
</property>

I have the following start URLs in my urls/urls.txt file:
http://ridder.uio.no
http://www.uio.no/om/finn-fram/parkering/
http://www.uio.no/studier/program/eld-master/
http://www.arena.uio.no/index-nor.xml
http://www.usit.uio.no/web/

And the following in my regex-urlfilter.txt file:
...
+^http://www.uio.no/studier/program/eld-master/*
-^http://www.arena.uio.no/events/*
+^http://www.usit.uio.no/web/*
# deny everything else
-.

But the strange part is that I find the following URLs in my index after the crawler has finished:

http://www.admin.uio.no/prosjekter/nyuioweb/
http://www.arena.uio.no/events/

The first URL is not mentioned in my filter settings at all, whilst the latter has an explicit deny setting. And, yes, I deleted the whole crawl folder before my last crawl attempt.


But when I run the following command and feed it URLs on standard input, everything seems to be ok:

bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

http://www.arena.uio.no/events/
-http://www.arena.uio.no/events/
http://www.admin.uio.no/prosjekter/nyuioweb/
-http://www.admin.uio.no/prosjekter/nyuioweb/
http://www.usit.uio.no/web/
+http://www.usit.uio.no/web/
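
To rule out a mistake in the patterns themselves, the same check can be sketched outside Nutch. This is only a minimal sketch, assuming that the filter applies the rules top to bottom, that the first pattern found anywhere in the URL decides, and that Python's re module treats these simple patterns the same way Java's regex engine does. The check function and the RULES list are hypothetical names for illustration, and the rules hidden behind the "..." in my file are not reproduced:

```python
import re

# Rules copied from the regex-urlfilter.txt excerpt in this mail.
# The lines elided by "..." are intentionally left out.
RULES = [
    ("+", r"^http://www.uio.no/studier/program/eld-master/*"),
    ("-", r"^http://www.arena.uio.no/events/*"),
    ("+", r"^http://www.usit.uio.no/web/*"),
    ("-", r"."),  # deny everything else
]


def check(url, rules=RULES):
    """Return '+' or '-' from the first rule whose pattern is found in url."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign
    # Assumption: a URL that matches no rule at all is rejected.
    return "-"


if __name__ == "__main__":
    for url in (
        "http://www.arena.uio.no/events/",
        "http://www.admin.uio.no/prosjekter/nyuioweb/",
        "http://www.usit.uio.no/web/",
    ):
        # Same layout as the plugin output: sign followed by the URL.
        print(check(url) + url)
```

Under those assumptions the sketch gives the same three verdicts as the plugin run above, so the patterns alone do not seem to explain why the two unwanted URLs end up in the index.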

Have I missed something?

Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
