Hello list,
I'm using Nutch 1.2 on OS X.
Before I start crawling all of the university's web pages (about 1
million), I want to be sure that my settings are correct. I have just
discovered that the rules in my regex-urlfilter.txt file are being ignored.
The following setting in my nutch-site.xml file should tell Nutch to use
regex-urlfilter.txt:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
</property>
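A quick sanity check on this value, assuming Nutch treats plugin.includes as a regular expression that must match a plugin's id in full. The Python sketch below only approximates that check with re.fullmatch (Nutch itself uses Java regular expressions), and the helper name is_included is made up for illustration.

```python
import re

# Value copied from the plugin.includes property above.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|urlfilter-regex|parse-(text|html|tika)"
    r"|index-(basic|more)|query-(basic|site|url|lang)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Hypothetical helper: True if the whole plugin id matches the expression."""
    return re.fullmatch(includes, plugin_id) is not None


# Print the result for a few plugin ids; urlfilter-automaton is not listed above.
for plugin_id in ("urlfilter-regex", "parse-html", "urlfilter-automaton"):
    print(plugin_id, is_included(plugin_id))
```

Under that assumption, urlfilter-regex is covered by this value, so the plugin list itself does not look like the cause.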
I have the following start URLs in my urls/urls.txt file:
http://ridder.uio.no
http://www.uio.no/om/finn-fram/parkering/
http://www.uio.no/studier/program/eld-master/
http://www.arena.uio.no/index-nor.xml
http://www.usit.uio.no/web/
And the following in my regex-urlfilter.txt file:
...
+^http://www.uio.no/studier/program/eld-master/*
-^http://www.arena.uio.no/events/*
+^http://www.usit.uio.no/web/*
# deny everything else
-.
But the strange part is that I find the following URLs in my index after
the crawler has finished:
http://www.admin.uio.no/prosjekter/nyuioweb/
http://www.arena.uio.no/events/
The first URL is not covered by any accept rule in my filter, while the
second is explicitly denied. And, yes, I deleted the whole crawl
folder before my last crawl attempt.
But when I run the following command, everything seems to be ok:
bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
http://www.arena.uio.no/events/
-http://www.arena.uio.no/events/
http://www.admin.uio.no/prosjekter/nyuioweb/
-http://www.admin.uio.no/prosjekter/nyuioweb/
http://www.usit.uio.no/web/
+http://www.usit.uio.no/web/
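The result above can be cross-checked outside Nutch. The sketch below is a rough Python approximation of the rule evaluation, assuming that rules are tried from top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. It uses only the rules visible in this message (the lines elided by "..." are left out), Python's re module stands in for Java's regex engine, and the names parse_rules and accepts are invented for this example.

```python
import re

# Only the rules visible in the message; the elided "..." lines are not reproduced.
RULES_TEXT = """\
+^http://www.uio.no/studier/program/eld-master/*
-^http://www.arena.uio.no/events/*
+^http://www.usit.uio.no/web/*
# deny everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL wins."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is rejected


rules = parse_rules(RULES_TEXT)
for url in (
    "http://www.arena.uio.no/events/",
    "http://www.admin.uio.no/prosjekter/nyuioweb/",
    "http://www.usit.uio.no/web/",
):
    print(("+" if accepts(url, rules) else "-") + url)
```

Note that an unescaped "." matches any character and a trailing "/*" means "zero or more slashes", so these patterns are slightly looser than they look.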
Have I missed something?
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050