Re: regex-urlfilter.txt is ignored

Erlend Garåsen Tue, 02 Nov 2010 02:12:29 -0700

On 27.10.10 11.21, Markus Jelsma wrote:

That depends on your urlfilter.regex.file configuration setting. It defaults to
regex-urlfilter.txt in shipped releases.

Since it defaults to regex-urlfilter.txt, I removed "urlfilter-regex" from "plugin.includes", so now it is just:

<value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
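For comparison, here is a minimal sketch of what that edit changes. It assumes the plugin.includes value is treated as a regular expression that a plugin id must match in full; the helper name is_included is hypothetical, and Python's re module stands in for Nutch's Java code.

```python
import re

# Both values are copied from this thread: the original setting and the edited one.
ORIGINAL = ("protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|"
            "index-(basic|more)|query-(basic|site|url|lang)")
EDITED = ("protocol-httpclient|parse-(text|html|tika)|"
          "index-(basic|more)|query-(basic|site|url|lang)")


def is_included(plugin_id: str, includes: str) -> bool:
    """Hypothetical helper: True if the whole plugin id matches the expression.

    Assumption: ids are compared with a full match, not a substring search.
    """
    return re.fullmatch(includes, plugin_id) is not None


for label, value in (("original", ORIGINAL), ("edited", EDITED)):
    print(label, is_included("urlfilter-regex", value))
```

Under that assumption, taking "urlfilter-regex" out of the value would leave the regex URL filter plugin out of the included set, rather than make it fall back to a default file.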

But the problem remains: my settings in regex-urlfilter.txt are still ignored. I have the following line in regex-urlfilter.txt (and in crawl-urlfilter.txt as well, just to be sure that this file is not read instead):

-^http://www.arena.uio.no/events/*

So why does Nutch crawl the following URL?
http://www.arena.uio.no/events/
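A side note on the pattern itself: in a regular expression, "/*" means "zero or more slashes" rather than "anything below this path", and an unescaped "." matches any character. The sketch below uses Python's re.search as a stand-in for the filter's matching (an assumption, since Nutch is written in Java). The third and fourth URLs are made up for illustration and do not come from this thread.

```python
import re

# Pattern part of the deny rule from this thread, without the leading "-".
RULE = r"^http://www.arena.uio.no/events/*"


def rule_matches(url: str) -> bool:
    """Assumption: the pattern is searched for in the URL; "^" pins it to the start."""
    return re.search(RULE, url) is not None


candidates = [
    "http://www.arena.uio.no/events/",                  # URL from the thread
    "http://www.arena.uio.no/events",                   # no trailing slash
    "http://www.arena.uio.no/events/2010/seminar.xml",  # hypothetical deeper page
    "http://wwwXarena.uio.no/events/",                  # hypothetical, "." matches "X"
    "http://www.arena.uio.no/index-nor.xml",            # start URL from the thread
]

for url in candidates:
    print(rule_matches(url), url)
```

The first four candidates match and the last one does not, so the rule as written does cover http://www.arena.uio.no/events/. Escaping the dots (for example "www\.arena\.uio\.no") would make the host part stricter, but it would not change the result for the URL in question.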

Erlend


On Wednesday 27 October 2010 09:02:51 Jitendra wrote:

Hi,

I may not be right, but I think it uses crawl-urlfilter.txt to define
regexes. Try using that file to define your regex.

Thanks

On Mon, Oct 25, 2010 at 9:13 PM, Erlend Garåsen [via Lucene] wrote:


Hello list,

I'm using Nutch 1.2 on OS X.

Before I start to crawl all the university's web pages (about 1
million), I want to be sure that my settings are correct. Now I just
figured out that my lines in the regex-urlfilter.txt file are ignored.

The following setting in my nutch-site.xml file should tell Nutch to use
regex-urlfilter.txt:

<property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
</property>

I have the following start URLs in my urls/urls.txt file:
http://ridder.uio.no
http://www.uio.no/om/finn-fram/parkering/
http://www.uio.no/studier/program/eld-master/
http://www.arena.uio.no/index-nor.xml
http://www.usit.uio.no/web/

And the following in my regex-urlfilter.txt file:
...
+^http://www.uio.no/studier/program/eld-master/*
-^http://www.arena.uio.no/events/*
+^http://www.usit.uio.no/web/*
# deny everything else
-.

But the strange part is that I find the following URLs in my index after
the crawler has finished:
http://www.admin.uio.no/prosjekter/nyuioweb/
http://www.arena.uio.no/events/

The first URL is not mentioned in my filter settings at all, whilst the
latter has an explicit deny setting. And, yes, I deleted the whole crawl
folder before my last crawl attempt.

But when I run the following command, everything seems to be ok:
bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
http://www.arena.uio.no/events/
-http://www.arena.uio.no/events/
http://www.admin.uio.no/prosjekter/nyuioweb/
-http://www.admin.uio.no/prosjekter/nyuioweb/
http://www.usit.uio.no/web/
+http://www.usit.uio.no/web/
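The plugin output above is what one would expect if the rules are read top to bottom and the first matching rule decides. Here is a minimal sketch of that reading, assuming first-match-wins evaluation and rejection of URLs that no rule matches. Only the rules quoted in this message are included (whatever the "..." stands for is not reproduced), the function name decide is hypothetical, and Python's re.search stands in for the Java matching.

```python
import re

# Rules as quoted in this message, in file order: (sign, pattern).
RULES = [
    ("+", r"^http://www.uio.no/studier/program/eld-master/*"),
    ("-", r"^http://www.arena.uio.no/events/*"),
    ("+", r"^http://www.usit.uio.no/web/*"),
    ("-", r"."),  # deny everything else
]


def decide(url: str) -> str:
    """Return "+" or "-" for a URL; the first rule whose pattern is found wins."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign
    return "-"  # assumption: a URL that no rule matches is rejected


for url in (
    "http://www.arena.uio.no/events/",
    "http://www.admin.uio.no/prosjekter/nyuioweb/",
    "http://www.usit.uio.no/web/",
):
    print(decide(url) + url)
```

Because the sketch agrees with the plugin check, the rules themselves look consistent. That points to the question of which filter file and plugin set the crawl run actually picked up.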

Have I missed something?

Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050






