Hi,

I may be wrong, but I think the crawl uses crawl-urlfilter.txt for its
regexes. Try defining your rules in that file instead.
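
If it helps to sanity-check the rules themselves before moving them, here is a
rough Python sketch (not Nutch code) of how I understand these filter files to
be evaluated: lines are read top to bottom, the first pattern found in the URL
decides, "+" accepts and "-" rejects, and a URL that matches nothing is
rejected. The names parse_rules and check are made up for illustration, and
the "..." lines elided from your message are left out.

```python
import re

# Rules copied from the quoted message; the elided "..." part is not reproduced.
RULES_TEXT = """\
+^http://www.uio.no/studier/program/eld-master/*
-^http://www.arena.uio.no/events/*
+^http://www.usit.uio.no/web/*
# deny everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(url, rules):
    """Return '+url' if accepted, '-url' if rejected; the first match wins."""
    for accept, pattern in rules:
        if pattern.search(url):  # assumption: unanchored search, like Java's find()
            return ("+" if accept else "-") + url
    return "-" + url  # assumption: no matching rule means the URL is rejected


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://www.arena.uio.no/events/",
        "http://www.admin.uio.no/prosjekter/nyuioweb/",
        "http://www.usit.uio.no/web/",
    ):
        print(check(url, RULES))
```

With these rules the sketch gives the same +/- answers as the plugin check
quoted below, so the patterns themselves look fine. That would suggest the
problem is which file gets read during the crawl, not the regexes.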

Thanks

On Mon, Oct 25, 2010 at 9:13 PM, Erlend Garåsen [via Lucene] wrote:

>
> Hello list,
>
> I'm using Nutch 1.2 on OS X.
>
> Before I start to crawl all the university's web pages (about 1
> million), I want to be sure that my settings are correct. Now I just
> figured out that my lines in the regex-urlfilter.txt file are ignored.
>
> The following setting in my nutch-site.xml file should tell Nutch to use
> regex-urlfilter.txt:
>
> <property>
>    <name>plugin.includes</name>
>    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
> </property>
>
> I have the following start URLs in my urls/urls.txt file:
> http://ridder.uio.no
> http://www.uio.no/om/finn-fram/parkering/
> http://www.uio.no/studier/program/eld-master/
> http://www.arena.uio.no/index-nor.xml
> http://www.usit.uio.no/web/
>
> And the following in my regex-urlfilter.txt file:
> ...
> +^http://www.uio.no/studier/program/eld-master/*
> -^http://www.arena.uio.no/events/*
> +^http://www.usit.uio.no/web/*
> # deny everything else
> -.
>
> But the strange part is that I find the following URLs in my index after
> the crawler has finished:
> http://www.admin.uio.no/prosjekter/nyuioweb/
> http://www.arena.uio.no/events/
>
> The first URL is not mentioned in my filter settings at all, whilst the
> latter has an explicit deny setting. And, yes, I deleted the whole crawl
> folder before my last crawl attempt.
>
> But when I run the following command, everything seems to be ok:
> bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
> http://www.arena.uio.no/events/
> -http://www.arena.uio.no/events/
> http://www.admin.uio.no/prosjekter/nyuioweb/
> -http://www.admin.uio.no/prosjekter/nyuioweb/
> http://www.usit.uio.no/web/
> +http://www.usit.uio.no/web/
>
> Have I missed something?
>
> Erlend
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
>
>


-- 
Thanks and regards

Jitendra Singh

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/regex-urlfilter-txt-is-ignored-tp1768031p1778603.html
Sent from the Nutch - User mailing list archive at Nabble.com.
