That depends on your urlfilter.regex.file configuration setting. It defaults to 
regex-urlfilter.txt in shipped releases.
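
If you would rather pin the filter file explicitly than rely on the default, something like the following in nutch-site.xml should do it. Treat this as a minimal sketch: the value shown is only the default named above, and it assumes the file is on the classpath (typically the conf/ directory). It is also worth checking whether the command you use to crawl loads an additional configuration file that overrides this property. I believe the one-step crawl command in the 1.x line reads crawl-tool.xml, which may point it at crawl-urlfilter.txt, but verify that against your own conf/ directory.

```xml
<property>
  <name>urlfilter.regex.file</name>
  <!-- Assumption: regex-urlfilter.txt is on the classpath, e.g. in conf/ -->
  <value>regex-urlfilter.txt</value>
</property>
```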

On Wednesday 27 October 2010 09:02:51 Jitendra wrote:
> Hi,
> 
> I may not be right, but I think it uses crawl-urlfilter.txt to define its
> regexes. Try defining your regexes in that file.
> 
> Thanks
> 
> On Mon, Oct 25, 2010 at 9:13 PM, Erlend Garåsen [via Lucene] wrote:
> 
> > 
> > 
> > Hello list,
> > 
> > I'm using Nutch 1.2 on OS X.
> > 
> > Before I start to crawl all the university's web pages (about 1
> > million), I want to be sure that my settings are correct. Now I just
> > figured out that my lines in the regex-urlfilter.txt file are ignored.
> > 
> > The following setting in my nutch-site.xml file should tell Nutch to use
> > regex-urlfilter.txt:
> > 
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
> > </property>
> > 
> > I have the following start URLs in my urls/urls.txt file:
> > http://ridder.uio.no
> > http://www.uio.no/om/finn-fram/parkering/
> > http://www.uio.no/studier/program/eld-master/
> > http://www.arena.uio.no/index-nor.xml
> > http://www.usit.uio.no/web/
> > 
> > And the following in my regex-urlfilter.txt file:
> > ...
> > +^http://www.uio.no/studier/program/eld-master/*
> > -^http://www.arena.uio.no/events/*
> > +^http://www.usit.uio.no/web/*
> > # deny everything else
> > -.
> > 
> > But the strange part is that I find the following URLs in my index after
> > the crawler has finished:
> > http://www.admin.uio.no/prosjekter/nyuioweb/
> > http://www.arena.uio.no/events/
> > 
> > The first URL is not mentioned in my filter settings at all, whilst the
> > latter has an explicit deny setting. And, yes, I deleted the whole crawl
> > folder before my last crawl attempt.
> > 
> > But when I run the following command, everything seems to be ok:
> > bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
> > http://www.arena.uio.no/events/
> > -http://www.arena.uio.no/events/
> > http://www.admin.uio.no/prosjekter/nyuioweb/
> > -http://www.admin.uio.no/prosjekter/nyuioweb/
> > http://www.usit.uio.no/web/
> > +http://www.usit.uio.no/web/
> > 
> > Have I missed something?
> > 
> > Erlend
> > --
> > Erlend Garåsen
> > Center for Information Technology Services
> > University of Oslo
> > P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> > Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
> > 31050
> > 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
