That depends on your urlfilter.regex.file configuration setting. It defaults to regex-urlfilter.txt in shipped releases.
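For illustration, an explicit override in conf/nutch-site.xml could look like the fragment below. This is only a sketch: the property name is the one mentioned above and the value is the shipped default, so substitute the file you actually edit. Another config file loaded by your command may override the same property. As far as I recall, the crawl command in the 1.x releases reads crawl-tool.xml, which points it at crawl-urlfilter.txt, but please verify that against your own conf directory.

```xml
<!-- Sketch of an explicit override (assumed location: conf/nutch-site.xml). -->
<!-- The value below is just the shipped default; change it to your filter file. -->
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>
```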
On Wednesday 27 October 2010 09:02:51 Jitendra wrote:
> Hi,
>
> I may not be right, but I think it uses crawl-urlfilter.txt to define
> regexes. Try using this file to define your regex.
>
> Thanks
>
> On Mon, Oct 25, 2010 at 9:13 PM, Erlend Garåsen [via Lucene] wrote:
> >
> > Hello list,
> >
> > I'm using Nutch 1.2 on OS X.
> >
> > Before I start to crawl all the university's web pages (about 1
> > million), I want to be sure that my settings are correct. Now I just
> > figured out that my lines in the regex-urlfilter.txt file are ignored.
> >
> > The following setting in my nutch-site.xml file should tell Nutch to use
> > regex-urlfilter.txt:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
> > </property>
> >
> > I have the following start URLs in my urls/urls.txt file:
> > http://ridder.uio.no
> > http://www.uio.no/om/finn-fram/parkering/
> > http://www.uio.no/studier/program/eld-master/
> > http://www.arena.uio.no/index-nor.xml
> > http://www.usit.uio.no/web/
> >
> > And the following in my regex-urlfilter.txt file:
> > ...
> > +^http://www.uio.no/studier/program/eld-master/*
> > -^http://www.arena.uio.no/events/*
> > +^http://www.usit.uio.no/web/*
> > # deny everything else
> > -.
> >
> > But the strange part is that I find the following URLs in my index after
> > the crawler has finished:
> > http://www.admin.uio.no/prosjekter/nyuioweb/
> > http://www.arena.uio.no/events/
> >
> > The first URL is not mentioned in my filter settings at all, whilst the
> > latter has an explicit deny setting. And, yes, I deleted the whole crawl
> > folder before my last crawl attempt.
> >
> > But when I run the following command, everything seems to be ok:
> > bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
> > http://www.arena.uio.no/events/
> > -http://www.arena.uio.no/events/
> > http://www.admin.uio.no/prosjekter/nyuioweb/
> > -http://www.admin.uio.no/prosjekter/nyuioweb/
> > http://www.usit.uio.no/web/
> > +http://www.usit.uio.no/web/
> >
> > Have I missed something?
> >
> > Erlend
> >
> > --
> > Erlend Garåsen
> > Center for Information Technology Services
> > University of Oslo
> > P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> > Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
> >
> > View message @
> > http://lucene.472066.n3.nabble.com/regex-urlfilter-txt-is-ignored-tp1768031p1768031.html

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350

