Hi,

I may be wrong, but I think Nutch uses crawl-urlfilter.txt to define the regexes. Try putting your rules in that file.
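As a rough illustration of how I understand these filter files to be evaluated, here is a minimal Python sketch. It is not Nutch's actual code. It assumes that rules are tried top to bottom, that the first pattern found anywhere in the URL decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. The names `RULES_TEXT`, `parse_rules` and `accepts` are made up for this example, and the rules are the ones shown in the quoted message (the elided "..." part is left out).

```python
import re

# Rules copied from the quoted message; the lines hidden behind "..." are not reproduced.
RULES_TEXT = """\
+^http://www.uio.no/studier/program/eld-master/*
-^http://www.arena.uio.no/events/*
+^http://www.usit.uio.no/web/*
# deny everything else
-.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is rejected


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://www.arena.uio.no/events/",
        "http://www.admin.uio.no/prosjekter/nyuioweb/",
        "http://www.usit.uio.no/web/",
    ):
        print(("+" if accepts(RULES, url) else "-") + url)
```

If the rules behave like this in isolation but the crawl still fetches the rejected URLs, that would suggest the crawl is reading a different filter file rather than that the patterns are wrong.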
Thanks

On Mon, Oct 25, 2010 at 9:13 PM, Erlend Garåsen [via Lucene] wrote:

> Hello list,
>
> I'm using Nutch 1.2 on OS X.
>
> Before I start to crawl all the university's web pages (about 1
> million), I want to be sure that my settings are correct. Now I just
> figured out that my lines in the regex-urlfilter.txt file are ignored.
>
> The following setting in my nutch-site.xml file should tell Nutch to use
> regex-urlfilter.txt:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
> </property>
>
> I have the following start URLs in my urls/urls.txt file:
> http://ridder.uio.no
> http://www.uio.no/om/finn-fram/parkering/
> http://www.uio.no/studier/program/eld-master/
> http://www.arena.uio.no/index-nor.xml
> http://www.usit.uio.no/web/
>
> And the following in my regex-urlfilter.txt file:
> ...
> +^http://www.uio.no/studier/program/eld-master/*
> -^http://www.arena.uio.no/events/*
> +^http://www.usit.uio.no/web/*
> # deny everything else
> -.
>
> But the strange part is that I find the following URLs in my index after
> the crawler has finished:
> http://www.admin.uio.no/prosjekter/nyuioweb/
> http://www.arena.uio.no/events/
>
> The first URL is not mentioned in my filter settings at all, whilst the
> latter has an explicit deny setting. And, yes, I deleted the whole crawl
> folder before my last crawl attempt.
>
> But when I run the following command, everything seems to be ok:
> bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
> http://www.arena.uio.no/events/
> -http://www.arena.uio.no/events/
> http://www.admin.uio.no/prosjekter/nyuioweb/
> -http://www.admin.uio.no/prosjekter/nyuioweb/
> http://www.usit.uio.no/web/
> +http://www.usit.uio.no/web/
>
> Have I missed something?
>
> Erlend
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

--
Thanks and regards
Jitendra Singh

--
View this message in context: http://lucene.472066.n3.nabble.com/regex-urlfilter-txt-is-ignored-tp1768031p1778603.html
Sent from the Nutch - User mailing list archive at Nabble.com.

