Thanks a lot Julien. I figured it out with your help.

Cya
Tom

-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: Wednesday, 19 May 2010 20:34
To: [email protected]
Subject: Re: Regex urlfilter

> But I still have a question. I have to rebuild the .job file with ant,
> right?

You could of course modify the job file directly - it's just a jar with a
fancy name.

> The rebuilding only uses the regex-urlfilter.txt in /conf and
> doesn't use the NUTCH_CONF_DIR variable to get the conf dir.

No, it does use conf.dir. Have a look at the JOB task in build.xml; it
contains the line

  <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>

Basically, most of the content of conf.dir (i.e. conf/ by default) is put
into the job file, not only the regex-urlfilter.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Wednesday, 19 May 2010 17:24
> To: [email protected]
> Subject: Re: Regex urlfilter
>
> Tom,
>
> You can test your filters using
>
>   ./nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
>
> then enter a URL to check whether it is filtered or not.
>
> If you are in a distributed environment, the filters in the conf dir of
> your master are not used: you need to regenerate the job file, as that is
> what the slaves use.
>
> HTH
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> On 19 May 2010 16:05, Tom Landvoigt <[email protected]> wrote:
>
> > Hi,
> >
> > I have a little problem.
> >
> > In my crawldb are URLs like
> > http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html but I
> > don't want to crawl them.
> >
> > So I put a line in my regex-urlfilter.txt:
> >
> >   -^http://blog2.de/fotos/tags/
> >
> > But when I generate a segment the URL is still in it. Can someone help
> > me with this?
> >
> > Thanks a lot
> >
> > ---------------------
> > Tom Landvoigt
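[Editor's note] The filter semantics discussed in the thread can be sketched as follows. This is a hypothetical re-implementation for illustration, not Nutch's actual code: Nutch's RegexURLFilter tries the rules of regex-urlfilter.txt in order and the first matching rule wins, so an exclusion rule like Tom's must appear before any catch-all `+.` rule.

```python
import re

def load_rules(lines):
    """Parse regex-urlfilter.txt style lines: '+' accepts, '-' rejects,
    '#' starts a comment. Hypothetical sketch of RegexURLFilter behaviour."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        sign, pattern = line[0], line[1:]
        rules.append((sign == '+', re.compile(pattern)))
    return rules

def filter_url(url, rules):
    """Return the URL if accepted, None if rejected; first matching rule wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None  # a URL matching no rule at all is rejected

rules = load_rules([
    '-^http://blog2.de/fotos/tags/',  # Tom's exclusion rule, before the catch-all
    '+.',                             # accept everything else
])
print(filter_url('http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html', rules))
print(filter_url('http://blog2.de/index.html', rules))
```

With this ordering the photo URL is rejected and other URLs pass; with the `+.` rule first, the exclusion would never be reached.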
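[Editor's note] Julien's remark that the .job file is "just a jar with a fancy name" means it can be inspected or patched like any zip archive. A minimal sketch with mock paths (not a real Nutch checkout) mimicking what the JOB ant task's zipfileset does, i.e. placing the contents of conf/ at the archive root:

```shell
# Build a mock conf dir with one filter file (hypothetical demo paths).
mkdir -p demo/conf
echo '-^http://blog2.de/fotos/tags/' > demo/conf/regex-urlfilter.txt
# Zip the conf contents into the archive root, as the ant zipfileset would.
(cd demo/conf && python3 -m zipfile -c ../mock-nutch.job regex-urlfilter.txt)
# List the archive; a real .job file can be inspected the same way.
python3 -m zipfile -l demo/mock-nutch.job
```

On a real job file, editing the extracted regex-urlfilter.txt and re-zipping achieves the same effect as rebuilding with ant, as long as nothing else in conf/ changed.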

