Hi,

I am having the exact same problem. Could you share what changes you made?
By the way, I have been putting my filtering rules into crawl-urlfilter.txt.
Could someone explain the difference between the two files and what decides
which one Nutch actually uses?

Best regards,
Magnus

On Wed, May 19, 2010 at 6:40 PM, Tom Landvoigt <[email protected]> wrote:
> Thanks a lot Julien.
>
> I figured it out with your help
>
> Cya
>
> Tom
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Wednesday, 19 May 2010 20:34
> To: [email protected]
> Subject: Re: Regex urlfilter
>
>> But I still have a question. I have to rebuild the .job file with ant,
>> right?
>
> You could of course modify the job file directly - it's just a jar with
> a fancy name.
>
>> The rebuilding only uses the regex-urlfilter.txt in /conf and
>> doesn't use the NUTCH_CONF_DIR variable to get the conf dir.
>
> No, it does use conf.dir. Have a look at the task JOB in build.xml, it
> contains the line
>
>   <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
>
> Basically, most of the content of conf.dir (i.e. conf/ by default) is
> put into the job file, not only the regex-urlfilter.
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>> -----Original Message-----
>> From: Julien Nioche [mailto:[email protected]]
>> Sent: Wednesday, 19 May 2010 17:24
>> To: [email protected]
>> Subject: Re: Regex urlfilter
>>
>> Tom,
>>
>> You can test your filters using
>>
>>   ./nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
>>
>> then enter a URL to check whether it is filtered or not.
>>
>> If you are in a distributed environment, the filters in the conf dir of
>> your master are not used: you need to regenerate a job file, as it is
>> what the slaves use.
>>
>> HTH
>>
>> Julien
>>
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>> On 19 May 2010 16:05, Tom Landvoigt <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I have a little problem.
>> >
>> > In my crawldb are urls like
>> > http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html but I
>> > don't want to crawl them.
>> >
>> > So I put a line in my regex-urlfilter.txt:
>> >
>> > -^http://blog2.de/fotos/tags/
>> >
>> > But when I generate a segment the url is still in it. Can someone help
>> > me with this?
>> >
>> > Thanks a lot
>> >
>> > ---------------------
>> >
>> > Tom Landvoigt
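
For anyone reading this thread later, here is a minimal sketch of how a rule
like the one in the original question is evaluated. It assumes the behaviour
of the regex URL filter as I understand it: rules are checked top to bottom,
the first one that matches decides, and a URL that matches no rule is dropped.
This is Python, not Nutch's Java code. The function names are made up for
illustration, http://blog2.de/index.html is a hypothetical URL, and Python and
Java regex syntax differ in places. Treat it as a way to reason about rule
order, not as a replacement for the plugin test command quoted above.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        # Store (include?, compiled pattern) in file order.
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the URL passes: the first matching rule decides."""
    for include, pattern in rules:
        # Partial match anywhere in the URL; '^' anchors to the start.
        if pattern.search(url):
            return include
    # Assumption: a URL that matches no rule is rejected.
    return False


RULES = """
# skip the photo tag pages from the original question
-^http://blog2.de/fotos/tags/
# accept anything else
+.
"""

rules = parse_rules(RULES)
print(accepts(rules, "http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html"))
print(accepts(rules, "http://blog2.de/index.html"))
```

Under these assumptions the order of the lines matters: if a catch-all "+."
line came before the "-" line, the photo URL would be accepted because the
catch-all matches first.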

