RE: Regex urlfilter

Tom Landvoigt Wed, 19 May 2010 11:50:44 -0700

Thanks a lot, Julien.

I figured it out with your help.


Cya

Tom

-----Original Message-----
From: Julien Nioche [mailto:[email protected]] 
Sent: Wednesday, 19 May 2010 20:34
To: [email protected]
Subject: Re: Regex urlfilter

>
> But I still have a question. I have to rebuild the .job file with ant
> right?


You could of course modify the job file directly - it's just a jar with a
fancy name.


> The rebuilding only uses the regex-urlfilter.txt in /conf and
> doesn't use the NUTCH_CONF_DIR variable to get the conf dir.


No, it does use conf.dir. Have a look at the task JOB in build.xml; it
contains the line
    <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
Basically, most of the content of conf.dir (i.e. conf/ by default) is put
into the job file, not only the regex-urlfilter.
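
Since the job file is an ordinary jar (zip) archive, one way to confirm
which regex-urlfilter.txt actually ended up in it is to read the entry
back out. The sketch below is only an illustration: the function name
read_job_entry is made up for this example, and it assumes the filter
file is stored under its usual name somewhere in the archive.

```python
import zipfile


def read_job_entry(job_file, entry_name="regex-urlfilter.txt"):
    """Return {archive path: text} for every entry named entry_name.

    job_file may be a path to the .job file or an open binary file object.
    The helper name and the default entry name are assumptions made for
    this sketch; they are not part of Nutch itself.
    """
    found = {}
    with zipfile.ZipFile(job_file) as archive:
        for info in archive.infolist():
            # Compare only the file name, so the entry is found whether it
            # sits at the archive root or inside a subdirectory.
            if info.filename.rsplit("/", 1)[-1] == entry_name:
                data = archive.read(info)
                found[info.filename] = data.decode("utf-8", errors="replace")
    return found
```

Calling read_job_entry with the path to the rebuilt .job file and
printing the result shows whether the new exclusion rule made it into the
archive that the slaves will use.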

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Wednesday, 19 May 2010 17:24
> To: [email protected]
> Subject: Re: Regex urlfilter
>
> Tom,
>
> You can test your filters using
>     ./nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
> then enter a URL to check whether it is filtered or not.
>
> If you are in a distributed environment, the filters in the conf dir of your
> master are not used: you need to regenerate a job file, as it is what the
> slaves use.
>
> HTH
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> On 19 May 2010 16:05, Tom Landvoigt <[email protected]> wrote:
>
> > Hi,
> >
> >
> >
> > I have a little problem.
> >
> >
> >
> > My crawldb contains URLs like
> > http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html, but I
> > don't want to crawl them.
> >
> >
> >
> > So I put a line in my regex-urlfilter.txt:
> >
> >
> >
> > -^http://blog2.de/fotos/tags/
> >
> >
> >
> > But when I generate a segment, the URL is still in it. Can someone help
> > me with this?
> >
> >
> >
> > Thanks a lot
> >
> >
> >
> > ---------------------
> >
> > Tom Landvoigt
> >
> >
> >
> >
>
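
For reference, the rule from the original question can also be sanity
checked outside Nutch. The sketch below is a rough Python imitation, not
the Nutch implementation: it assumes that rules are tried from top to
bottom, that the first matching pattern decides, that a URL matching no
rule is rejected, and that patterns are searched for anywhere in the URL.
It uses Python's re module rather than Java regular expressions, and the
names load_rules and accepts are invented for the example.

```python
import re


def load_rules(text):
    """Parse filter lines of the form '+regex' or '-regex'."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the URL is kept, False if it is filtered out."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept  # assumption: the first matching rule wins
    return False  # assumption: URLs matching no rule are rejected


RULES = """
# skip the photo tag pages
-^http://blog2.de/fotos/tags/
# accept anything else
+.
"""

rules = load_rules(RULES)
photo_url = "http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html"
print(accepts(rules, photo_url))                     # False: excluded
print(accepts(rules, "http://blog2.de/index.html"))  # True: kept
```

Under these assumptions the rule itself excludes the photo URL, which is
consistent with the answer in the thread: the rule was fine, but the
slaves were still using the old job file.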
