RE: Regex urlfilter

Tom Landvoigt Wed, 19 May 2010 10:33:09 -0700

First of all thanks Julien,

I solved the problem with your help.


But I still have a question. I have to rebuild the .job file with ant,
right? The rebuild only uses the regex-urlfilter.txt in /conf and
does not use the NUTCH_CONF_DIR variable to find the conf dir.

Is that right?

Thanks

Tom

-----Original Message-----
From: Julien Nioche [mailto:[email protected]] 
Sent: Wednesday, 19 May 2010 17:24
To: [email protected]
Subject: Re: Regex urlfilter

Tom,

You can test your filters using
./nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
then enter a URL to check whether it is filtered or not.
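
For a quick sanity check of a rule outside Nutch, the matching logic can be approximated in a few lines of Python. This is only a rough sketch and not Nutch code: it assumes that rules are tried top to bottom, that the first pattern found anywhere in the URL decides (+ keeps the URL, - drops it), and that a URL matching no rule is dropped. The function names load_rules and url_passes are made up for illustration, the final "+." catch-all rule is an assumption about the rest of the file, and Java and Python regex syntax differ in some details.

```python
import re


def load_rules(text):
    """Parse regex-urlfilter-style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def url_passes(url, rules):
    """Return True if the URL is kept, False if it is filtered out.

    Assumption: the first rule whose pattern is found in the URL decides,
    and a URL that matches no rule at all is filtered out.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical rule file: the exclusion from this thread plus an assumed catch-all.
RULES = """
# skip the photo tag pages
-^http://blog2.de/fotos/tags/
# accept anything else (assumed to be the last rule in the file)
+.
"""

rules = load_rules(RULES)
blocked = url_passes(
    "http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html", rules
)
kept = url_passes("http://blog2.de/index.html", rules)
print(blocked)  # False: the exclusion rule matches first
print(kept)     # True: only the catch-all rule matches
```

Under those assumptions the order of the lines matters: if the exclusion were placed after a catch-all accept rule, it would never be reached.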

If you are in a distributed environment, the filters in the conf dir of
your master are not used: you need to regenerate the job file, as that is
what the slaves use.

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 19 May 2010 16:05, Tom Landvoigt <[email protected]> wrote:

> Hi,
>
> I have a little problem.
>
> In my crawldb are urls like
> http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html but I
> don't want to crawl them.
>
> So I put a line in my regex-urlfilter.txt:
>
> -^http://blog2.de/fotos/tags/
>
> But when I generate a segment the url is still in it. Can someone help
> me with this?
>
> Thanks a lot
>
> ---------------------
>
> Tom Landvoigt
