Hi,

I am having exactly the same problem. Could you share what changes
you made?

By the way, I have been putting my filtering rules into
crawl-urlfilter.txt. Could someone explain the difference between the
two files, and what decides which one Nutch actually uses?

best regards,
Magnus

On Wed, May 19, 2010 at 6:40 PM, Tom Landvoigt
<[email protected]> wrote:
> Thanks a lot Julien.
>
> I figured it out with your help
>
> Cya
>
> Tom
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Wednesday, 19 May 2010 20:34
> To: [email protected]
> Subject: Re: Regex urlfilter
>
>>
>> But I still have a question. I have to rebuild the .job file with ant,
>> right?
>
>
> You could of course modify the job file directly - it's just a jar with
> a fancy name.
>
>
>> The rebuild only uses the regex-urlfilter.txt in /conf and
>> doesn't use the NUTCH_CONF_DIR variable to get the conf dir.
>
>
> No, it does use conf.dir. Have a look at the job target in build.xml; it
> contains the line
>    *<zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>*
> Basically, most of the content of conf.dir (i.e. conf/ by default) is put
> into the job file, not only regex-urlfilter.txt.
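For context, the surrounding build.xml target looks roughly like this. This is a paraphrased sketch built around the quoted zipfileset line; the other entries and property names are assumptions that vary across Nutch versions, so check your own build.xml:

```xml
<!-- Sketch of the "job" target in build.xml; only the conf.dir line below
     is taken from the thread, the rest is illustrative. -->
<target name="job" depends="compile">
  <jar jarfile="${build.dir}/${final.name}.job">
    <zipfileset dir="${build.classes}"/>
    <!-- everything in conf/ except templates and hadoop configs -->
    <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
    <zipfileset dir="${lib.dir}" prefix="lib" includes="**/*.jar"/>
  </jar>
</target>
```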
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
>>
>>
>> -----Original Message-----
>> From: Julien Nioche [mailto:[email protected]]
>> Sent: Wednesday, 19 May 2010 17:24
>> To: [email protected]
>> Subject: Re: Regex urlfilter
>>
>> Tom,
>>
>> You can test your filters using
>> *./nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter*
>> and then enter a URL to check whether it is filtered or not.
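If you just want to sanity-check the pattern itself outside Nutch, grep -E gives a rough approximation (Nutch uses Java regexes, but for a simple prefix rule like this one the behaviour is the same):

```shell
# Rough stand-in for the Nutch plugin check: does the filter pattern
# (minus its leading '-') match the URL? A match means the rule rejects it.
pattern='^http://blog2\.de/fotos/tags/'
url='http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html'
if printf '%s\n' "$url" | grep -Eq "$pattern"; then
  echo "excluded"   # a '-' rule matching the URL filters it out
else
  echo "kept"
fi
```

Note that an unescaped `.` matches any character, so escaping the dots in the hostname (as above) is slightly stricter than the rule as written in the thread, though both match this URL.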
>>
>> If you are in a distributed environment, the filters in the conf dir of
>> your master are not used: you need to regenerate the job file, as that
>> is what the slaves use.
>>
>> HTH
>>
>> Julien
>>
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>> On 19 May 2010 16:05, Tom Landvoigt <[email protected]> wrote:
>>
>> > Hi,
>> >
>> >
>> >
>> > I have a little problem.
>> >
>> >
>> >
>> > My crawldb contains URLs like
>> > http://blog2.de/fotos/tags/080807/photo/1150136437/DSC0717.html, but
>> > I don't want to crawl them.
>> >
>> >
>> >
>> > So I put a line in my regex-urlfilter.txt:
>> >
>> >
>> >
>> > -^http://blog2.de/fotos/tags/
>> >
>> >
>> >
>> > But when I generate a segment, the URL is still in it. Can someone
>> > help me with this?
>> >
>> >
>> >
>> > Thanks a lot
>> >
>> >
>> >
>> > ---------------------
>> >
>> > Tom Landvoigt
>> >
>> >
>> >
>> >
>>
>
