Doh!  I really should just read the code of things before posting.

I ran the URLFilterChecker and passed it a URL that the SuffixFilter
should flag, and it still passed.  However, if I change the URL to end in
an extension that is in the default config file, it rejects the URL.

So it looks like the problem is that it's not loading the altered config
file from my conf directory.  Not sure why, since the regex filter correctly
finds its config file.
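
One way to narrow this down, assuming the filter looks its file up as a
classpath resource (an assumption on my part, not something confirmed in
this thread), is to ask the JVM which copy of a given file name it would
actually hand back.  A minimal sketch using only the JDK; the class name
ConfLocator is made up for illustration, and the default file name is just
the one mentioned in this thread, so substitute whatever name your
configuration really points at:

```java
import java.net.URL;

public class ConfLocator {

    /** Returns where the named resource resolves on the classpath, or null if it is not visible. */
    static URL locate(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            // Fall back to the loader that loaded this class.
            cl = ConfLocator.class.getClassLoader();
        }
        return cl.getResource(resourceName);
    }

    public static void main(String[] args) {
        // Assumption: this default is only the file name used in this thread.
        // Pass the names your own configuration uses as arguments instead.
        String[] names = args.length > 0 ? args : new String[] {"urlfilter-suffix.txt"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "not found on classpath" : url));
        }
    }
}
```

Run it with the same classpath the crawler uses.  If the name resolves to
a copy inside a jar or another directory, or does not resolve at all, that
would explain why edits in the conf directory have no effect.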


On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma
<[email protected]> wrote:

> We happily use that filter just as it is shipped with Nutch. Just enabling
> it in plugin.includes works for us. To ease testing you can use the
> bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
>
>
> -----Original message-----
> > From:Bai Shen <[email protected]>
> > Sent: Wed 12-Jun-2013 14:32
> > To: [email protected]
> > Subject: Suffix URLFilter not working
> >
> > I'm dealing with a lot of file types that I don't want to index.  I was
> > originally using the regex filter to exclude them but it was getting out
> > of hand.
> >
> > I changed my plugin includes from
> >
> > urlfilter-regex
> >
> > to
> >
> > urlfilter-(regex|suffix)
> >
> > I've tried both editing the default urlfilter-suffix.txt file to add
> > the extensions I don't want, and writing my own file that starts with +
> > and lists the extensions I do want.
> >
> > Neither of these approaches seems to work.  I continue to get URLs added
> > to the database which contain extensions I don't want.  Even adding a
> > urlfilter.order section to my nutch-site.xml doesn't work.
> >
> > I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
> > suggestions for what else to look at?
> >
> > Thanks.
> >
>
