-----Original message-----
> From:Andy Xue <[email protected]>
> Sent: Wed 06-Jun-2012 05:04
> To: [email protected]
> Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with 
> a URL without filename extension
> 
> Hi all:

Hi,

> 
> Does the "urlfilter-suffix" plug-in prune URLs that do not have a
> filename extension?
> 
> e.g., allow this
>     http://nutch.apache.org/index.html
> but prune this
>     http://nutch.apache.org/
> 
> It seems to happen to me. Dumping the crawldb after injecting gives me an
> empty text file when no URL in the seed list has a filename extension.

I'm not really sure. You can quickly test your URL filters with the
bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined tool.
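
A minimal sketch of such a test, assuming you run it from the Nutch runtime
directory (where bin/nutch lives) and that the checker reads one URL per line
from standard input. The file name urls.txt is just a placeholder:

```shell
# Write the two URLs from the question to a scratch file, one per line.
# (urls.txt is a hypothetical name, use whatever you like.)
printf '%s\n' \
  'http://nutch.apache.org/index.html' \
  'http://nutch.apache.org/' > urls.txt

# Feed them through all configured URL filters combined.
# The guard only skips the call when bin/nutch is not in the current directory.
if [ -x bin/nutch ]; then
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt
fi
```

If I remember correctly, the checker echoes each URL with a leading "+" when
the filters accept it and a leading "-" when they reject it, so you should see
right away whether the URL without an extension is being pruned.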

> 
> The configuration file "suffix-urlfilter.txt" is set to default (i.e.,
> allow all except for the extensions listed):
> # config file for urlfilter-suffix plugin
> 
> # case-insensitive, allow unknown suffixes
> +I
> # uncomment the line below to filter on url path
> #+P
> 
> ### prohibit these
> # pictures
> .gif
> .jpg
> .jpeg
> .bmp
> .png
> and so on.
> 
> I'm working with nutch trunk.
> 
> Thanks for the time and help.
> Andy
> 
