Hi all:

Does the "urlfilter-suffix" plug-in prune URLs that do not have a
filename extension?

e.g., allow this
    http://nutch.apache.org/index.html
but prune this
    http://nutch.apache.org/

That seems to be what happens for me: when no URL in the seed list has a
filename extension, dumping the crawldb after injecting gives me an empty
text file.

The configuration file "suffix-urlfilter.txt" is left at its default (i.e.,
allow everything except the extensions listed):
# config file for urlfilter-suffix plugin

# case-insensitive, allow unknown suffixes
+I
# uncomment the line below to filter on url path
#+P

### prohibit these
# pictures
.gif
.jpg
.jpeg
.bmp
.png
and so on.
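
To make the behavior I expected concrete, here is a minimal sketch of how I
read the comments in the default config: listed suffixes are rejected,
matching is case-insensitive, and anything else (including a URL with no
extension at all) is allowed. This is not the plug-in's actual code; the
class name SuffixFilterSketch and its filter method are made up for
illustration, and I am assuming a plain "ends with" test on the whole URL.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the EXPECTED behavior, not the urlfilter-suffix source.
final class SuffixFilterSketch {

    /**
     * Returns the URL unchanged if it passes the filter,
     * or null if it ends with one of the denied suffixes.
     */
    static String filter(String url, Set<String> deniedSuffixes, boolean ignoreCase) {
        // Normalize case once when case-insensitive matching is requested.
        String candidate = ignoreCase ? url.toLowerCase(Locale.ROOT) : url;
        for (String suffix : deniedSuffixes) {
            String s = ignoreCase ? suffix.toLowerCase(Locale.ROOT) : suffix;
            if (candidate.endsWith(s)) {
                return null; // listed suffix: prune
            }
        }
        return url; // unknown or missing suffix: allow
    }
}
```

Under that reading, both URLs in my example above would pass, and only
something like a (made-up) http://nutch.apache.org/logo.PNG would be pruned.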

I'm working with Nutch trunk.

Thanks for the time and help.
Andy
