Hi all:
Does the "urlfilter-suffix" plug-in prune URLs that do not have a
filename extension?
e.g., allow this
http://nutch.apache.org/index.html
but prune this
http://nutch.apache.org/
That seems to be what happens for me: after injecting, dumping the crawldb
gives me an empty text file when no URL in the seed list has a filename extension.
The configuration file "suffix-urlfilter.txt" is left at its defaults (i.e.,
allow everything except the extensions listed):
# config file for urlfilter-suffix plugin
# case-insensitive, allow unknown suffixes
+I
# uncomment the line below to filter on url path
#+P
### prohibit these
# pictures
.gif
.jpg
.jpeg
.bmp
.png
and so on.
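
To make the behaviour I expect from this configuration concrete, here is a minimal Python sketch. It is not Nutch's actual code; the function name expected_filter, its parameters, and the shortened suffix list are assumptions made up purely for illustration. It only models my reading of the config: case-insensitive matching, prune a URL only when it ends with a listed suffix, and optionally match on the URL path instead of the full URL.

```python
from urllib.parse import urlparse

# Hypothetical subset of the prohibited suffixes from the config above.
PROHIBITED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".bmp", ".png")


def expected_filter(url, suffixes=PROHIBITED_SUFFIXES,
                    ignore_case=True, filter_on_path=False):
    """Return the URL if it should be kept, or None if it should be pruned.

    Models my reading of the default config ("allow unknown suffixes");
    this is an assumption, not the plug-in's real implementation.
    """
    # With the "+P" option I would expect only the path part to be matched.
    target = urlparse(url).path if filter_on_path else url
    if ignore_case:
        target = target.lower()
        suffixes = tuple(s.lower() for s in suffixes)
    # Prune only when a listed suffix matches; anything else passes,
    # including URLs with no filename extension at all.
    if target.endswith(tuple(suffixes)):
        return None
    return url


if __name__ == "__main__":
    for u in ("http://nutch.apache.org/index.html",
              "http://nutch.apache.org/",
              "http://nutch.apache.org/logo.PNG"):
        print(u, "->", "kept" if expected_filter(u) else "pruned")
```

Under this reading, both http://nutch.apache.org/index.html and http://nutch.apache.org/ should survive the filter, which is not what I am seeing.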
I'm working with Nutch trunk.
Thanks for your time and help.
Andy