-----Original message-----
> From: Andy Xue <[email protected]>
> Sent: Wed 06-Jun-2012 05:04
> To: [email protected]
> Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with
> a URL without filename extension
>
> Hi all:

hi

> Does the "urlfilter-suffix" plug-in prune URL which does not have a
> filename extension?
>
> e.g., allow this
> http://nutch.apache.org/index.html
> but prune this
> http://nutch.apache.org/
>
> It seems to happen to me. Dumping crawldb after injecting will give me an
> empty text file when no url in the seed list has a filename extension.

I'm not really sure. You can quickly test your URLFilters with the
bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined tool.

> The configuration file "suffix-urlfilter.txt" is set to default (i.e.,
> allow all except for the extensions listed):
> # config file for urlfilter-suffix plugin
>
> # case-insensitive, allow unknown suffixes
> +I
> # uncomment the line below to filter on url path
> #+P
>
> ### prohibit these
> # pictures
> .gif
> .jpg
> .jpeg
> .bmp
> .png
> and so on.
>
> I'm working with nutch trunk.
>
> Thanks for the time and help.
> Andy
>
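
For what it's worth, here is a rough sketch of how the URLFilterChecker test could be run against the two example URLs from the question. The NUTCH_HOME variable and the check_urls helper are my own assumptions for illustration, not part of Nutch; adjust the path to wherever your runtime lives. As far as I recall, the checker reads URLs from standard input, one per line, and echoes each one back prefixed with "+" (accepted) or "-" (rejected), but please verify that against your trunk build.

```shell
# Assumption: NUTCH_HOME points at a built Nutch runtime directory
# (the one containing bin/nutch). Defaults to the current directory.
NUTCH_HOME="${NUTCH_HOME:-.}"

# Hypothetical helper: feed each argument as one URL per line into
# URLFilterChecker, with all configured URL filters combined.
check_urls() {
  printf '%s\n' "$@" |
    "$NUTCH_HOME/bin/nutch" org.apache.nutch.net.URLFilterChecker -allCombined
}
```

Calling `check_urls http://nutch.apache.org/index.html http://nutch.apache.org/` should then show whether the URL without a filename extension is the one being dropped. If it is, swapping -allCombined for -filterName with the suffix filter's class name would narrow it down to that one plug-in.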

