-----Original message-----
> From:Andy Xue <[email protected]>
> Sent: Wed 06-Jun-2012 11:11
> To: Markus Jelsma <[email protected]>; [email protected]
> Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension
>
> Hi Markus:
hi

> Thanks for the reply and information provided. I did a quick test by:
> 1. adding "urlfilter-suffix" in "plugin.includes" property in "nutch-site.xml"
> 2. running "runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter"
>
> Here is the finding (disclaimer: the test is far from thorough. no guarantee on the correctness, and I did not read the source code. It is more like my guess and speculation). The behaviour of the plug-in looks like:
> Take a line from the configuration file (e.g., ".jpeg"), and use a regular expression to match a URL using something like /\.jpeg$/ . If this pattern is found, the URL is pruned.
>
> This is all fine except that some lines in the configuration file "suffix-urlfilter.txt" are ".au" (listed under heading "audio/video") and ".com" (under heading "executables"). Therefore, it will prune, for instance, the following urls:
> http://www.google.com (will prune all .com web sites)
> http://www.unimelb.edu.au (this is important to me since I am in Australia)
>
> But these are fine (i.e., add slash at the end):
> http://www.google.com/
> http://www.unimelb.edu.au/
>
> My current workaround would be to delete the ".com" and ".au" lines from the configuration file.

Better would be to run the normalizers first, because that will solve the problem. The default normalizers add a trailing slash to hosts when it's missing, which means .au/ is no longer a suffix and is not going to be filtered out.
Cheers

> Regards
> Andy
>
> On 6 June 2012 18:05, Markus Jelsma <[email protected]> wrote:
>
> > -----Original message-----
> > From:Andy Xue <[email protected]>
> > Sent: Wed 06-Jun-2012 05:04
> > To: [email protected]
> > Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension
> >
> > Hi all:
>
> hi
>
> > Does the "urlfilter-suffix" plug-in prune URL which does not have a filename extension?
> >
> > e.g., allow this
> > http://nutch.apache.org/index.html
> > but prune this
> > http://nutch.apache.org/
> >
> > It seems to happen to me. Dumping crawldb after injecting will give me an empty text file when no url in the seed list has a filename extension.
>
> I'm not really sure. You can quickly test your URLFilters with the bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined tool.
>
> > The configuration file "suffix-urlfilter.txt" is set to default (i.e., allow all except for the extensions listed):
> > # config file for urlfilter-suffix plugin
> >
> > # case-insensitive, allow unknown suffixes
> > +I
> > # uncomment the line below to filter on url path
> > #+P
> >
> > ### prohibit these
> > # pictures
> > .gif
> > .jpg
> > .jpeg
> > .bmp
> > .png
> > and so on.
> >
> > I'm working with nutch trunk.
> >
> > Thanks for the time and help.
> > Andy
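
For readers following the thread, here is a rough sketch of the behaviour as described above. It is not the plug-in's source: it only assumes, as Andy guessed, that the filter does a case-insensitive suffix match on the whole URL string, and that the default normalizers add a "/" to a host-only URL, as Markus says. The names `BLOCKED_SUFFIXES`, `is_pruned` and `add_trailing_slash` are made up for illustration, and the suffix list is only a subset of the entries mentioned in the thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical subset of the suffixes mentioned in the thread
# (the real suffix-urlfilter.txt lists many more).
BLOCKED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".bmp", ".png", ".au", ".com")


def is_pruned(url, suffixes=BLOCKED_SUFFIXES):
    """Return True if the whole URL string ends with a blocked suffix.

    Assumption: case-insensitive matching, like the default "+I" setting.
    This mirrors the behaviour observed in the thread, not Nutch's code.
    """
    return url.lower().endswith(tuple(s.lower() for s in suffixes))


def add_trailing_slash(url):
    """Give a host-only URL a "/" path (stand-in for the default normalizers)."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


if __name__ == "__main__":
    for url in ("http://www.google.com",
                "http://www.unimelb.edu.au",
                "http://nutch.apache.org/index.html"):
        normalized = add_trailing_slash(url)
        # Show the filter decision before and after normalization.
        print(url, is_pruned(url), normalized, is_pruned(normalized))
```

Under these assumptions, the host-only URLs are rejected only when the filter sees them before normalization, which is consistent with both Andy's test and the suggestion to run the normalizers first.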

