-----Original message-----
> From:Andy Xue <[email protected]>
> Sent: Wed 06-Jun-2012 11:11
> To: Markus Jelsma <[email protected]>; [email protected]
> Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing 
> with a URL without filename extension
> 
> Hi Markus:

hi

> 
> Thanks for the reply and information provided. I did a quick test by:
> 1. adding "urlfilter-suffix" in "plugin.includes" property in "nutch-site.xml"
> 2. running "runtime/local/bin/nutch 
> org.apache.nutch.net.URLFilterChecker -filterName 
> org.apache.nutch.urlfilter.suffix.SuffixURLFilter"
> 
> Here is the finding (disclaimer: the test is far from thorough. no guarantee 
> on the correctness, and I did not read the source code. It is more like my 
> guess and speculation). The behaviour of the plug-in looks like:
> Take a line from the configuration file (e.g., ".jpeg"), and use a regular 
> expression such as /\.jpeg$/ to match the end of a URL. If the pattern 
> matches, the URL is pruned.
> 
> This is all fine except that some lines in the configuration file 
> "suffix-urlfilter.txt" are ".au" (listed under heading "audio/video") and 
> ".com" (under heading "executables"). Therefore, it will prune, for instance, 
> the following urls:
> http://www.google.com       (will prune all .com web sites)
> http://www.unimelb.edu.au   (this is important to me since I am in 
> Australia)
> 
> But these are fine (i.e., add slash at the end):
> http://www.google.com/
> http://www.unimelb.edu.au/
> 
> My current workaround would be to delete the ".com" and ".au" lines from the 
> configuration file.

It would be better to run the normalizers first, because that solves the 
problem. The default normalizers add a trailing slash to a host when it is 
missing, which means the URL ends in .au/ rather than .au, so it no longer 
matches the suffix and is not filtered out.
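
To illustrate the point, here is a minimal sketch of the behaviour described above. It is not Nutch code: `BLOCKED_SUFFIXES`, `is_pruned` and `add_trailing_slash` are hypothetical names, the plain end-of-string match is only an assumption based on Andy's observations, and the normalizer step is reduced to the trailing-slash rule alone.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical subset of the entries in suffix-urlfilter.txt.
BLOCKED_SUFFIXES = (".jpeg", ".au", ".com")


def is_pruned(url, suffixes=BLOCKED_SUFFIXES):
    """Assumed filter behaviour: case-insensitive match on the end of the whole URL."""
    return url.lower().endswith(suffixes)


def add_trailing_slash(url):
    """Stand-in for the normalizer step: give a host-only URL an explicit '/' path."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


if __name__ == "__main__":
    for url in ("http://www.google.com",
                "http://www.unimelb.edu.au",
                "http://example.org/photo.jpeg"):
        normalized = add_trailing_slash(url)
        print(url, is_pruned(url), "->", normalized, is_pruned(normalized))
```

Under these assumptions the two host-only URLs are pruned before normalization and kept afterwards, while the .jpeg URL is pruned either way, so the ".com" and ".au" lines can stay in the configuration file.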

Cheers

> 
> Regards
> Andy
> 
> 
> On 6 June 2012 18:05, Markus Jelsma <[email protected]> wrote:
> 
> -----Original message-----
> > From:Andy Xue <[email protected]>
> > Sent: Wed 06-Jun-2012 05:04
> > To: [email protected]
> > Subject: Behaviour of "urlfilter-suffix" plug-in when dealing 
> > with a URL without filename extension
> >
> > Hi all:
> 
> hi
> 
> >
> > Does the "urlfilter-suffix" plug-in prune URLs which do not have a
> > filename extension?
> >
> > e.g., allow this
> >     http://nutch.apache.org/index.html
> > but prune this
> >     http://nutch.apache.org/
> >
> > This seems to be happening to me. Dumping the crawldb after injecting
> > gives me an empty text file when no URL in the seed list has a filename
> > extension.
> 
> I'm not really sure. You can quickly test your URLFilters with the bin/nutch 
> org.apache.nutch.net.URLFilterChecker -allCombined tool.
> 
> >
> > The configuration file "suffix-urlfilter.txt" is set to default (i.e.,
> > allow all except for the extensions listed):
> > # config file for urlfilter-suffix plugin
> >
> > # case-insensitive, allow unknown suffixes
> > +I
> > # uncomment the line below to filter on url path
> > #+P
> >
> > ### prohibit these
> > # pictures
> > .gif
> > .jpg
> > .jpeg
> > .bmp
> > .png
> > and so on.
> >
> > I'm working with nutch trunk.
> >
> > Thanks for the time and help.
> > Andy
> >
> 
> 
