Hi Markus:

Thanks for the reply and the information provided. I did a quick test by:
1. adding "urlfilter-suffix" to the "plugin.includes" property in
"nutch-site.xml"
2. running
"runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter"

Here are the findings (disclaimer: the test is far from thorough, there is
no guarantee of correctness, and I did not read the source code, so this is
my guess and speculation). The behaviour of the plug-in looks like this:
take a line from the configuration file (e.g., ".jpeg") and match it against
the end of the URL, using something like the regular expression /\.jpeg$/.
If the pattern matches, the URL is pruned.

This is all fine except that the configuration file "suffix-urlfilter.txt"
contains the lines ".au" (listed under the heading "audio/video") and ".com"
(under the heading "executables"). A URL with no path ends in its host name,
so the filter will prune, for instance, the following URLs:
http://www.google.com     (will prune any bare .com URL like this one)
http://www.unimelb.edu.au  (this is important to me since I am in Australia)

But these are fine (i.e., with a slash added at the end):
http://www.google.com/
http://www.unimelb.edu.au/

My current workaround would be to delete the ".com" and ".au" lines from
the configuration file.
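
To illustrate the guess above, here is a minimal Python sketch of a plain
suffix match on the whole URL string. It is not the plug-in's actual code
(I did not read it); the function name is_pruned and the list PROHIBITED are
made up for the illustration, and the list only contains the suffixes
mentioned in this thread.

```python
# Hypothetical model of the guessed behaviour, NOT the real SuffixURLFilter.
# PROHIBITED holds only the suffixes quoted in this thread.
PROHIBITED = [".gif", ".jpg", ".jpeg", ".bmp", ".png", ".au", ".com"]


def is_pruned(url, suffixes=PROHIBITED):
    """Return True if the whole URL string ends with a prohibited suffix.

    The comparison is case-insensitive, mirroring the "+I" option in the
    default configuration file.
    """
    lowered = url.lower()
    return any(lowered.endswith(suffix) for suffix in suffixes)


if __name__ == "__main__":
    urls = [
        "http://www.google.com",
        "http://www.unimelb.edu.au",
        "http://www.google.com/",
        "http://www.unimelb.edu.au/",
    ]
    # The workaround: drop the two suffixes that collide with domain names.
    workaround = [s for s in PROHIBITED if s not in (".com", ".au")]
    for url in urls:
        before = "pruned" if is_pruned(url) else "kept"
        after = "pruned" if is_pruned(url, workaround) else "kept"
        print(f"{url}: default={before}, workaround={after}")
```

Under this model the trailing slash matters only because it changes the last
characters of the string, which is consistent with what the test showed.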

Regards
Andy


On 6 June 2012 18:05, Markus Jelsma <[email protected]> wrote:

>
> -----Original message-----
> > From:Andy Xue <[email protected]>
> > Sent: Wed 06-Jun-2012 05:04
> > To: [email protected]
> > Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension
> >
> > Hi all:
>
> hi
>
> >
> > Does the "urlfilter-suffix" plug-in prune a URL that does not have a
> > filename extension?
> >
> > e.g., allow this
> >     http://nutch.apache.org/index.html
> > but prune this
> >     http://nutch.apache.org/
> >
> > This seems to be happening to me. Dumping the crawldb after injecting gives
> > me an empty text file when no URL in the seed list has a filename extension.
>
> I'm not really sure. You can quickly test your URLFilters with the
> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined tool.
>
> >
> > The configuration file "suffix-urlfilter.txt" is set to default (i.e.,
> > allow all except for the extensions listed):
> > # config file for urlfilter-suffix plugin
> >
> > # case-insensitive, allow unknown suffixes
> > +I
> > # uncomment the line below to filter on url path
> > #+P
> >
> > ### prohibit these
> > # pictures
> > .gif
> > .jpg
> > .jpeg
> > .bmp
> > .png
> > and so on.
> >
> > I'm working with nutch trunk.
> >
> > Thanks for the time and help.
> > Andy
> >
>
