Re: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension

Sebastian Nagel Tue, 12 Jun 2012 14:33:23 -0700

> My current workaround would be to delete the ".com" and ".au" lines from
> the configuration file.


You could also activate the option +P in suffix-urlfilter.txt:
>>> # uncomment the line below to filter on url path
>>> #+P

The patterns are then applied exclusively to the path of the URL
and not to the host or query (e.g., .../search.cgi?q=google.com).
The overhead for parsing/splitting the URL is acceptable.
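
To illustrate the difference, here is a minimal Java sketch. It is not the
plugin's actual code, only a simplified model of the behaviour described in
this thread. The class name, the method names, the three-entry suffix list
and the example.org URLs are all made up for the illustration.

```java
import java.net.URI;
import java.util.Locale;

// Simplified model only: NOT the real SuffixURLFilter implementation.
class SuffixFilterSketch {

    // Hypothetical stand-ins for a few entries of suffix-urlfilter.txt.
    static final String[] SUFFIXES = { ".com", ".au", ".jpeg" };

    // Whole-URL matching: the suffix is compared against the full URL string,
    // so a bare host name ending in ".com" or ".au" is rejected as well.
    static boolean rejectedOnWholeUrl(String url) {
        return endsWithAnySuffix(url);
    }

    // Path-only matching (the idea behind +P): host and query are ignored.
    static boolean rejectedOnPath(String url) {
        String path = URI.create(url).getPath(); // empty for http://www.google.com
        return endsWithAnySuffix(path == null ? "" : path);
    }

    // Case-insensitive comparison, loosely mimicking the +I option.
    static boolean endsWithAnySuffix(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.google.com",
            "http://www.unimelb.edu.au",
            "http://example.org/search.cgi?q=google.com", // hypothetical host
            "http://example.org/photo.jpeg"               // hypothetical host
        };
        for (String url : urls) {
            System.out.println(url
                + "  whole-URL rejected: " + rejectedOnWholeUrl(url)
                + "  path-only rejected: " + rejectedOnPath(url));
        }
    }
}
```

In this model only the .jpeg URL is rejected by the path-only check, while
the whole-URL check also rejects the two bare host names and the query
example.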

On 06/06/2012 11:10 AM, Andy Xue wrote:
> Hi Markus:
> 
> Thanks for the reply and information provided. I did a quick test by:
> 1. adding "urlfilter-suffix" in "plugin.includes" property in
> "nutch-site.xml"
> 2. running "runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker
> -filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter"
> 
> Here is the finding (disclaimer: the test is far from thorough. no
> guarantee on the correctness, and I did not read the source code. It is
> more like my guess and speculation). The behaviour of the plug-in looks
> like:
> Take a line from the configuration file (e.g., ".jpeg"), and use a regular
> expression to match a URL using something like /\.jpeg$/ . If this pattern
> is found, the URL is pruned.
> 
> This is all fine except that some lines in the configuration file
> "suffix-urlfilter.txt" are ".au" (listed under heading "audio/video") and
> ".com" (under heading "executables"). Therefore, it will prune, for
> instance, the following urls:
> http://www.google.com     (will prune all .com web sites)
> http://www.unimelb.edu.au  (this is important to me since I am in Australia)
> 
> But these are fine (i.e., add slash at the end):
> http://www.google.com/
> http://www.unimelb.edu.au/
> 
> My current workaround would be to delete the ".com" and ".au" lines from
> the configuration file.
> 
> Regards
> Andy
> 
> 
> On 6 June 2012 18:05, Markus Jelsma <[email protected]> wrote:
> 
>>
>> -----Original message-----
>>> From:Andy Xue <[email protected]>
>>> Sent: Wed 06-Jun-2012 05:04
>>> To: [email protected]
>>> Subject: Behaviour of "urlfilter-suffix" plug-in when dealing
>> with a URL without filename extension
>>>
>>> Hi all:
>>
>> hi
>>
>>>
>>> Does the "urlfilter-suffix" plug-in prune URL which does not have a
>>> filename extension?
>>>
>>> e.g., allow this
>>>     http://nutch.apache.org/index.html
>>> but prune this
>>>     http://nutch.apache.org/
>>>
>>> It seems to happen to me. Dumping crawldb after injecting will give me an
>>> empty text file when no url in the seed list has a filename extension.
>>
>> I'm not really sure. You can quickly test your URLFilters with the
>> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined tool.
>>
>>>
>>> The configuration file "suffix-urlfilter.txt" is set to default (i.e.,
>>> allow all except for the extensions listed):
>>> # config file for urlfilter-suffix plugin
>>>
>>> # case-insensitive, allow unknown suffixes
>>> +I
>>> # uncomment the line below to filter on url path
>>> #+P
>>>
>>> ### prohibit these
>>> # pictures
>>> .gif
>>> .jpg
>>> .jpeg
>>> .bmp
>>> .png
>>> and so on.
>>>
>>> I'm working with nutch trunk.
>>>
>>> Thanks for the time and help.
>>> Andy
>>>
>>
>
