> My current workaround would be to delete the ".com" and ".au" lines from
> the configuration file.

You could also activate the option +P in suffix-urlfilter.txt:

>>> # uncomment the line below to filter on url path
>>> #+P

The patterns are then applied exclusively to the path of the URL and not
to the host or query (e.g., .../search.cgi?q=google.com). The overhead for
parsing/splitting the URL is acceptable.

On 06/06/2012 11:10 AM, Andy Xue wrote:
> Hi Markus:
>
> Thanks for the reply and information provided. I did a quick test by:
> 1. adding "urlfilter-suffix" in "plugin.includes" property in
> "nutch-site.xml"
> 2. running "runtime/local/bin/nutch
> org.apache.nutch.net.URLFilterChecker -filterName
> org.apache.nutch.urlfilter.suffix.SuffixURLFilter"
>
> Here is the finding (disclaimer: the test is far from thorough. no
> guarantee on the correctness, and I did not read the source code. It is
> more like my guess and speculation). The behaviour of the plug-in looks
> like:
> Take a line from the configuration file (e.g., ".jpeg"), and use regular
> expression to match a URL using something like /\.jpeg$/ . If this pattern
> is found, the URL is pruned.
>
> This is all fine except that some lines in the configuration file
> "suffix-urlfilter.txt" are ".au" (listed under heading "audio/video") and
> ".com" (under heading "executables"). Therefore, it will prune, for
> instance, the following urls:
> http://www.google.com (will prune all .com web sites)
> http://www.unimelb.edu.au (this is important to me since I am in Australia)
>
> But these are fine (i.e., add slash at the end):
> http://www.google.com/
> http://www.unimelb.edu.au/
>
> My current workaround would be to delete the ".com" and ".au" lines from
> the configuration file.
>
> Regards
> Andy
>
>
> On 6 June 2012 18:05, Markus Jelsma <[email protected]> wrote:
>
>>
>> -----Original message-----
>>> From:Andy Xue <[email protected]>
>>> Sent: Wed 06-Jun-2012 05:04
>>> To: [email protected]
>>> Subject: Behaviour of "urlfilter-suffix" plug-in when dealing
>>> with a URL without filename extension
>>>
>>> Hi all:
>>
>> hi
>>
>>>
>>> Does the "urlfilter-suffix" plug-in prune URL which does not have a
>>> filename extension?
>>>
>>> e.g., allow this
>>> http://nutch.apache.org/index.html
>>> but prune this
>>> http://nutch.apache.org/
>>>
>>> It seems to happen to me. Dumping crawldb after injecting will give me an
>>> empty text file when no url in the seed list has a filename extension.
>>
>> I'm not really sure. You can quickly test your URLFilters with the
>> bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined tool.
>>
>>>
>>> The configuration file "suffix-urlfilter.txt" is set to default (i.e.,
>>> allow all except for the extensions listed):
>>> # config file for urlfilter-suffix plugin
>>>
>>> # case-insensitive, allow unknown suffixes
>>> +I
>>> # uncomment the line below to filter on url path
>>> #+P
>>>
>>> ### prohibit these
>>> # pictures
>>> .gif
>>> .jpg
>>> .jpeg
>>> .bmp
>>> .png
>>> and so on.
>>>
>>> I'm working with nutch trunk.
>>>
>>> Thanks for the time and help.
>>> Andy
>>>
>>
>
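
The difference between matching a suffix against the whole URL and matching
it against the path only (the effect attributed to +P in this thread) can be
illustrated with a small sketch. This is not the Nutch implementation, only a
Python model of the behaviour described above. The function name is_pruned,
its parameters, and the three-entry SUFFIXES list are hypothetical and chosen
for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical subset of the suffixes listed in suffix-urlfilter.txt.
SUFFIXES = [".jpeg", ".com", ".au"]


def is_pruned(url, suffixes=SUFFIXES, path_only=False, ignore_case=True):
    """Return True if `url` ends with one of `suffixes`.

    path_only=False compares against the whole URL string, which models
    the behaviour observed in the thread. path_only=True compares against
    the URL path only, which models what +P is described as doing.
    This is an assumption-based model, not the plug-in's actual code.
    """
    target = urlsplit(url).path if path_only else url
    if ignore_case:
        target = target.lower()
        suffixes = [s.lower() for s in suffixes]
    return any(target.endswith(s) for s in suffixes)


if __name__ == "__main__":
    urls = [
        "http://www.google.com",
        "http://www.google.com/",
        "http://www.unimelb.edu.au",
        "http://example.org/search.cgi?q=google.com",
        "http://nutch.apache.org/logo.jpeg",
    ]
    for url in urls:
        print(url, is_pruned(url), is_pruned(url, path_only=True))
```

In this model, http://www.google.com is pruned when the whole URL is matched
but kept when only the path is matched, because its path is empty. A URL whose
path really ends in a listed suffix, such as .jpeg, is pruned either way.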

