Author: jerome
Date: Tue Mar 21 14:40:15 2006
New Revision: 387657

URL: http://svn.apache.org/viewcvs?rev=387657&view=rev
Log:
Add an automaton urlfilter rules template

Added:
    lucene/nutch/trunk/conf/automaton-urlfilter.txt.template

Added: lucene/nutch/trunk/conf/automaton-urlfilter.txt.template
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/conf/automaton-urlfilter.txt.template?rev=387657&view=auto
==============================================================================
--- lucene/nutch/trunk/conf/automaton-urlfilter.txt.template (added)
+++ lucene/nutch/trunk/conf/automaton-urlfilter.txt.template Tue Mar 21 14:40:15 2006
@@ -0,0 +1,19 @@
+# The default url filter.
+# Better for whole-internet crawling.
+
+# Each non-comment, non-blank line contains a regular expression
+# prefixed by '+' or '-'. The first matching pattern in the file
+# determines whether a URL is included or ignored. If no pattern
+# matches, the URL is ignored.
+
+# skip file: ftp: and mailto: urls
+-(file|ftp|mailto):.*
+
+# skip image and other suffixes we can't yet parse
+-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
+
+# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
+
+# accept anything else
++.*
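
The first-match semantics described in the template comments can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: it uses Python's `re` module as a stand-in for the automaton regular-expression syntax (which may differ in details), assumes each pattern must match the whole URL, and the names `parse_rules` and `accepts` are hypothetical. The rule that is obscured as "[EMAIL PROTECTED]" in this message is left out rather than guessed.

```python
import re

# Rules copied from the template above, minus the line that the
# mail archive obscured. Comment and blank lines are ignored.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-(file|ftp|mailto):.*

# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)

# accept anything else
+.*
"""


def parse_rules(text):
    """Turn rule text into an ordered list of (include, compiled_pattern)."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule matching the whole URL.

    Assumption: whole-string matching (re.fullmatch). If no rule
    matches, the URL is ignored, as the template comments state.
    """
    for include, pattern in rules:
        if pattern.fullmatch(url):
            return include
    return False


RULES = parse_rules(RULES_TEXT)
```

Under these assumptions, `accepts(RULES, url)` rejects a URL as soon as an earlier `-` rule matches it, and the trailing `+.*` rule only applies to URLs that none of the preceding rules matched.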