Author: jerome
Date: Tue Mar 21 14:40:15 2006
New Revision: 387657

URL: http://svn.apache.org/viewcvs?rev=387657&view=rev
Log:
Add an automaton urlfilter rules template

Added:
    lucene/nutch/trunk/conf/automaton-urlfilter.txt.template

Added: lucene/nutch/trunk/conf/automaton-urlfilter.txt.template
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/conf/automaton-urlfilter.txt.template?rev=387657&view=auto
==============================================================================
--- lucene/nutch/trunk/conf/automaton-urlfilter.txt.template (added)
+++ lucene/nutch/trunk/conf/automaton-urlfilter.txt.template Tue Mar 21 14:40:15 2006
@@ -0,0 +1,19 @@
+# The default url filter.
+# Better for whole-internet crawling.
+
+# Each non-comment, non-blank line contains a regular expression
+# prefixed by '+' or '-'.  The first matching pattern in the file
+# determines whether a URL is included or ignored.  If no pattern
+# matches, the URL is ignored.
+
+# skip file: ftp: and mailto: urls
+-(file|ftp|mailto):.*
+
+# skip image and other suffixes we can't yet parse
+-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
+
+# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
+
+# accept anything else
++.*
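
The first-match semantics described in the template's comments can be sketched roughly as follows. This is an illustration only, not the plugin's code: it assumes each pattern must match the entire URL, uses Python's `re` module as a stand-in for the automaton regular expression syntax (the two differ in details), and the names `parse_rules` and `accepts` are hypothetical. The rule that the mail archive redacted is left out rather than guessed.

```python
import re

# Rules copied from the template above; the redacted rule is omitted.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-(file|ftp|mailto):.*

# skip image and other suffixes we can't yet parse
-.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)

# accept anything else
+.*
"""


def parse_rules(text):
    """Return a list of (include, compiled_regex) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and comment lines carry no rule.
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the first matching rule; ignore the URL if nothing matches."""
    for include, regex in rules:
        # Assumption: a pattern has to cover the whole URL, not a substring.
        if regex.fullmatch(url):
            return include
    return False


rules = parse_rules(RULES_TEXT)
for url in ("http://example.org/index.html",
            "ftp://example.org/file",
            "http://example.org/logo.gif"):
    print(url, accepts(rules, url))
```

Because the first matching rule decides, the final `+.*` line only affects URLs that none of the preceding `-` rules rejected.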

