Author: snagel
Date: Fri Dec 5 19:53:35 2014
New Revision: 1643412

URL: http://svn.apache.org/r1643412
Log:
NUTCH-1877 Suffix URL filter to ignore query string by default

Modified:
    nutch/branches/2.x/CHANGES.txt
    nutch/branches/2.x/conf/suffix-urlfilter.txt.template
    nutch/trunk/CHANGES.txt
    nutch/trunk/conf/suffix-urlfilter.txt.template

Modified: nutch/branches/2.x/CHANGES.txt
URL: http://svn.apache.org/viewvc/nutch/branches/2.x/CHANGES.txt?rev=1643412&r1=1643411&r2=1643412&view=diff
==============================================================================
--- nutch/branches/2.x/CHANGES.txt (original)
+++ nutch/branches/2.x/CHANGES.txt Fri Dec 5 19:53:35 2014
@@ -2,6 +2,8 @@ Nutch Change Log
 
 Current Development 2.3-SNAPSHOT
 
+* NUTCH-1877 Suffix URL filter to ignore query string by default (markus via snagel)
+
 * NUTCH-1825 protocol-http may hang for certain web pages (Phu Kieu via snagel)
 
 * NUTCH-1483 Can't crawl filesystem with protocol-file plugin (Rogério Pereira Araújo, Mengying Wang, snagel)

Modified: nutch/branches/2.x/conf/suffix-urlfilter.txt.template
URL: http://svn.apache.org/viewvc/nutch/branches/2.x/conf/suffix-urlfilter.txt.template?rev=1643412&r1=1643411&r2=1643412&view=diff
==============================================================================
--- nutch/branches/2.x/conf/suffix-urlfilter.txt.template (original)
+++ nutch/branches/2.x/conf/suffix-urlfilter.txt.template Fri Dec 5 19:53:35 2014
@@ -16,8 +16,19 @@
 
 # case-insensitive, allow unknown suffixes
 +I
-# uncomment the line below to filter on url path
-#+P
+
+# filter on URL path only
++P
+# comment out to filter on complete URL
+# but be aware that the pattern
+# .com
+# will then reject
+# http://xyz.com
+# http://xyz.com/search?q=foo.com
+# while the pattern
+# .mp3
+# will not apply to (URLs will pass)
+# http://xyz.com/music.mp3?q=abc
 
 ### prohibit these
 # pictures

Modified: nutch/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1643412&r1=1643411&r2=1643412&view=diff
==============================================================================
--- nutch/trunk/CHANGES.txt (original)
+++ nutch/trunk/CHANGES.txt Fri Dec 5 19:53:35 2014
@@ -2,6 +2,8 @@ Nutch Change Log
 
 Nutch Current Development 1.10-SNAPSHOT
 
+* NUTCH-1877 Suffix URL filter to ignore query string by default (markus via snagel)
+
 * NUTCH-1890 Major Typo in Documentation for Integrating Nutch and Solr (Boadu Akoto Charles Jnr, mattmann)
 
 * NUTCH-1887 Specify HTMLMapper to use in TikaParser (jnioche)

Modified: nutch/trunk/conf/suffix-urlfilter.txt.template
URL: http://svn.apache.org/viewvc/nutch/trunk/conf/suffix-urlfilter.txt.template?rev=1643412&r1=1643411&r2=1643412&view=diff
==============================================================================
--- nutch/trunk/conf/suffix-urlfilter.txt.template (original)
+++ nutch/trunk/conf/suffix-urlfilter.txt.template Fri Dec 5 19:53:35 2014
@@ -2,8 +2,19 @@
 
 # case-insensitive, allow unknown suffixes
 +I
-# uncomment the line below to filter on url path
-#+P
+
+# filter on URL path only
++P
+# comment out to filter on complete URL
+# but be aware that the pattern
+# .com
+# will then reject
+# http://xyz.com
+# http://xyz.com/search?q=foo.com
+# while the pattern
+# .mp3
+# will not apply to (URLs will pass)
+# http://xyz.com/music.mp3?q=abc
 
 ### prohibit these
 # pictures
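
The difference between path-only matching (+P) and matching on the complete URL, as described in the template comments, can be illustrated with a small sketch. This is not the Nutch plugin code: the function name rejected_by_suffix and its parameters are hypothetical, and it only assumes that a rule rejects a URL when the compared string ends with the suffix.

```python
from urllib.parse import urlsplit


def rejected_by_suffix(url, suffixes, path_only=True, ignore_case=True):
    """Return True if the URL ends with one of the prohibited suffixes.

    Hypothetical helper for illustration only, not the Nutch implementation.
    path_only=True mimics the +P option (compare against the URL path only);
    path_only=False compares against the complete URL, query string included.
    """
    target = urlsplit(url).path if path_only else url
    if ignore_case:
        # Mimics the +I option: case-insensitive comparison.
        target = target.lower()
        suffixes = [s.lower() for s in suffixes]
    return any(target.endswith(s) for s in suffixes)


urls = [
    "http://xyz.com",
    "http://xyz.com/search?q=foo.com",
    "http://xyz.com/music.mp3?q=abc",
]

for u in urls:
    for suffix in (".com", ".mp3"):
        full = rejected_by_suffix(u, [suffix], path_only=False)
        path = rejected_by_suffix(u, [suffix], path_only=True)
        print(f"{u:35} {suffix:5} complete-URL reject={full!s:5} path-only reject={path}")
```

Under these assumptions, the pattern .com rejects the first two URLs only when the complete URL is compared, and the pattern .mp3 catches music.mp3?q=abc only when the comparison is restricted to the path, which matches the behaviour the new default is meant to provide.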