Hi, I've tried to reproduce this, but it works as expected:
% cat robots.txt
User-agent: *
Disallow: /wpblog/feed/
% cat test.txt
http://www.example.com/wpblog/feed/
http://www.example.com/wpblog/feed/index.html
% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
not allowed:    http://www.example.com/wpblog/feed/
not allowed:    http://www.example.com/wpblog/feed/index.html

No extra steps are required to make Nutch respect robots.txt rules; the only requirement is that the robots.txt file is properly placed at the site root and readable.

Best,
Sebastian

On 12/10/2017 11:16 PM, mabi wrote:
> Hello,
>
> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not respect
> the robots.txt Disallow rule from my website. I have the following very simple robots.txt file:
>
> User-agent: *
> Disallow: /wpblog/feed/
>
> Still, the /wpblog/feed/ URL gets parsed and finally indexed.
>
> Do I need to enable anything special in the nutch-site.xml config file maybe?
>
> Thanks,
> Mabi
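P.S. For reference, the same check can be reproduced programmatically with crawler-commons, the robots.txt parsing library Nutch delegates to. Below is a minimal sketch: the robots.txt content, the agent name 'myAgent', and the test URLs are copied from the command-line example above; parseContent() and isAllowed() are the crawler-commons API (the String-based parseContent() variant, as in the crawler-commons versions current with Nutch 2.3.1).

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.nio.charset.StandardCharsets;

public class RobotsCheck {
    public static void main(String[] args) {
        // Same rules as the robots.txt in the example above
        String robotsTxt = "User-agent: *\nDisallow: /wpblog/feed/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // parseContent(url, content, contentType, robotNames)
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "myAgent");

        String[] urls = {
                "http://www.example.com/wpblog/feed/",
                "http://www.example.com/wpblog/feed/index.html"
        };
        for (String url : urls) {
            System.out.println((rules.isAllowed(url) ? "allowed" : "not allowed")
                    + ":\t" + url);
        }
    }
}

Note that with a "User-agent: *" group the agent name passed to the parser makes no difference; it only starts to matter once the robots.txt contains agent-specific groups.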

