Hi Sebastian,

I am already using the protocol-httpclient plugin, as I also need HTTPS. I also checked the access.log of the website I am crawling and can see that the crawler did do a GET on robots.txt, as this entry shows:

123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"
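In case my configuration matters here, this is roughly the relevant part of my nutch-site.xml (the plugin list is abbreviated, only the protocol plugin and the agent name are the point; the agent name is the one you can see in the access.log entry above):

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|...</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>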
What I also did was to enable DEBUG logging in log4j.properties like this:

log4j.logger.org.apache.nutch=DEBUG

and then grep for "robots" in the hadoop.log file, but nothing could be found there either, no errors, nothing.

What else could I try or check?

Best,
M.

> -------- Original Message --------
> Subject: Re: robots.txt Disallow not respected
> Local Time: December 11, 2017 7:13 AM
> UTC Time: December 11, 2017 6:13 AM
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> Check that robots.txt is acquired and parsed correctly. Try to change the
> protocol to protocol-httpclient.
>
> Z
>
> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
> Hi,
>
> I've tried to reproduce it. But it works as expected:
>
> % cat robots.txt
> User-agent: *
> Disallow: /wpblog/feed/
>
> % cat test.txt
> http://www.example.com/wpblog/feed/
> http://www.example.com/wpblog/feed/index.html
>
> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
> not allowed: http://www.example.com/wpblog/feed/
> not allowed: http://www.example.com/wpblog/feed/index.html
>
> There are no steps required to make Nutch respect the robots.txt rules.
> Only the robots.txt must be properly placed and readable.
>
> Best,
> Sebastian
>
> On 12/10/2017 11:16 PM, mabi wrote:
>> Hello,
>>
>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not respect
>> the robots.txt Disallow from my website. I have the following very simple robots.txt file:
>>
>> User-agent: *
>> Disallow: /wpblog/feed/
>>
>> Still the /wpblog/feed/ URL gets parsed and finally indexed.
>> Do I need to enable anything special in the nutch-site.xml config file maybe?
>>
>> Thanks,
>> Mabi
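P.S. One more thing I will try: re-running the RobotRulesParser check from the earlier mail, but with my own robots.txt and feed URL and with the agent name from the access.log entry above ('MyCrawler' instead of 'myAgent'), roughly like this (www.example.com standing in for my real domain):

% cat robots.txt
User-agent: *
Disallow: /wpblog/feed/

% cat test.txt
http://www.example.com/wpblog/feed/

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'MyCrawler'

If that also prints "not allowed", then I suppose the parsing itself is fine and the problem must be somewhere else in my setup.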

