Hi,

have you already tested whether the robots.txt file is correctly parsed and
the rules are applied as expected? See the previous response.
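For example, you can repeat the check from the previous response with your own
data (a minimal sketch: urls.txt is a hypothetical file listing the URLs in
question, and 'MyCrawler' stands in for whatever is configured as
http.agent.name, since robots.txt rules are matched per agent):

% cat urls.txt
http://www.example.com/wpblog/feed/

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt urls.txt 'MyCrawler'
not allowed: http://www.example.com/wpblog/feed/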
If https or non-default ports are used: is the robots.txt also served for the
other protocol/port combinations? See
https://issues.apache.org/jira/browse/NUTCH-1752
(There is an illustration after the quoted thread below.)

Also note that content is not removed when the robots.txt is changed. The
robots.txt rules are only applied when a URL is (re)fetched. To be sure,
delete the web table (stored in HBase, etc.) and restart the crawl (a sketch
follows the quoted thread below).

Best,
Sebastian

On 12/11/2017 07:39 PM, mabi wrote:
> Hi Sebastian,
>
> I am already using the protocol-httpclient plugin as I also require HTTPS.
> I checked the access.log of the website I am crawling and can see that the
> crawler did a GET on the robots.txt, as shown here:
>
> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0"
> 200 223 "-" "MyCrawler/0.1"
>
> I also enabled DEBUG logging in log4j.properties like this:
>
> log4j.logger.org.apache.nutch=DEBUG
>
> and grepped for "robots" in the hadoop.log file, but nothing could be found
> there either, no errors, nothing.
>
> What else could I try or check?
>
> Best,
> M.
>
>> -------- Original Message --------
>> Subject: Re: robots.txt Disallow not respected
>> Local Time: December 11, 2017 7:13 AM
>> UTC Time: December 11, 2017 6:13 AM
>> From: [email protected]
>> To: [email protected]
>>
>> Hi,
>>
>> Check that the robots.txt is acquired and parsed correctly. Try changing
>> the protocol plugin to protocol-httpclient.
>>
>> Z
>>
>> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
>> Hi,
>>
>> I've tried to reproduce it, but it works as expected:
>>
>> % cat robots.txt
>> User-agent: *
>> Disallow: /wpblog/feed/
>>
>> % cat test.txt
>> http://www.example.com/wpblog/feed/
>> http://www.example.com/wpblog/feed/index.html
>>
>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
>> not allowed: http://www.example.com/wpblog/feed/
>> not allowed: http://www.example.com/wpblog/feed/index.html
>>
>> No steps are required to make Nutch respect the robots.txt rules.
>> The robots.txt only needs to be properly placed and readable.
>>
>> Best,
>> Sebastian
>>
>> On 12/10/2017 11:16 PM, mabi wrote:
>>> Hello,
>>>
>>> I am crawling my website with Nutch 2.3.1, and somehow Nutch does not
>>> respect the robots.txt Disallow from my website. I have the following
>>> very simple robots.txt file:
>>>
>>> User-agent: *
>>> Disallow: /wpblog/feed/
>>>
>>> Still, the /wpblog/feed/ URL gets parsed and finally indexed.
>>>
>>> Do I need to enable anything special in the nutch-site.xml config file,
>>> maybe?
>>>
>>> Thanks,
>>> Mabi
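To illustrate the protocol/port point from the reply above: a robots.txt is
scoped to one scheme/host/port combination, so each combination is governed
by its own copy (hypothetical example.com URLs):

http://www.example.com/wpblog/feed/       ->  http://www.example.com/robots.txt
https://www.example.com/wpblog/feed/      ->  https://www.example.com/robots.txt
http://www.example.com:8080/wpblog/feed/  ->  http://www.example.com:8080/robots.txt

A quick way to compare what each combination actually serves (assuming curl
is available):

% curl http://www.example.com/robots.txt
% curl https://www.example.com/robots.txt

If the https variant returns a 404 or different rules, the Disallow may
simply never be seen for https URLs.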
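And a sketch of deleting the web table from the HBase shell, as suggested in
the reply above (assuming the default Gora/HBase backend and its default
table name 'webpage'; if the crawl was started with a crawl id, the table is
usually prefixed with it, so list the tables first):

% hbase shell
hbase> list
hbase> disable 'webpage'
hbase> drop 'webpage'

After dropping the table, restart the crawl from the seed list so that every
URL is re-fetched against the current robots.txt.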
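On the DEBUG logging step in the quoted message: a case-insensitive search
casts a wider net (sketch, assuming the default log location under the Nutch
runtime directory):

% grep -i robots $NUTCH_HOME/logs/hadoop.log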

