Hi,

Yes, I tested the robots.txt manually with Nutch's org.apache.nutch.protocol.RobotRulesParser
as suggested in the previous mail, and everything works correctly: the URLs which are
disallowed come back as "not allowed" and the others as "allowed". So I don't understand
why this test works but my crawl does not respect the rules.

I am using https for this website, with a 301 redirect that sends all http traffic to https.
I also tried deleting the whole HBase table as well as the Solr core, but that did not help
either :(
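For reference, here is roughly the test I ran (www.example.com and the agent string are
placeholders for my real host and crawler name):

% cat robots.txt
User-agent: *
Disallow: /wpblog/feed/

% cat test.txt
http://www.example.com/wpblog/feed/
https://www.example.com/wpblog/feed/
https://www.example.com/wpblog/feed/index.html

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'MyCrawler'
not allowed: http://www.example.com/wpblog/feed/
not allowed: https://www.example.com/wpblog/feed/
not allowed: https://www.example.com/wpblog/feed/index.html

So both the http and the https variants of the feed URLs are reported as "not allowed" by the parser.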
Regards,
M.

> -------- Original Message --------
> Subject: Re: robots.txt Disallow not respected
> Local Time: December 12, 2017 10:09 AM
> UTC Time: December 12, 2017 9:09 AM
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> did you already test whether the robots.txt file is correctly parsed
> and rules are applied as expected? See the previous response.
>
> If https or non-default ports are used: is the robots.txt shipped also
> for other protocol/port combinations? See
> https://issues.apache.org/jira/browse/NUTCH-1752
>
> Also note that content is not removed when the robots.txt is changed.
> The robots.txt is only applied to a URL which is (re)fetched. To be sure,
> delete the web table (stored in HBase, etc.) and restart the crawl.
>
> Best,
> Sebastian
>
> On 12/11/2017 07:39 PM, mabi wrote:
>> Hi Sebastian,
>>
>> I am already using the protocol-httpclient plugin as I also require HTTPS, and
>> I checked the access.log of the website I am crawling and can see that it did a
>> GET on the robots.txt, as you can see here:
>>
>> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"
>>
>> What I also did is enable DEBUG logging in log4j.properties like this:
>>
>> log4j.logger.org.apache.nutch=DEBUG
>>
>> and grep for "robots" in the hadoop.log file, but nothing could be found there
>> either, no errors, nothing.
>>
>> What else could I try or check?
>>
>> Best,
>> M.
>>
>>> -------- Original Message --------
>>> Subject: Re: robots.txt Disallow not respected
>>> Local Time: December 11, 2017 7:13 AM
>>> UTC Time: December 11, 2017 6:13 AM
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Hi,
>>>
>>> Check that robots.txt is acquired and parsed correctly. Try to change the
>>> protocol plugin to protocol-httpclient.
>>>
>>> Z
>>>
>>> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
>>>
>>> Hi,
>>>
>>> I've tried to reproduce it. But it works as expected:
>>>
>>> % cat robots.txt
>>> User-agent: *
>>> Disallow: /wpblog/feed/
>>>
>>> % cat test.txt
>>> http://www.example.com/wpblog/feed/
>>> http://www.example.com/wpblog/feed/index.html
>>>
>>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
>>> not allowed: http://www.example.com/wpblog/feed/
>>> not allowed: http://www.example.com/wpblog/feed/index.html
>>>
>>> There are no steps required to make Nutch respect the robots.txt rules.
>>> Only the robots.txt must be properly placed and readable.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 12/10/2017 11:16 PM, mabi wrote:
>>>> Hello,
>>>>
>>>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not
>>>> respect the robots.txt Disallow from my website. I have the following
>>>> very simple robots.txt file:
>>>>
>>>> User-agent: *
>>>> Disallow: /wpblog/feed/
>>>>
>>>> Still the /wpblog/feed/ URL gets parsed and finally indexed.
>>>> Do I need to enable anything special in the nutch-site.xml config file maybe?
>>>>
>>>> Thanks,
>>>> Mabi
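P.S. Regarding the NUTCH-1752 hint above: a quick way to double-check that the robots.txt is
also reachable over https (the host name is again a placeholder for my real one) would be:

% curl -I https://www.example.com/robots.txt
% curl https://www.example.com/robots.txt

The first command should report a 200 and the second one should print the same
User-agent/Disallow rules as shown above.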

