FWIW, in versions of Nutch after 1.10 there is a robots.whitelist property that you can use to explicitly whitelist sites for which robots.txt is ignored.
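For example, something along these lines in nutch-site.xml should do it (a rough sketch from memory; double-check the exact property name against the nutch-default.xml of your release, where recent 1.x versions ship it as http.robot.rules.whitelist, and note that www.example.com is just a placeholder):

    <property>
      <name>http.robot.rules.whitelist</name>
      <!-- comma-separated hostnames or IP addresses for which robots.txt parsing is skipped -->
      <value>www.example.com</value>
    </property>

Use it only for sites you own or where the site owner has explicitly allowed you to ignore robots.txt.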
Cheers,
Chris

On 12/12/17, 2:32 PM, "Sebastian Nagel" <[email protected]> wrote:

:)

On 12/12/2017 11:11 PM, mabi wrote:
> Sorry my bad, I was using a Nutch from a previous project that I had modified and recompiled to ignore the robots.txt file (as there is no flag to enable/disable that).
>
> I confirm that the parsing of robots.txt works.
>
>
>> -------- Original Message --------
>> Subject: Re: robots.txt Disallow not respected
>> Local Time: December 12, 2017 10:54 PM
>> UTC Time: December 12, 2017 9:54 PM
>> From: [email protected]
>> To: [email protected] <[email protected]>
>>
>> Hi,
>>
>> Yes, I tested the robots.txt manually using Nutch's org.apache.nutch.protocol.RobotRulesParser as suggested in the previous mail, and when I do that everything works correctly: the URLs which are disallowed get "not allowed" and the others "allowed". So I don't understand why this works but not my crawling.
>>
>> I am using https for this website but have a 301 which redirects all http traffic to https. I also tried deleting the whole HBase table as well as the Solr core, but that did not help either :(
>>
>> Regards,
>> M.
>>
>>> -------- Original Message --------
>>> Subject: Re: robots.txt Disallow not respected
>>> Local Time: December 12, 2017 10:09 AM
>>> UTC Time: December 12, 2017 9:09 AM
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Hi,
>>>
>>> did you already test whether the robots.txt file is correctly parsed
>>> and rules are applied as expected? See the previous response.
>>>
>>> If https or non-default ports are used: is the robots.txt shipped also
>>> for other protocol/port combinations? See
>>> https://issues.apache.org/jira/browse/NUTCH-1752
>>>
>>> Also note that content is not removed when the robots.txt is changed.
>>> The robots.txt is only applied to a URL which is (re)fetched. To be sure,
>>> delete the web table (stored in HBase, etc.) and restart the crawl.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 12/11/2017 07:39 PM, mabi wrote:
>>>> Hi Sebastian,
>>>>
>>>> I am already using the protocol-httpclient plugin as I also require HTTPS, and I checked the access.log from the website I am crawling and see that it did a GET on the robots.txt, as you can see here:
>>>>
>>>> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"
>>>>
>>>> What I also did was to enable DEBUG logging in log4j.properties like this:
>>>>
>>>> log4j.logger.org.apache.nutch=DEBUG
>>>>
>>>> and grep for "robots" in the hadoop.log file, but nothing could be found there either, no errors, nothing.
>>>>
>>>> What else could I try or check?
>>>>
>>>> Best,
>>>> M.
>>>>
>>>>> -------- Original Message --------
>>>>> Subject: Re: robots.txt Disallow not respected
>>>>> Local Time: December 11, 2017 7:13 AM
>>>>> UTC Time: December 11, 2017 6:13 AM
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>>
>>>>> Hi,
>>>>>
>>>>> Check that robots.txt is acquired and parsed correctly. Try to change the protocol plugin to protocol-httpclient.
>>>>>
>>>>> Z
>>>>>
>>>>> On 2017-12-10 23:54:14, Sebastian Nagel <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I've tried to reproduce it. But it works as expected:
>>>>>
>>>>> % cat robots.txt
>>>>> User-agent: *
>>>>> Disallow: /wpblog/feed/
>>>>>
>>>>> % cat test.txt
>>>>> http://www.example.com/wpblog/feed/
>>>>> http://www.example.com/wpblog/feed/index.html
>>>>>
>>>>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
>>>>> not allowed: http://www.example.com/wpblog/feed/
>>>>> not allowed: http://www.example.com/wpblog/feed/index.html
>>>>>
>>>>> There are no steps required to make Nutch respect the robots.txt rules.
>>>>> Only the robots.txt must be properly placed and readable.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 12/10/2017 11:16 PM, mabi wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not respect the robots.txt
>>>>>> Disallow from my website. I have the following very simple robots.txt file:
>>>>>>
>>>>>> User-agent: *
>>>>>> Disallow: /wpblog/feed/
>>>>>>
>>>>>> Still the /wpblog/feed/ URL gets parsed and finally indexed.
>>>>>> Do I need to enable anything special in the nutch-site.xml config file maybe?
>>>>>>
>>>>>> Thanks,
>>>>>> Mabi
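A side note on the protocol-httpclient suggestion in the thread above: the protocol plugin is selected through plugin.includes in nutch-site.xml. A minimal sketch, assuming a fairly stock plugin list (keep whatever other plugins your crawl already relies on; the only relevant change is replacing protocol-http with protocol-httpclient):

    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
    </property>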

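And on Sebastian's advice to delete the web table and restart the crawl: with Nutch 2.x backed by HBase this can be done from the HBase shell. A rough sketch, assuming the default table name webpage (if storage.crawl.id is set, the table is named <crawlId>_webpage instead):

    # drop the Nutch web table so the next crawl starts from a clean state
    echo "disable 'webpage'" | hbase shell
    echo "drop 'webpage'" | hbase shell

Afterwards, re-inject the seed URLs and run the crawl again so the current robots.txt is fetched and applied.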
