Sorry, my bad. I was using a Nutch build from a previous project that I had modified and recompiled to ignore the robots.txt file (as there is no flag to enable/disable that). I confirm that the parsing of robots.txt works.
> -------- Original Message --------
> Subject: Re: robots.txt Disallow not respected
> Local Time: December 12, 2017 10:54 PM
> UTC Time: December 12, 2017 9:54 PM
> From: [email protected]
> To: [email protected] <[email protected]>
>
> Hi,
>
> Yes, I tested the robots.txt manually using Nutch's
> org.apache.nutch.protocol.RobotRulesParser as suggested in the previous mail,
> and when I do that everything works correctly: the URLs that are disallowed
> get "not allowed" and the others "allowed". So I don't understand why this
> works but not my crawl.
>
> I am using https for this website but have a 301 which redirects all http
> traffic to https. I also tried deleting the whole HBase table as well as the
> Solr core, but that did not help either :(
>
> Regards,
> M.
>
>> -------- Original Message --------
>> Subject: Re: robots.txt Disallow not respected
>> Local Time: December 12, 2017 10:09 AM
>> UTC Time: December 12, 2017 9:09 AM
>> From: [email protected]
>> To: [email protected]
>>
>> Hi,
>>
>> Did you already test whether the robots.txt file is correctly parsed
>> and the rules are applied as expected? See the previous response.
>>
>> If https or non-default ports are used: is the robots.txt also shipped
>> for the other protocol/port combinations? See
>> https://issues.apache.org/jira/browse/NUTCH-1752
>>
>> Also note that content is not removed when the robots.txt is changed.
>> The robots.txt is only applied to a URL which is (re)fetched. To be sure,
>> delete the web table (stored in HBase, etc.) and restart the crawl.
>>
>> Best,
>> Sebastian
>>
>> On 12/11/2017 07:39 PM, mabi wrote:
>>> Hi Sebastian,
>>>
>>> I am already using the protocol-httpclient plugin, as I also require HTTPS,
>>> and I checked the access.log of the website I am crawling: it shows that a
>>> GET was done on the robots.txt, as you can see here:
>>>
>>> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"
>>>
>>> What I also did is enable DEBUG logging in log4j.properties like this:
>>>
>>> log4j.logger.org.apache.nutch=DEBUG
>>>
>>> and grep for "robots" in the hadoop.log file, but nothing could be found
>>> there either: no errors, nothing.
>>>
>>> What else could I try or check?
>>>
>>> Best,
>>> M.
>>>
>>>> -------- Original Message --------
>>>> Subject: Re: robots.txt Disallow not respected
>>>> Local Time: December 11, 2017 7:13 AM
>>>> UTC Time: December 11, 2017 6:13 AM
>>>> From: [email protected]
>>>> To: [email protected]
>>>>
>>>> Hi,
>>>>
>>>> Check that the robots.txt is acquired and parsed correctly. Try changing
>>>> the protocol plugin to protocol-httpclient.
>>>>
>>>> Z
>>>>
>>>> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've tried to reproduce it, but it works as expected:
>>>>
>>>> % cat robots.txt
>>>> User-agent: *
>>>> Disallow: /wpblog/feed/
>>>>
>>>> % cat test.txt
>>>> http://www.example.com/wpblog/feed/
>>>> http://www.example.com/wpblog/feed/index.html
>>>>
>>>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
>>>> not allowed: http://www.example.com/wpblog/feed/
>>>> not allowed: http://www.example.com/wpblog/feed/index.html
>>>>
>>>> There are no steps required to make Nutch respect the robots.txt rules;
>>>> the robots.txt only needs to be properly placed and readable.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 12/10/2017 11:16 PM, mabi wrote:
>>>>> Hello,
>>>>>
>>>>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not
>>>>> respect the robots.txt Disallow from my website. I have the following
>>>>> very simple robots.txt file:
>>>>>
>>>>> User-agent: *
>>>>> Disallow: /wpblog/feed/
>>>>>
>>>>> Still, the /wpblog/feed/ URL gets parsed and finally indexed.
>>>>>
>>>>> Do I need to enable anything special in the nutch-site.xml config file maybe?
>>>>>
>>>>> Thanks,
>>>>> Mabi
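For reference: Nutch's robots.txt handling is built on the crawler-commons library, so the rules discussed in the thread above can also be cross-checked outside of Nutch. Below is a minimal standalone sketch, assuming crawler-commons is on the classpath; the class name RobotsCheck, the test URLs, and the agent name are illustrative only, and the exact parseContent signature may vary between crawler-commons versions.

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Minimal cross-check of the Disallow rule from the thread, outside of Nutch.
// Assumes the crawler-commons dependency (which Nutch's RobotRulesParser
// builds on) is available; names and URLs here are illustrative only.
public class RobotsCheck {

    public static void main(String[] args) {
        // Same robots.txt content as in the thread.
        String robotsTxt = "User-agent: *\nDisallow: /wpblog/feed/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt",          // URL the robots.txt was fetched from
                robotsTxt.getBytes(StandardCharsets.UTF_8),   // raw robots.txt bytes
                "text/plain",                                 // content type
                "mycrawler");                                 // agent name matched against User-agent lines

        // Expected: false (disallowed) for the feed URL, true for other paths.
        System.out.println(rules.isAllowed("http://www.example.com/wpblog/feed/"));
        System.out.println(rules.isAllowed("http://www.example.com/wpblog/index.html"));
    }
}

If this prints "false" and "true" but disallowed URLs still end up in the index, the parsing itself is fine and the cause is more likely elsewhere (a different protocol/port serving no robots.txt, previously fetched content that was never purged, or, as in this case, a modified build).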

