Hi,

have you already tested whether the robots.txt file is correctly parsed and
the rules are applied as expected? See the previous response.
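For example, you can repeat the check from the previous response with your own
data (a minimal sketch: urls.txt is a hypothetical file listing the URLs in
question, and 'MyCrawler' stands in for whatever is configured as
http.agent.name, since robots.txt rules are matched per agent):

% cat urls.txt
http://www.example.com/wpblog/feed/

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt urls.txt 'MyCrawler'
not allowed: http://www.example.com/wpblog/feed/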
If https or non-default ports are used: is the robots.txt also served for the
other protocol/port combinations? See
https://issues.apache.org/jira/browse/NUTCH-1752
(There is an illustration after the quoted thread below.)

Also note that content is not removed when the robots.txt is changed. The
robots.txt rules are only applied when a URL is (re)fetched. To be sure,
delete the web table (stored in HBase, etc.) and restart the crawl (a sketch
follows the quoted thread below).

Best,
Sebastian

On 12/11/2017 07:39 PM, mabi wrote:
> Hi Sebastian,
>
> I am already using the protocol-httpclient plugin as I also require HTTPS.
> I checked the access.log of the website I am crawling and can see that the
> crawler did a GET on the robots.txt, as shown here:
>
> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0"
> 200 223 "-" "MyCrawler/0.1"
>
> I also enabled DEBUG logging in log4j.properties like this:
>
> log4j.logger.org.apache.nutch=DEBUG
>
> and grepped for "robots" in the hadoop.log file, but nothing could be found
> there either, no errors, nothing.
>
> What else could I try or check?
>
> Best,
> M.
>
>> -------- Original Message --------
>> Subject: Re: robots.txt Disallow not respected
>> Local Time: December 11, 2017 7:13 AM
>> UTC Time: December 11, 2017 6:13 AM
>> From: [email protected]
>> To: [email protected]
>>
>> Hi,
>>
>> Check that the robots.txt is acquired and parsed correctly. Try changing
>> the protocol plugin to protocol-httpclient.
>>
>> Z
>>
>> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
>> Hi,
>>
>> I've tried to reproduce it, but it works as expected:
>>
>> % cat robots.txt
>> User-agent: *
>> Disallow: /wpblog/feed/
>>
>> % cat test.txt
>> http://www.example.com/wpblog/feed/
>> http://www.example.com/wpblog/feed/index.html
>>
>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
>> not allowed: http://www.example.com/wpblog/feed/
>> not allowed: http://www.example.com/wpblog/feed/index.html
>>
>> No steps are required to make Nutch respect the robots.txt rules.
>> The robots.txt only needs to be properly placed and readable.
>>
>> Best,
>> Sebastian
>>
>> On 12/10/2017 11:16 PM, mabi wrote:
>>> Hello,
>>>
>>> I am crawling my website with Nutch 2.3.1, and somehow Nutch does not
>>> respect the robots.txt Disallow from my website. I have the following
>>> very simple robots.txt file:
>>>
>>> User-agent: *
>>> Disallow: /wpblog/feed/
>>>
>>> Still, the /wpblog/feed/ URL gets parsed and finally indexed.
>>>
>>> Do I need to enable anything special in the nutch-site.xml config file,
>>> maybe?
>>>
>>> Thanks,
>>> Mabi
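To illustrate the protocol/port point from the reply above: a robots.txt is
scoped to one scheme/host/port combination, so each combination is governed
by its own copy (hypothetical example.com URLs):

http://www.example.com/wpblog/feed/       ->  http://www.example.com/robots.txt
https://www.example.com/wpblog/feed/      ->  https://www.example.com/robots.txt
http://www.example.com:8080/wpblog/feed/  ->  http://www.example.com:8080/robots.txt

A quick way to compare what each combination actually serves (assuming curl
is available):

% curl http://www.example.com/robots.txt
% curl https://www.example.com/robots.txt

If the https variant returns a 404 or different rules, the Disallow may
simply never be seen for https URLs.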
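And a sketch of deleting the web table from the HBase shell, as suggested in
the reply above (assuming the default Gora/HBase backend and its default
table name 'webpage'; if the crawl was started with a crawl id, the table is
usually prefixed with it, so list the tables first):

% hbase shell
hbase> list
hbase> disable 'webpage'
hbase> drop 'webpage'

After dropping the table, restart the crawl from the seed list so that every
URL is re-fetched against the current robots.txt.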
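On the DEBUG logging step in the quoted message: a case-insensitive search
casts a wider net (sketch, assuming the default log location under the Nutch
runtime directory):

% grep -i robots $NUTCH_HOME/logs/hadoop.log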

