Hi Sebastian,

I am already using the protocol-httpclient plugin, as I also need HTTPS. I also checked the access.log of the website I am crawling and can see that the crawler did do a GET on robots.txt, as this entry shows:

123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"
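In case my configuration matters here, this is roughly the relevant part of my nutch-site.xml (the plugin list is abbreviated, only the protocol plugin and the agent name are the point; the agent name is the one you can see in the access.log entry above):

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|...</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>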
What I also did was to enable DEBUG logging in log4j.properties like this:

log4j.logger.org.apache.nutch=DEBUG

and then grep for "robots" in the hadoop.log file, but nothing could be found there either, no errors, nothing.

What else could I try or check?

Best,
M.

> -------- Original Message --------
> Subject: Re: robots.txt Disallow not respected
> Local Time: December 11, 2017 7:13 AM
> UTC Time: December 11, 2017 6:13 AM
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> Check that robots.txt is acquired and parsed correctly. Try to change the
> protocol to protocol-httpclient.
>
> Z
>
> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
> Hi,
>
> I've tried to reproduce it. But it works as expected:
>
> % cat robots.txt
> User-agent: *
> Disallow: /wpblog/feed/
>
> % cat test.txt
> http://www.example.com/wpblog/feed/
> http://www.example.com/wpblog/feed/index.html
>
> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
> not allowed: http://www.example.com/wpblog/feed/
> not allowed: http://www.example.com/wpblog/feed/index.html
>
> There are no steps required to make Nutch respect the robots.txt rules.
> Only the robots.txt must be properly placed and readable.
>
> Best,
> Sebastian
>
> On 12/10/2017 11:16 PM, mabi wrote:
>> Hello,
>>
>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not respect
>> the robots.txt Disallow from my website. I have the following very simple robots.txt file:
>>
>> User-agent: *
>> Disallow: /wpblog/feed/
>>
>> Still the /wpblog/feed/ URL gets parsed and finally indexed.
>> Do I need to enable anything special in the nutch-site.xml config file maybe?
>>
>> Thanks,
>> Mabi
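P.S. One more thing I will try: re-running the RobotRulesParser check from the earlier mail, but with my own robots.txt and feed URL and with the agent name from the access.log entry above ('MyCrawler' instead of 'myAgent'), roughly like this (www.example.com standing in for my real domain):

% cat robots.txt
User-agent: *
Disallow: /wpblog/feed/

% cat test.txt
http://www.example.com/wpblog/feed/

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'MyCrawler'

If that also prints "not allowed", then I suppose the parsing itself is fine and the problem must be somewhere else in my setup.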

