Hi,

I've tried to reproduce it, but it works as expected:

% cat robots.txt
User-agent: *
Disallow: /wpblog/feed/

% cat test.txt
http://www.example.com/wpblog/feed/
http://www.example.com/wpblog/feed/index.html

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
not allowed:    http://www.example.com/wpblog/feed/
not allowed:    http://www.example.com/wpblog/feed/index.html
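
For reference, the same check can be done programmatically. Below is a minimal
sketch using SimpleRobotRulesParser from crawler-commons (to my knowledge, the
library Nutch delegates robots.txt parsing to); the class name RobotsCheck and
the inlined rules are just for illustration:

import java.nio.charset.StandardCharsets;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /wpblog/feed/\n";
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // Parse the rules as if they had been fetched from the site root
        BaseRobotRules rules = parser.parseContent(
                "http://www.example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "myAgent");
        // Both URLs match the Disallow prefix, so both print "false"
        System.out.println(rules.isAllowed("http://www.example.com/wpblog/feed/"));
        System.out.println(rules.isAllowed("http://www.example.com/wpblog/feed/index.html"));
    }
}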


There are no extra steps required to make Nutch respect the robots.txt rules.
The robots.txt only needs to be readable and placed at the root of the host
being crawled.
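
If in doubt whether the crawler can see the file at all, a quick check is to
request it from the host root (a sketch; www.example.com stands in for the
real host, and the class name RobotsReachable is made up for this example):

import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsReachable {
    public static void main(String[] args) throws Exception {
        // robots.txt is only honored if it is served from the host root
        URL robots = new URL("http://www.example.com/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) robots.openConnection();
        // 200 means the crawler can fetch and apply the rules;
        // a 404 makes crawlers treat every URL on the host as allowed
        System.out.println(conn.getResponseCode());
    }
}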

Best,
Sebastian


On 12/10/2017 11:16 PM, mabi wrote:
> Hello,
>
> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not respect
> the robots.txt Disallow from my website. I have the following very simple
> robots.txt file:
>
> User-agent: *
> Disallow: /wpblog/feed/
>
> Still the /wpblog/feed/ URL gets parsed and finally indexed.
>
> Do I need to enable anything special in the nutch-site.xml config file maybe?
>
> Thanks,
> Mabi
