Sorry, my bad. I was using a Nutch build from a previous project that I had
modified and recompiled to ignore the robots.txt file (as there is no flag to
enable/disable that).

I confirm that the parsing of robots.txt works.
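
For reference, the manual check was roughly the following, run against the live
site over https (example.com and MyCrawler are placeholders for the real
hostname and agent name):

% curl -s https://www.example.com/robots.txt -o robots.txt
% echo "https://www.example.com/wpblog/feed/" > test.txt
% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'MyCrawler'
not allowed: https://www.example.com/wpblog/feed/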


>-------- Original Message --------
>Subject: Re: robots.txt Disallow not respected
>Local Time: December 12, 2017 10:54 PM
>UTC Time: December 12, 2017 9:54 PM
>From: [email protected]
>To: [email protected] <[email protected]>
>
>Hi,
>
> Yes, I tested the robots.txt manually using Nutch's
> org.apache.nutch.protocol.RobotRulesParser as suggested in the previous mail,
> and when I do that everything works correctly: the URLs which are disallowed
> get "not allowed" and the others "allowed". So I don't understand why this
> works but not my crawling.
>
> I am using https for this website but have a 301 redirect which sends all http
> traffic to https. I also tried deleting the whole HBase table as well as the
> Solr core, but that did not help either :(
>
> Regards,
> M.
>>-------- Original Message --------
>> Subject: Re: robots.txt Disallow not respected
>> Local Time: December 12, 2017 10:09 AM
>> UTC Time: December 12, 2017 9:09 AM
>> From: [email protected]
>> To: [email protected]
>>Hi,
>>Did you already test whether the robots.txt file is correctly parsed
>> and the rules are applied as expected? See the previous response.
>>If https or non-default ports are used: is the robots.txt also served
>> for other protocol/port combinations? See
>>https://issues.apache.org/jira/browse/NUTCH-1752
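>> A quick way to check what each variant returns (the hostname is just an example):
>> % curl -sI http://www.example.com/robots.txt | head -1
>> % curl -sI https://www.example.com/robots.txt | head -1
>> The variant which is actually crawled should answer with a 200.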
>>Also note that content is not removed when the robots.txt is changed.
>> The robots.txt is only applied to a URL which is (re)fetched. To be sure,
>> delete the web table (stored in HBase, etc.) and restart the crawl.
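>> With the HBase store that would be roughly the following (the table is called
>> 'webpage' by default, or '<crawlId>_webpage' if a crawl id is used):
>> % echo "disable 'webpage'" | hbase shell
>> % echo "drop 'webpage'" | hbase shell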
>>Best,
>> Sebastian
>>On 12/11/2017 07:39 PM, mabi wrote:
>>>Hi Sebastian,
>>> I am already using the protocol-httpclient plugin as I also require HTTPS,
>>> and I checked the access.log of the website I am crawling and can see that it
>>> did a GET on the robots.txt, as you can see here:
>>> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 
>>> 200 223 "-" "MyCrawler/0.1"
>>> What I also did is enable DEBUG logging in log4j.properties like this:
>>> log4j.logger.org.apache.nutch=DEBUG
>>> and then grep for robots in the hadoop.log file, but nothing could be found
>>> there either, no errors, nothing.
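>>> For reference, the grep was simply
>>> % grep -i robots logs/hadoop.log
>>> run from the Nutch runtime directory where hadoop.log is written by default.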
>>> What else could I try or check?
>>> Best,
>>> M.
>>>>-------- Original Message --------
>>>> Subject: Re: robots.txt Disallow not respected
>>>> Local Time: December 11, 2017 7:13 AM
>>>> UTC Time: December 11, 2017 6:13 AM
>>>> From: [email protected]
>>>> To: [email protected]
>>>> Hi,
>>>> Check that robots.txt is acquired and parsed correctly. Try to change the 
>>>> protocol to protocol-httpclient.
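>>>> That is done by overriding plugin.includes in nutch-site.xml, for example
>>>> (only protocol-httpclient replaces protocol-http; keep whatever other plugins
>>>> you already use):
>>>> <property>
>>>>   <name>plugin.includes</name>
>>>>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>>>> </property>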
>>>> Z
>>>> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
>>>> Hi,
>>>> I've tried to reproduce it. But it works as expected:
>>>> % cat robots.txt
>>>> User-agent: *
>>>> Disallow: /wpblog/feed/
>>>> % cat test.txt
>>>>http://www.example.com/wpblog/feed/
>>>>http://www.example.com/wpblog/feed/index.html
>>>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser 
>>>> robots.txt test.txt 'myAgent'
>>>> not allowed: http://www.example.com/wpblog/feed/
>>>> not allowed: http://www.example.com/wpblog/feed/index.html
>>>> There are no steps required to make Nutch respect the robots.txt rules.
>>>> Only the robots.txt must be properly placed and readable.
>>>> Best,
>>>> Sebastian
>>>> On 12/10/2017 11:16 PM, mabi wrote:
>>>>>Hello,
>>>>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not
>>>>> respect the robots.txt Disallow rule of my website. I have the following
>>>>> very simple robots.txt file:
>>>>> User-agent: *
>>>>> Disallow: /wpblog/feed/
>>>>> Still, the /wpblog/feed/ URL gets parsed and finally indexed.
>>>>> Do I need to enable anything special in the nutch-site.xml config file 
>>>>> maybe?
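>>>>> (The mandatory http.agent.name is of course set, roughly like this:
>>>>> <property>
>>>>>   <name>http.agent.name</name>
>>>>>   <value>MyCrawler</value> <!-- placeholder name -->
>>>>> </property>
>>>>> but nothing robots-specific.)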
>>>>> Thanks,
>>>>> Mabi
