:)

On 12/12/2017 11:11 PM, mabi wrote:
> Sorry, my bad, I was using a Nutch build from a previous project that I had modified 
> and recompiled to ignore the robots.txt file (as there is no flag to 
> enable/disable that).
>
> I confirm that the parsing of robots.txt works.
> 
> 
>> -------- Original Message --------
>> Subject: Re: robots.txt Disallow not respected
>> Local Time: December 12, 2017 10:54 PM
>> UTC Time: December 12, 2017 9:54 PM
>> From: [email protected]
>> To: [email protected] <[email protected]>
>>
>> Hi,
>>
>> Yes, I tested the robots.txt manually using Nutch's 
>> org.apache.nutch.protocol.RobotRulesParser as suggested in the previous mail, 
>> and everything works correctly: the URLs which are disallowed 
>> get "not allowed" and the others "allowed". So I don't understand why this 
>> works but not my crawl.
>>
>> I am using https for this website but have a 301 redirect which sends all http 
>> traffic to https. I also tried deleting the whole HBase table as well as the 
>> Solr core, but that did not help either :(
>>
>> Regards,
>> M.
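
For reference, a cleanup like the one mentioned above might look roughly as follows; the table name 'webpage' and the Solr core name/URL are assumptions and depend on the crawl id and the Solr setup:

    # truncate the Nutch 2.x web table in HBase (may also be named <crawlId>_webpage)
    echo "truncate 'webpage'" | hbase shell

    # remove all documents from the Solr core instead of deleting the whole core
    curl "http://localhost:8983/solr/nutch/update?commit=true" \
         -H "Content-Type: text/xml" \
         -d "<delete><query>*:*</query></delete>"
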
>>> -------- Original Message --------
>>> Subject: Re: robots.txt Disallow not respected
>>> Local Time: December 12, 2017 10:09 AM
>>> UTC Time: December 12, 2017 9:09 AM
>>> From: [email protected]
>>> To: [email protected]
>>> Hi,
>>> did you already test whether the robots.txt file is correctly parsed
>>> and rules are applied as expected? See the previous response.
>>> If https or non-default ports are used: is the robots.txt also served
>>> for other protocol/port combinations? See
>>> https://issues.apache.org/jira/browse/NUTCH-1752
>>> Also note that content is not removed when the robots.txt is changed.
>>> The robots.txt is only applied to a URL which is (re)fetched. To be sure,
>>> delete the web table (stored in HBase, etc.) and restart the crawl.
>>> Best,
>>> Sebastian
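
A quick way to check the protocol point above is to request the robots.txt over both protocols and compare the responses (the hostname is a placeholder):

    # both requests should return 200 and the same rules; a redirect, 404 or
    # different content on one of them can explain unexpected crawler behaviour
    curl -sI  http://www.example.com/robots.txt
    curl -sI https://www.example.com/robots.txt
    curl -s   https://www.example.com/robots.txt
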
>>> On 12/11/2017 07:39 PM, mabi wrote:
>>>> Hi Sebastian,
>>>> I am already using the protocol-httpclient plugin as I also require HTTPS. 
>>>> I checked the access.log of the website I am crawling and can see that 
>>>> it did a GET on the robots.txt, as you can see here:
>>>> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt 
>>>> HTTP/1.0" 200 223 "-" "MyCrawler/0.1"
>>>> What I also did was enable DEBUG logging in log4j.properties like this:
>>>> log4j.logger.org.apache.nutch=DEBUG
>>>> and grep for "robots" in the hadoop.log file, but nothing could be found 
>>>> there either, no errors, nothing.
>>>> What else could I try or check?
>>>> Best,
>>>> M.
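
The logging check mentioned above boils down to something like this; the log location is an assumption (default local runtime of a source build):

    # conf/log4j.properties
    log4j.logger.org.apache.nutch=DEBUG

    # after the next fetch cycle, look for robots-related messages
    grep -i robots runtime/local/logs/hadoop.log
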
>>>>> -------- Original Message --------
>>>>> Subject: Re: robots.txt Disallow not respected
>>>>> Local Time: December 11, 2017 7:13 AM
>>>>> UTC Time: December 11, 2017 6:13 AM
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>> Hi,
>>>>> Check that robots.txt is acquired and parsed correctly. Try to change the 
>>>>> protocol to protocol-httpclient.
>>>>> Z
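
Switching the protocol plugin is done via plugin.includes in nutch-site.xml. A minimal sketch; the rest of the plugin list is an assumption and should match whatever plugins are already in use, only protocol-http is swapped for protocol-httpclient:

    <!-- nutch-site.xml -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
    </property>
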
>>>>> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
>>>>> Hi,
>>>>> I've tried to reproduce it, but it works as expected:
>>>>> % cat robots.txt
>>>>> User-agent: *
>>>>> Disallow: /wpblog/feed/
>>>>> % cat test.txt
>>>>> http://www.example.com/wpblog/feed/
>>>>> http://www.example.com/wpblog/feed/index.html
>>>>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser 
>>>>> robots.txt test.txt 'myAgent'
>>>>> not allowed: http://www.example.com/wpblog/feed/
>>>>> not allowed: http://www.example.com/wpblog/feed/index.html
>>>>> No extra steps are required to make Nutch respect the robots.txt rules;
>>>>> the robots.txt only needs to be properly placed and readable.
>>>>> Best,
>>>>> Sebastian
>>>>> On 12/10/2017 11:16 PM, mabi wrote:
>>>>>> Hello,
>>>>>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not 
>>>>>> respect the robots.txt Disallow rules of my website. I have the following 
>>>>>> very simple robots.txt file:
>>>>>> User-agent: *
>>>>>> Disallow: /wpblog/feed/
>>>>>> Still, the /wpblog/feed/ URL gets parsed and finally indexed.
>>>>>> Do I maybe need to enable anything special in the nutch-site.xml 
>>>>>> config file?
>>>>>> Thanks,
>>>>>> Mabi
