Hi,

Yes, I tested the robots.txt manually with Nutch's
org.apache.nutch.protocol.RobotRulesParser as suggested in the previous mail,
and when I do that everything works correctly: the URLs which are disallowed
are reported as "not allowed" and the others as "allowed". So I don't
understand why this test passes but my crawl still ignores the rules.
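Roughly, the test looked like this (the URL list, file names and agent name
below are just placeholders for the ones from my setup):

% cat test.txt
https://www.example.com/wpblog/feed/
https://www.example.com/wpblog/
% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser \
    robots.txt test.txt 'MyCrawler'
not allowed: https://www.example.com/wpblog/feed/
allowed:     https://www.example.com/wpblog/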
​
I am using HTTPS for this website, with a 301 redirect that sends all HTTP
traffic to HTTPS. I also tried deleting the whole HBase table as well as the
Solr core, but that did not help either :(
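In case it matters, the cleanup looked roughly like this (the table and core
names are placeholders for the ones from my setup, and the last command
assumes a Solr version that ships the bin/solr script):

% echo "disable 'webpage'" | hbase shell
% echo "drop 'webpage'" | hbase shell
% bin/solr delete -c nutch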

Regards,
M.

>-------- Original Message --------
>Subject: Re: robots.txt Disallow not respected
>Local Time: December 12, 2017 10:09 AM
>UTC Time: December 12, 2017 9:09 AM
>From: [email protected]
>To: [email protected]
>
>Hi,
>
> did you already test whether the robots.txt file is correctly parsed
> and rules are applied as expected? See the previous response.
>
> If https or non-default ports are used: is the robots.txt also served for the
> other protocol/port combinations? See
>https://issues.apache.org/jira/browse/NUTCH-1752
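> (One quick way to verify that, with www.example.com standing in for the real
> host, is to fetch the robots.txt over both protocols and compare the output:
>
> % curl -si http://www.example.com/robots.txt
> % curl -si https://www.example.com/robots.txt
>
> This shows the status codes and whether the same rules are served in both
> cases.)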
>
> Also note that content is not removed when the robots.txt is changed.
> The robots.txt is only applied to a URL which is (re)fetched. To be sure,
> delete the web table (stored in HBase, etc.) and restart the crawl.
>
> Best,
> Sebastian
>
> On 12/11/2017 07:39 PM, mabi wrote:
>>Hi Sebastian,
>>I am already using the protocol-httpclient plugin, as I also require HTTPS. I
>>checked the access.log of the website I am crawling and can see that the
>>crawler did a GET on the robots.txt, as shown here:
>>123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 
>>200 223 "-" "MyCrawler/0.1"
>>What I also did is enable DEBUG logging in log4j.properties like this:
>>log4j.logger.org.apache.nutch=DEBUG
>>and then grep for "robots" in the hadoop.log file, but nothing could be found
>>there either, no errors, nothing.
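>>(Concretely, that grep was just something like the following, run from the
>>Nutch runtime directory, so the log path may differ in other setups:)
>>% grep -i robots logs/hadoop.log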
>>What else could I try or check?
>>Best,
>> M.
>>>-------- Original Message --------
>>> Subject: Re: robots.txt Disallow not respected
>>> Local Time: December 11, 2017 7:13 AM
>>> UTC Time: December 11, 2017 6:13 AM
>>> From: [email protected]
>>> To: [email protected]
>>>Hi,
>>>Check that robots.txt is acquired and parsed correctly. Try to change the 
>>>protocol to protocol-httpclient.
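>>>(In nutch-site.xml that means listing protocol-httpclient in plugin.includes;
>>>a sketch of how that value could look, with the rest of the plugin list kept
>>>as in your own setup:)
>>><property>
>>>  <name>plugin.includes</name>
>>>  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>></property>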
>>>Z
>>> On 2017-12-10 23:54:14, Sebastian Nagel [email protected] wrote:
>>> Hi,
>>>I've tried to reproduce it. But it works as expected:
>>>% cat robots.txt
>>> User-agent: *
>>> Disallow: /wpblog/feed/
>>>% cat test.txt
>>>http://www.example.com/wpblog/feed/
>>>http://www.example.com/wpblog/feed/index.html
>>>% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser 
>>>robots.txt test.txt 'myAgent'
>>> not allowed: http://www.example.com/wpblog/feed/
>>> not allowed: http://www.example.com/wpblog/feed/index.html
>>>There are no steps required to make Nutch respect the robots.txt rules.
>>> Only the robots.txt must be properly placed and readable.
>>>Best,
>>> Sebastian
>>>On 12/10/2017 11:16 PM, mabi wrote:
>>>>Hello,
>>>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not respect
>>>> the robots.txt Disallow rule on my website. I have the following very simple
>>>> robots.txt file:
>>>> User-agent: *
>>>> Disallow: /wpblog/feed/
>>>> Still, the /wpblog/feed/ URL gets parsed and eventually indexed.
>>>> Do I maybe need to enable anything special in the nutch-site.xml config file?
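>>>> (For context, the only robots-related settings I know of are the agent
>>>> properties; a sketch of what they look like in nutch-site.xml, with the agent
>>>> name being a placeholder:)
>>>> <property>
>>>>   <name>http.agent.name</name>
>>>>   <value>MyCrawler</value>
>>>> </property>
>>>> <property>
>>>>   <name>http.robots.agents</name>
>>>>   <value>MyCrawler,*</value>
>>>> </property>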
>>>> Thanks,
>>>> Mabi
>>>>
>
