FWIW, in versions of Nutch post 1.10 there is a robots.whitelist property that
you can use to explicitly whitelist sites for which robots.txt is ignored.
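
For example, a minimal nutch-site.xml sketch (untested; the key shown here,
http.robot.rules.whitelist, is the one from the 1.x nutch-default.xml, so
please double-check the exact name against your version):

  <property>
    <name>http.robot.rules.whitelist</name>
    <!-- comma-separated hostnames or IPs for which robots.txt parsing is
         skipped; only list sites you are explicitly allowed to crawl that way -->
    <value>www.example.com</value>
  </property>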

Cheers,
Chris




On 12/12/17, 2:32 PM, "Sebastian Nagel" <[email protected]> wrote:

    :)
    
    On 12/12/2017 11:11 PM, mabi wrote:
    > Sorry, my bad, I was using a Nutch build from a previous project that I had
    > modified and recompiled to ignore the robots.txt file (as there is no flag
    > to enable/disable that).
    >
    > I confirm that the parsing of robots.txt works.
    > 
    > 
    >> -------- Original Message --------
    >> Subject: Re: robots.txt Disallow not respected
    >> Local Time: December 12, 2017 10:54 PM
    >> UTC Time: December 12, 2017 9:54 PM
    >> From: [email protected]
    >> To: [email protected] <[email protected]>
    >>
    >> Hi,
    >>
    >> Yes, I tested the robots.txt manually using Nutch's
    >> org.apache.nutch.protocol.RobotRulesParser as suggested in the previous
    >> mail, and when I do that everything works correctly: the URLs which are
    >> disallowed get "not allowed" and the others get "allowed". So I don't
    >> understand why this works but not my crawling.
    >>
    >> I am using HTTPS for this website but have a 301 redirect that sends all
    >> HTTP traffic to HTTPS. I also tried deleting the whole HBase table as well
    >> as the Solr core, but that did not help either :(
    >>
    >> Regards,
    >> M.
    >>> -------- Original Message --------
    >>> Subject: Re: robots.txt Disallow not respected
    >>> Local Time: December 12, 2017 10:09 AM
    >>> UTC Time: December 12, 2017 9:09 AM
    >>> From: [email protected]
    >>> To: [email protected]
    >>> Hi,
    >>> did you already test whether the robots.txt file is correctly parsed
    >>> and rules are applied as expected? See the previous response.
    >>> If HTTPS or non-default ports are used: is the robots.txt also served
    >>> for other protocol/port combinations? See
    >>> https://issues.apache.org/jira/browse/NUTCH-1752
    >>> Also note that content is not removed when the robots.txt is changed.
    >>> The robots.txt is only applied to a URL which is (re)fetched. To be sure,
    >>> delete the web table (stored in HBase, etc.) and restart the crawl.
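    >>> A quick, untested sketch of how to check both points (replace the host
    >>> with yours, the agent name with your http.agent.name, and note that the
    >>> HBase table name is only a guess):
    >>>
    >>> # fetch the robots.txt via both protocols and compare the two files
    >>> curl -s http://www.example.com/robots.txt  > robots-http.txt
    >>> curl -s https://www.example.com/robots.txt > robots-https.txt
    >>> diff robots-http.txt robots-https.txt
    >>> # re-run the parser on the file the crawler actually sees
    >>> $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser \
    >>>   robots-https.txt test.txt 'MyCrawler'
    >>> # before re-crawling, list the HBase tables and truncate the web table
    >>> # (usually "webpage" or "<crawlId>_webpage")
    >>> echo "list" | hbase shell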
    >>> Best,
    >>> Sebastian
    >>> On 12/11/2017 07:39 PM, mabi wrote:
    >>>> Hi Sebastian,
    >>>> I am already using the protocol-httpclient plugin as I also require
    >>>> HTTPS, and I checked the access.log of the website I am crawling: it did
    >>>> a GET on the robots.txt, as you can see here:
    >>>> 123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"
    >>>> What I also did was enable DEBUG logging in log4j.properties like this:
    >>>> log4j.logger.org.apache.nutch=DEBUG
    >>>> and grep for "robots" in the hadoop.log file, but nothing could be found
    >>>> there either, no errors, nothing.
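    >>>> For reference, this is roughly what I grepped; the "denied" alternative
    >>>> is only a guess at what the fetcher might log when it skips a URL:
    >>>> grep -iE "robots|denied" logs/hadoop.log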
    >>>> What else could I try or check?
    >>>> Best,
    >>>> M.
    >>>>> -------- Original Message --------
    >>>>> Subject: Re: robots.txt Disallow not respected
    >>>>> Local Time: December 11, 2017 7:13 AM
    >>>>> UTC Time: December 11, 2017 6:13 AM
    >>>>> From: [email protected]
    >>>>> To: [email protected]
    >>>>> Hi,
    >>>>> Check that robots.txt is acquired and parsed correctly. Try changing the
    >>>>> protocol plugin to protocol-httpclient.
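    >>>>> For example, in conf/nutch-site.xml (sketch only -- keep your indexer
    >>>>> and whatever other plugins your setup already lists, this just swaps
    >>>>> protocol-http for protocol-httpclient):
    >>>>>
    >>>>> <property>
    >>>>>   <name>plugin.includes</name>
    >>>>>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
    >>>>> </property>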
    >>>>> Z
    >>>>> On 2017-12-10 23:54:14, Sebastian Nagel <[email protected]> wrote:
    >>>>> Hi,
    >>>>> I've tried to reproduce it. But it works as expected:
    >>>>> % cat robots.txt
    >>>>> User-agent: *
    >>>>> Disallow: /wpblog/feed/
    >>>>> % cat test.txt
    >>>>> http://www.example.com/wpblog/feed/
    >>>>> http://www.example.com/wpblog/feed/index.html
    >>>>> % $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
    >>>>> not allowed: http://www.example.com/wpblog/feed/
    >>>>> not allowed: http://www.example.com/wpblog/feed/index.html
    >>>>> There are no steps required to make Nutch respect the robots.txt rules;
    >>>>> the robots.txt only needs to be properly placed and readable.
    >>>>> Best,
    >>>>> Sebastian
    >>>>> On 12/10/2017 11:16 PM, mabi wrote:
    >>>>>> Hello,
    >>>>>> I am crawling my website with Nutch 2.3.1 and somehow Nutch does not
    >>>>>> respect the Disallow rule in my website's robots.txt. I have the
    >>>>>> following very simple robots.txt file:
    >>>>>> User-agent: *
    >>>>>> Disallow: /wpblog/feed/
    >>>>>> Still, the /wpblog/feed/ URL gets parsed and finally indexed.
    >>>>>> Do I need to enable anything special in the nutch-site.xml config file
    >>>>>> maybe?
    >>>>>> Thanks,
    >>>>>> Mabi
    
    

