There is a robots.txt whitelist. You can find documentation here: https://wiki.apache.org/nutch/WhiteListRobots
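If the goal is just to fetch from hosts you control, the whitelist is a configuration change rather than a code change. As a rough sketch (the property name http.robot.rules.whitelist and the hostnames below are my reading of the wiki page above, so verify the exact syntax there), you would add something like this to conf/nutch-site.xml:

    <property>
      <name>http.robot.rules.whitelist</name>
      <!-- Comma-separated hostnames for which robots.txt rules are skipped.
           The hosts listed here are placeholders. -->
      <value>example.com,www.example.org</value>
    </property>

Fetches to the whitelisted hosts then bypass robots.txt, while every other host is still crawled politely. Only whitelist sites you own or have explicit permission to crawl this way.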
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:

Sure, you can remove the check from the code and recompile.

Under what circumstances would you need to ignore robots.txt? Would
something like allowing access for particular IPs or user agents be an
alternative?

Tom

On 29/11/16 04:07, jyoti aditya wrote:
> Hi team,
>
> Can we use Nutch to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
> With regards,
> Jyoti Aditya

