There is a robots.txt whitelist. You can find documentation here: https://wiki.apache.org/nutch/WhiteListRobots
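If the goal is just to fetch from hosts you control, the whitelist is a configuration change rather than a code change. As a rough sketch (the property name http.robot.rules.whitelist and the hostnames below are my reading of the wiki page above, so verify the exact syntax there), you would add something like this to conf/nutch-site.xml:

    <property>
      <name>http.robot.rules.whitelist</name>
      <!-- Comma-separated hostnames for which robots.txt rules are skipped.
           The hosts listed here are placeholders. -->
      <value>example.com,www.example.org</value>
    </property>

Fetches to the whitelisted hosts then bypass robots.txt, while every other host is still crawled politely. Only whitelist sites you own or have explicit permission to crawl this way.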
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:

Sure, you can remove the check from the code and recompile.

Under what circumstances would you need to ignore robots.txt? Would
something like allowing access for particular IPs or user agents be an
alternative?

Tom

On 29/11/16 04:07, jyoti aditya wrote:
> Hi team,
>
> Can we use Nutch to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
> With regards,
> Jyoti Aditya

