I don't recall changing anything related to robots.txt. I want us to be as polite as possible.

On May 25, 2016 12:22 AM, "Mattmann, Chris A (3980)" <[email protected]> wrote:
> Hi,
>
> For security research, there is an option to white-list robots.txt.
> It’s not enabled by default and must be directly enabled.
>
> The solution is - there isn’t one. People used to just hack
> Nutch and do the same thing by commenting out a line of code
> which accomplished the same check.
>
> Those people that are using Nutch and not obeying robots.txt
> are doing just that. But Nutch itself by default does obey it.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 5/24/16, 3:17 PM, "BlackIce" <[email protected]> wrote:
>
> >Hi,
> >
> >I've just seen on a website which tracks bots, that "Tarantula", our
> >nutch 1.11 based crawler is being classified as not obeying robots.txt.
> >
> >What's the solution?
>
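
One way to double-check a classification like this is to test the crawler's user-agent token against a site's robots.txt rules offline. The sketch below uses Python's standard urllib.robotparser rather than Nutch's own robots.txt handling, so it only approximates what Nutch would decide. The helper name is_allowed, the sample rules, and the example.com URLs are assumptions made for illustration; only the "Tarantula" agent name comes from the thread.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules from a string,
    so no network access is needed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content: block "Tarantula" from /private/, allow everyone else.
SAMPLE_ROBOTS = """\
User-agent: Tarantula
Disallow: /private/

User-agent: *
Disallow:
"""

if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/a.html",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS, "Tarantula", url))
```

Comparing the URLs the crawler actually fetched (from its logs) against a check like this would show whether the tracker's claim holds for a given site.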

