Hi,

By default, as I mentioned, Nutch does obey robots.txt. There is a whitelist property that can be set in nutch-default to selectively disable the robots.txt check for certain sites (again, for valid security research use cases).
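To illustrate the behaviour described above, here is a minimal Python sketch. It is not Nutch's actual code. It assumes a hypothetical `may_fetch` helper and a hypothetical `ROBOTS_WHITELIST` set, and uses only the standard library's `urllib.robotparser`. robots.txt is obeyed unless the host has been explicitly whitelisted, and the whitelist is empty by default.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist of hosts whose robots.txt is ignored.
# Empty by default, so every host's robots.txt is obeyed.
ROBOTS_WHITELIST = set()


def may_fetch(url, robots_lines, agent="Tarantula", whitelist=ROBOTS_WHITELIST):
    """Return True if `agent` may fetch `url` given the robots.txt lines."""
    host = urlparse(url).hostname
    if host in whitelist:
        # Explicitly whitelisted host: skip the robots.txt check entirely.
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse already-downloaded robots.txt content
    return parser.can_fetch(agent, url)


# Example robots.txt content (illustrative only).
robots = ["User-agent: *", "Disallow: /private/"]

print(may_fetch("http://example.com/public/page", robots))
print(may_fetch("http://example.com/private/page", robots))
print(may_fetch("http://example.com/private/page", robots,
                whitelist={"example.com"}))
```

The point of the sketch is that the polite path is the default: the disallowed URL is only fetched when someone has deliberately added the host to the whitelist.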
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 5/24/16, 3:24 PM, "BlackIce" <[email protected]> wrote:

>I don't recall messing with anything to do with robots.txt, I want us to
>be as polite as possible.
>On May 25, 2016 12:22 AM, "Mattmann, Chris A (3980)" <
>[email protected]> wrote:
>
>> Hi,
>>
>> For security research, there is an option to white-list robots.txt.
>> It’s not enabled by default and must be directly enabled.
>>
>> The solution is - there isn’t one. People used to just hack
>> Nutch and do the same thing by commenting out a line of code
>> which accomplished the same check.
>>
>> Those people that are using Nutch and not obeying robots.txt
>> are doing just that. But Nutch itself by default does obey it.
>>
>> Cheers,
>> Chris
>>
>> On 5/24/16, 3:17 PM, "BlackIce" <[email protected]> wrote:
>>
>> >Hi,
>> >
>> >I've just seen on a website which tracks bots, that "Tarantula", our
>> >nutch 1.11 based crawler is being classified as not obeying robots.txt.
>> >
>> >What's the solution?
>>

