I don't recall messing with anything to do with robots.txt. I want us to
be as polite as possible.
On May 25, 2016 12:22 AM, "Mattmann, Chris A (3980)" <
[email protected]> wrote:

> Hi,
>
> For security research, there is an option to white-list robots.txt.
> It’s not enabled by default and must be explicitly enabled.
>
> The solution is - there isn’t one. People used to just hack
> Nutch and do the same thing by commenting out the line of code
> that performed the robots.txt check.
>
> Those people that are using Nutch and not obeying robots.txt
> are doing just that. But Nutch itself by default does obey it.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 5/24/16, 3:17 PM, "BlackIce" <[email protected]> wrote:
>
> >Hi,
> >
> >I've just seen on a website which tracks bots that "Tarantula", our
> >Nutch 1.11 based crawler, is being classified as not obeying robots.txt.
> >
> >What's the solution?
>
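
One way to sanity-check a report like this is to see how a site's
robots.txt applies to the agent name the crawler announces. The sketch
below is not Nutch code. It uses Python's standard urllib.robotparser,
and the sample rules, the example.com URLs and the helper name
is_allowed are all made up for illustration. It assumes the crawler
identifies itself as "Tarantula".

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: Tarantula is only kept out of /private/,
# every other agent is kept out of the whole site.
SAMPLE_ROBOTS = """\
User-agent: Tarantula
Disallow: /private/

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    for agent, url in [
        ("Tarantula", "http://example.com/public/page"),
        ("Tarantula", "http://example.com/private/page"),
        ("OtherBot", "http://example.com/public/page"),
    ]:
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS, agent, url) else "blocked"
        print(f"{agent} -> {url}: {verdict}")
```

If the crawler's fetch log shows requests for URLs that a check like
this reports as blocked, the classification is probably deserved. If
not, the agent name the crawler sends may differ from the one the
tracking site is matching.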
