Hi,

By default, as I mentioned, Nutch does obey robots.txt. There is
a whitelist property, defined in nutch-default.xml, that can be set to
selectively disable the robots.txt check for certain sites (again, for
valid security research use cases).
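
To make the behavior concrete, here is a rough sketch of what "obey
robots.txt unless the host is whitelisted" amounts to. This is not
Nutch code: it uses Python's standard urllib.robotparser, and the
function name is_allowed, the sample rules, and the host example.com
are all made up for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, agent, robots_txt, whitelist=frozenset()):
    """Return True if `agent` may fetch `url`.

    Hosts listed in `whitelist` (a hypothetical stand-in for a
    whitelist setting) skip the robots.txt check entirely.
    """
    host = urlparse(url).hostname
    if host in whitelist:
        # Whitelisted host: the robots.txt rules are not consulted.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html"):
        url = "http://example.com" + path
        print(path, is_allowed(url, "Tarantula", ROBOTS_TXT))
```

With an empty whitelist, which corresponds to the default, the
disallowed path is refused; it is only fetched when the host has been
explicitly whitelisted.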

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 5/24/16, 3:24 PM, "BlackIce" <[email protected]> wrote:

>I don't recall changing anything to do with robots.txt; I want us to
>be as polite as possible.
>On May 25, 2016 12:22 AM, "Mattmann, Chris A (3980)" <
>[email protected]> wrote:
>
>> Hi,
>>
>> For security research, there is an option to white-list sites so
>> that their robots.txt is ignored. It’s not enabled by default and
>> must be explicitly enabled.
>>
>> The solution is: there isn’t one. People used to just hack
>> Nutch and get the same result by commenting out the line of code
>> that performed the robots.txt check.
>>
>> Those people that are using Nutch and not obeying robots.txt
>> are doing just that. But Nutch itself by default does obey it.
>>
>> Cheers,
>> Chris
>>
>> On 5/24/16, 3:17 PM, "BlackIce" <[email protected]> wrote:
>>
>> >Hi,
>> >
>> >I've just seen on a website that tracks bots that "Tarantula", our
>> >Nutch 1.11-based crawler, is being classified as not obeying robots.txt.
>> >
>> >What's the solution?
>>
