Re: question about robots.txt

Mattmann, Chris A (3980) Sat, 13 Dec 2014 07:33:37 -0800

Hi Shane,

The crawler name comes from the http.agent.* properties in your
nutch-default.xml or your nutch-site.xml. You give your crawler its
identifying name, description, url, email and version.
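
As for what the sites should add to robots.txt, here is a rough sketch rather than an official recipe. It assumes your http.agent.name is set to "MyNutchCrawler" (a made-up name, use your own) and that the site wants to admit only that crawler. The check uses Python's standard urllib.robotparser, not Nutch's own robots handling, so treat it as a sanity check of the file only; example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical value of http.agent.name from nutch-site.xml (an assumption).
AGENT_NAME = "MyNutchCrawler"

# What a site owner might add to robots.txt: allow this crawler,
# keep every other crawler out. The policy itself is only an example.
ROBOTS_TXT = """\
User-agent: MyNutchCrawler
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/docs/page.html"  # placeholder URL
    print(AGENT_NAME, is_allowed(ROBOTS_TXT, AGENT_NAME, page))
    print("OtherBot", is_allowed(ROBOTS_TXT, "OtherBot", page))
```

The User-agent line in robots.txt should carry the same token you put in http.agent.name, so send the site owners that exact string.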


Cheers!

Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Shane Wood <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, December 13, 2014 at 1:27 AM
To: "[email protected]" <[email protected]>
Subject: question about robots.txt

>I am asking a few websites to allow me to index their sites. What should
>they add to robots.txt, and where do I get the exact name of my
>crawler?
>
>Cheers.
>Shane
