Hi Girish,

> In the hadoop.log I see “robots.txt whitelist not configured”

This means that the property is somehow not set properly.

Shouldn't it be "http.robot.rules.whitelist" (note the "http." prefix)?
See the property definition below.

Also make sure that the modified nutch-site.xml is deployed.
If you modify it in conf/ you have to run "ant runtime"
to deploy it. Even better, place all modified config files
(nutch-site.xml, regex-urlfilter.txt, etc.) in a separate folder and
let the environment variable NUTCH_CONF_DIR point to it.
The script $NUTCH_HOME/bin/nutch will then load the right
configuration file. See also 
https://wiki.apache.org/nutch/NutchConfigurationFiles
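
Roughly like this (the folder ~/nutch-conf is just an example, use any
path you like):

  # copy the config files you want to modify into a separate folder
  mkdir -p ~/nutch-conf
  cp $NUTCH_HOME/conf/nutch-site.xml \
     $NUTCH_HOME/conf/regex-urlfilter.txt ~/nutch-conf/
  # edit the copies in ~/nutch-conf/, then point bin/nutch to them:
  export NUTCH_CONF_DIR=~/nutch-conf
  $NUTCH_HOME/bin/nutch <command> ...   # now picks up NUTCH_CONF_DIR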

Best,
Sebastian

<property>
  <name>http.robot.rules.whitelist</name>
  <value></value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
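
With the value from your mail, the entry in your nutch-site.xml should
presumably look like this (test.org taken from your message, and again
only if the owner of test.org allows you to ignore its robots.txt):

<property>
  <name>http.robot.rules.whitelist</name>
  <value>test.org</value>
</property>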


On 09/26/2015 08:59 AM, Girish Rao wrote:
> Hi,
> 
> I am trying to set the whitelist property in nutch-site.xml
> as below:
> <property>
>   <name>robot.rules.whitelist</name>
>   <value>test.org</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore 
> robot rules parsing for.
>   </description>
> </property>
> 
> However, when I look at the crawl data, I still see that the files have not
> been crawled and they have a status like “blocked by robots.txt”
> 
> In the hadoop.log I see “robots.txt whitelist not configured”
> 
> Is there anything else that needs to be done?
> 
> Regards
> Girish
> 
