Hi Sebastian,

Thanks! I had copied that from the wiki located at
https://wiki.apache.org/nutch/WhiteListRobots.

Once I changed it to http.robot.rules.whitelist, I see in the logs that
test.org is whitelisted. However, on crawling the site, I still get the
“blocked by robots.txt” status.
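
For reference, the property block I have now looks roughly like this
(a minimal sketch, using the same test.org value as before):

  <property>
    <name>http.robot.rules.whitelist</name>
    <value>test.org</value>
  </property>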

Regards
Girish



> On Sep 26, 2015, at 4:44 AM, Sebastian Nagel <[email protected]> wrote:
> 
> Hi Girish,
> 
>> In the hadoop.log I see “robots.txt whitelist not configured”.
> 
> This means that the property is somehow not set properly.
> 
> Shouldn't it be "http.robot.rules.whitelist", as shown below?
> 
> Also make sure that the modified nutch-site.xml is deployed.
> If you modify it in conf/ you have to run "ant runtime"
> to deploy it. Better place all modified config files (nutch-site.xml,
> regex-urlfilter.txt, etc.) in a separate folder and let the
> environment variable NUTCH_CONF_DIR point to it.
> The script $NUTCH_HOME/bin/nutch will then load the right
> configuration file. See also
> https://wiki.apache.org/nutch/NutchConfigurationFiles
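> 
> For example (the folder path here is just an illustration):
> 
>   mkdir -p ~/nutch-conf
>   cp conf/nutch-site.xml conf/regex-urlfilter.txt ~/nutch-conf/
>   export NUTCH_CONF_DIR=~/nutch-conf
>   # bin/nutch now reads its configuration from $NUTCH_CONF_DIR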
> 
> Best,
> Sebastian
> 
> <property>
>  <name>http.robot.rules.whitelist</name>
>  <value></value>
>  <description>Comma separated list of hostnames or IP addresses to ignore
>  robot rules parsing for. Use with care and only if you are explicitly
>  allowed by the site owner to ignore the site's robots.txt!
>  </description>
> </property>
> 
> 
> On 09/26/2015 08:59 AM, Girish Rao wrote:
>> Hi,
>> 
>> I am trying to set the whitelist property in nutch-site.xml
>> as below:
>> <property>
>>  <name>robot.rules.whitelist</name>
>>  <value>test.org</value>
>>  <description>Comma separated list of hostnames or IP addresses to ignore 
>> robot rules parsing for.
>>  </description>
>> </property>
>> 
>> However, when I look at the crawl data, I still see that the files have not
>> been crawled and they have a status like “blocked by robots.txt”.
>> 
>> In the hadoop.log I see “robots.txt whitelist not configured”.
>> 
>> Is there anything else that needs to be done?
>> 
>> Regards
>> Girish
>> 
> 
