Hi Sebastian,

Thanks! I had copied that from the wiki located at
https://wiki.apache.org/nutch/WhiteListRobots.
Once I changed it to http.robot.rules.whitelist, I see in the logs that
test.org is whitelisted. However, on crawling the site, I still get the
"blocked by robots.txt" status.

Regards
Girish

> On Sep 26, 2015, at 4:44 AM, Sebastian Nagel <[email protected]> wrote:
>
> Hi Girish,
>
>> in the hadoop.log I see "robots.txt whitelist not configured"
>
> This means that the property is somehow not set properly.
> Shouldn't it be "http.robot.rules.whitelist"? See below.
>
> Also make sure that the modified nutch-site.xml is deployed.
> If you modify it in conf/ you have to run "ant runtime"
> to deploy it. Better: place all modified config files (nutch-site.xml,
> regex-urlfilter.txt, etc.) in a separate folder and let the
> environment variable NUTCH_CONF_DIR point to it.
> The script $NUTCH_HOME/bin/nutch will then load the right
> configuration files. See also
> https://wiki.apache.org/nutch/NutchConfigurationFiles
>
> Best,
> Sebastian
>
> <property>
>   <name>http.robot.rules.whitelist</name>
>   <value></value>
>   <description>Comma separated list of hostnames or IP addresses to ignore
>   robot rules parsing for. Use with care and only if you are explicitly
>   allowed by the site owner to ignore the site's robots.txt!
>   </description>
> </property>
>
> On 09/26/2015 08:59 AM, Girish Rao wrote:
>> Hi,
>>
>> I am trying to set the whitelist property in nutch-site.xml as below:
>>
>> <property>
>>   <name>robot.rules.whitelist</name>
>>   <value>test.org</value>
>>   <description>Comma separated list of hostnames or IP addresses to ignore
>>   robot rules parsing for.
>>   </description>
>> </property>
>>
>> However, when I look at the crawl data, I still see that the files have
>> not been crawled and they have a status like "blocked by robots.txt".
>>
>> In the hadoop.log I see "robots.txt whitelist not configured".
>>
>> Is there anything else that needs to be done?
>>
>> Regards
>> Girish
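Putting the two messages together, the working nutch-site.xml entry combines
the property name from Sebastian's reply with the host from the original
post. A minimal sketch (test.org stands in for the real host, as in the
thread):

    <property>
      <name>http.robot.rules.whitelist</name>
      <value>test.org</value>
      <description>Comma separated list of hostnames or IP addresses to ignore
      robot rules parsing for. Use with care and only if you are explicitly
      allowed by the site owner to ignore the site's robots.txt!
      </description>
    </property>

Note the http. prefix: the truncated name robot.rules.whitelist used in the
original post is not a known property, which is why hadoop.log reported
"robots.txt whitelist not configured".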

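Sebastian's deployment advice as a shell sketch (the ~/nutch-conf path is
illustrative, not from the thread):

    # Option 1: rebuild the runtime so that edits under conf/ get deployed
    cd $NUTCH_HOME
    ant runtime

    # Option 2: keep modified config files in a separate folder and let
    # NUTCH_CONF_DIR point to it; bin/nutch loads them from there
    mkdir -p ~/nutch-conf
    cp $NUTCH_HOME/conf/nutch-site.xml ~/nutch-conf/
    cp $NUTCH_HOME/conf/regex-urlfilter.txt ~/nutch-conf/
    export NUTCH_CONF_DIR=~/nutch-conf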

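To check which configuration actually took effect, one can grep hadoop.log
for the whitelist messages quoted in the thread (the log path below assumes
the default runtime/local layout; adjust it to your setup):

    grep -i whitelist $NUTCH_HOME/runtime/local/logs/hadoop.log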