Hi,

I'm using Nutch 1.9 with Solr 4.9.1 and I am trying to extract news articles. Nutch works for some sites, but for others the fetch fails with HTTP 403. This is the output when I run parsechecker:

bin/nutch parsechecker -dumpText http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

fetching: http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
With bin/crawl I get:

fetch of http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977 failed with: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

The regex filter I entered for this site is:

+^http://([a-z0-9]*\.)*dnaindia.com

nutch-default.xml has this default value:

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if /robots.txt
  doesn't exist. This should probably mean that we are allowed to crawl the site
  nonetheless. If this is set to false, then such sites will be treated as
  forbidden.</description>
</property>

Anything I am missing? For what reason am I still getting a failed fetch?

Ankit
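One more data point, in case it matters: from what I've read, Nutch refuses to fetch at all unless http.agent.name is set, and some servers return 403 for empty or unfamiliar User-Agent strings. My nutch-site.xml override is along these lines (the agent name and description here are placeholders, not my real values):

```xml
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
  <description>Name sent in the User-Agent header. Some servers reject
  requests whose agent string is empty or looks like a bot.</description>
</property>
```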
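P.S. To rule out the URL filter itself, I checked that the pattern above actually matches the article URL with a quick throwaway script (plain Python here, while Nutch applies Java regexes, but the pattern is simple enough that the behavior should be the same):

```python
import re

# Pattern from regex-urlfilter.txt, without the leading "+" accept marker.
# Note the dot before "com" is unescaped, so it matches any character;
# for this check that makes no difference.
pattern = r"^http://([a-z0-9]*\.)*dnaindia.com"

url = ("http://www.dnaindia.com/india/"
       "report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977")

# re.match anchors at the start of the string, like "^" in the filter.
print(bool(re.match(pattern, url)))  # True -> the URL passes the filter
```

So the filter seems fine; the URL is accepted and the 403 happens at fetch time.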
