Hi, I'm using Nutch 1.9 with Solr 4.9.1 and trying to extract news articles.
Nutch works for some sites, but for others the fetch fails with HTTP 403.
This is the output when I run parsechecker:
bin/nutch parsechecker -dumpText http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
fetching: http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
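
As a sanity check outside Nutch, I can also query the URL with plain curl and a browser-like User-Agent to see what status code the server returns (standard curl flags only, nothing Nutch-specific):

# -A sets the User-Agent; -w prints just the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" \
  "http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977"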
With bin/crawl I get:
fetch of http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977 failed with: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
The regex filter I added for this site in conf/regex-urlfilter.txt is:

+^http://([a-z0-9]*\.)*dnaindia.com
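
For context, the surrounding entries in conf/regex-urlfilter.txt are the shipped defaults (sketched here from the stock file; only the last rule is mine, replacing the stock "+." accept-everything rule per the tutorial):

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept only URLs under dnaindia.com (note: the unescaped dot in my rule
# also matches any character, so "dnaindia\.com" would be the stricter form)
+^http://([a-z0-9]*\.)*dnaindia\.com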
nutch-default.xml has this default value:

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>
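
For completeness, my understanding is that nutch-default.xml is not meant to be edited directly and that any override belongs in conf/nutch-site.xml; a minimal sketch of such an override, assuming the stock conf layout, would be:

<?xml version="1.0"?>
<configuration>
  <!-- local overrides; values here take precedence over nutch-default.xml -->
  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
  </property>
</configuration>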
Am I missing anything? Why am I still getting a failed fetch?
Ankit                                     
