Hi, I'm using Nutch 1.9 with Solr 4.9.1 on OS X to extract news articles. Nutch works well for some sites, but for others the fetch fails with HTTP error 403.
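Could it be a user-agent issue rather than a robots.txt issue? Some news sites return 403 to clients that don't identify as a browser. For reference, Nutch's agent string is set via the http.agent.name property in conf/nutch-site.xml; a minimal sketch is below (the value "MyTestCrawler" is just a placeholder, not my actual setting):

```xml
<!-- conf/nutch-site.xml: how the fetcher identifies itself to remote servers.
     "MyTestCrawler" is a placeholder value for illustration only. -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
  <description>Value sent in the HTTP 'User-Agent' request header.</description>
</property>
```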
This is the output when I run parsechecker:

    dumpText http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
    fetching: http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
    Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url= http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

When I run bin/crawl I get the following:

    fetch of http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977 failed with: Http code=403, url= http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

The regex filter for the site is:

    +^http://([a-z0-9]*\.)*dnaindia.com

nutch-default.xml has the default value:

    <property>
      <name>http.robots.403.allow</name>
      <value>true</value>
      <description>Some servers return HTTP status 403 (Forbidden) if
      /robots.txt doesn't exist. This should probably mean that we are
      allowed to crawl the site nonetheless. If this is set to false,
      then such sites will be treated as forbidden.</description>
    </property>

Am I missing something? Why are these fetches failing?

--
Regards,
Ankit Goel
http://about.me/ankitgoel

