Try tweaking your user agent. If I run a parsechecker with Nutch I can 
replicate the problem: apparently that site has implemented some sort of 
user-agent filtering. Check in your hadoop.log which UA you're using. I did a 
little experiment, and in Chrome, spoofing the same UA I've configured in 
Nutch, I get the same behavior. It seems the Varnish server they run as a 
reverse proxy has a rule to block every request whose UA matches Nutch.
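You can reproduce this kind of filtering locally. The sketch below (plain
Python standard library, nothing Nutch-specific; the 403-on-"Nutch" rule is my
assumption about what their proxy does) starts a throwaway HTTP server that
rejects any request whose User-Agent contains "Nutch", then fetches it with
the agent from the log above and with a browser-like one:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class UAFilterHandler(BaseHTTPRequestHandler):
    """Mimics a reverse-proxy rule that blocks crawler user agents."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "Nutch" in ua:
            self.send_response(403)  # forbidden, like the live site
        else:
            self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), UAFilterHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def status_for(ua):
    """Return the HTTP status the server gives a request with this UA."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/", headers={"User-Agent": ua})
    try:
        return urllib.request.urlopen(req).status
    except urllib.error.HTTPError as err:
        return err.code

print(status_for("orion/Nutch-1.10-SNAPSHOT"))  # 403
print(status_for("Mozilla/5.0"))                # 200
```

Pointing `status_for` at the real site instead of the local server should show
the same split, which is exactly the symptom in your parsechecker output.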

This is the relevant output of my logs/hadoop.log file:

2015-05-01 02:51:17,034 INFO  parse.ParserChecker - fetching: 
http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
2015-05-01 02:51:17,643 INFO  http.Http - http.proxy.host = 127.0.0.1
2015-05-01 02:51:17,643 INFO  http.Http - http.proxy.port = 3128
2015-05-01 02:51:17,643 INFO  http.Http - http.timeout = 10000
2015-05-01 02:51:17,643 INFO  http.Http - http.content.limit = 65536
2015-05-01 02:51:17,643 INFO  http.Http - http.agent = orion/Nutch-1.10-SNAPSHOT
2015-05-01 02:51:17,643 INFO  http.Http - http.accept.language = 
en-us,en-gb,en;q=0.7,*;q=0.3
2015-05-01 02:51:17,643 INFO  http.Http - http.accept = 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=
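If you want to change what Nutch announces, the UA above is built from the
http.agent.name property (plus the version suffix), so overriding it in
conf/nutch-site.xml is the usual route. A sketch, assuming the proxy only
filters on the "Nutch" token; "MyCrawler" is just a placeholder value:

<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>

Keep in mind that sidestepping a UA block may go against the site's crawling
policy, so check their robots.txt and terms of use first.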

Hope it helps,

----- Original Message -----
From: "Ankit Goel" <[email protected]>
To: [email protected]
Sent: Friday, May 1, 2015 2:16:43 AM
Subject: [MASSMAIL]Nutch 1.9 Error 403 : Failed fetch

Hi, I'm using Nutch 1.9 with Solr 4.9.1 on OS X. I am trying to extract news
articles. Nutch works well for some sites, but for others I get error 403:
failed fetch.

This is the output when I run parsechecker:

dumpText
http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

fetching:
http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

Fetch failed with protocol status: exception(16), lastModified=0: Http
code=403, url=
http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

When I run bin/crawl I get the following :

fetch of
http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
failed
with: Http code=403, url=
http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

The regex filter for the site is
+^http://([a-z0-9]*\.)*dnaindia.com

nutch-default.xml has the default value

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

Am I missing something? Why am I getting a failed fetch?


-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
