Hi Jorge,

Thanks for that. But I'm not sure exactly what changes I should make to the UA to "fool" the server into thinking it's not a Nutch bot. I'm still a noob at this, so I'm not sure what to tweak.
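From the property descriptions in nutch-default.xml, these look like the relevant knobs in conf/nutch-site.xml, though I haven't confirmed which values actually get past Varnish (the browser-style string below is just an illustration, not something I've tested):

```xml
<!-- conf/nutch-site.xml (overrides nutch-default.xml).
     Nutch assembles its User-Agent from http.agent.name and
     http.agent.version; the browser-style value here is purely
     illustrative, not a recommendation. -->
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MyCrawler/1.0)</value>
</property>
<property>
  <name>http.agent.version</name>
  <value></value>
</property>
```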
I also looked up a couple of other companies that deploy Varnish; so far only one of them responded. NYT is one of the companies listed as using Varnish. When I ran parsechecker against one of its links, I got a temp_moved(13) error, which I found is a redirect status. I changed http.redirect.max to 5, but I still got the same temp_moved error. Just wondering if Varnish servers have a thing for Nutch :)

On Fri, May 1, 2015 at 12:28 PM, Jorge Luis Betancourt González <[email protected]> wrote:

> Try tweaking your user agent. If I do a parsechecker using Nutch I can
> replicate the problem; apparently that site has implemented some sort of
> user-agent filtering. Check in your hadoop.log which UA you're using. I did
> a little experiment: in Chrome, spoofing the same UA as I've configured in
> Nutch, I get the same behavior. Apparently the Varnish server they have
> configured as a reverse proxy has some kind of rule to block every request
> with a UA that matches Nutch.
>
> This is the relevant output of my logs/hadoop.log file:
>
> 2015-05-01 02:51:17,034 INFO parse.ParserChecker - fetching: http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
> 2015-05-01 02:51:17,643 INFO http.Http - http.proxy.host = 127.0.0.1
> 2015-05-01 02:51:17,643 INFO http.Http - http.proxy.port = 3128
> 2015-05-01 02:51:17,643 INFO http.Http - http.timeout = 10000
> 2015-05-01 02:51:17,643 INFO http.Http - http.content.limit = 65536
> 2015-05-01 02:51:17,643 INFO http.Http - http.agent = orion/Nutch-1.10-SNAPSHOT
> 2015-05-01 02:51:17,643 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-05-01 02:51:17,643 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=
>
> Hope it helps,
>
> ----- Original Message -----
> From: "Ankit Goel" <[email protected]>
> To: [email protected]
> Sent: Friday, May 1, 2015 2:16:43 AM
> Subject: [MASSMAIL]Nutch 1.9 Error 403 : Failed fetch
>
> Hi, I'm using Nutch 1.9 with Solr 4.9.1 on OS X. I am trying to extract
> news articles. Nutch works well for some sites, but for others I get
> error 403 failed fetch.
>
> This is the output when I run parsechecker:
>
> dumpText
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
> fetching: http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
> Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> When I run bin/crawl I get the following:
>
> fetch of http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977 failed with: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> The regex filter for the site is
> +^http://([a-z0-9]*\.)*dnaindia.com
>
> nutch-default.xml has the default value
>
> <property>
>   <name>http.robots.403.allow</name>
>   <value>true</value>
>   <description>Some servers return HTTP status 403 (Forbidden) if
>   /robots.txt doesn't exist. This should probably mean that we are
>   allowed to crawl the site nonetheless. If this is set to false,
>   then such sites will be treated as forbidden.</description>
> </property>
>
> Am I missing something? For what reason am I getting a failed fetch?
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel

--
Regards,
Ankit Goel
http://about.me/ankitgoel
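P.S. For anyone who finds this thread in the archives: the redirect change I mentioned was made in conf/nutch-site.xml, roughly as below. As I understand the description in nutch-default.xml, http.redirect.max is the number of redirects the fetcher will follow inline, and 0 or a negative value makes it record redirects for later fetching instead; 5 is just the value I tried, not a recommendation:

```xml
<!-- conf/nutch-site.xml (overrides nutch-default.xml).
     http.redirect.max controls how many redirects the fetcher
     follows before giving up; 5 is only the value tried here. -->
<property>
  <name>http.redirect.max</name>
  <value>5</value>
</property>
```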

