Hi Jorge,
Thanks for that. But I'm not sure exactly what changes I should make to the
UA to "fool" the server into thinking it's not a Nutch bot. I'm still a noob
at this, so I'm not sure what to tweak.
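
In case it helps a future reader: Nutch builds the UA string from the http.agent.* properties (Jorge's log below shows "orion/Nutch-1.10-SNAPSHOT", i.e. name/version). A minimal sketch of overriding them in conf/nutch-site.xml so the string no longer contains "Nutch" might look like this — the values are illustrative placeholders, not a recommended configuration:

```xml
<!-- conf/nutch-site.xml: entries here override nutch-default.xml -->
<property>
  <name>http.agent.name</name>
  <!-- hypothetical name; anything without "Nutch" in it -->
  <value>MyCrawler</value>
</property>
<property>
  <name>http.agent.version</name>
  <!-- normally Nutch-${version}; emptying it is meant to drop the
       "/Nutch-1.10-SNAPSHOT" suffix -->
  <value></value>
</property>
```

Whether an empty http.agent.version actually drops the suffix may depend on the Nutch version, so check the "http.agent =" line in hadoop.log after restarting. As a courtesy, keep http.agent.url and http.agent.email set so site operators can identify the crawler rather than fully impersonating a browser.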

I also looked up a couple of other companies that reportedly deploy Varnish;
so far only one of them responded. NYT is one of the companies listed as
using Varnish. When I ran parsechecker on one of its links, I got a
temp_moved(13) error, which I found is a redirect status. I raised
http.redirect.max to 5, but I still got the same temp_moved error. Just
wondering if Varnish servers have a thing for Nutch :)
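
For what it's worth, a Varnish rule of the kind Jorge describes would live in the site's VCL. A purely hypothetical sketch (we obviously can't see the site's actual config; syntax shown is Varnish 4):

```vcl
sub vcl_recv {
    # Reject any request whose User-Agent mentions "nutch", case-insensitively,
    # with a synthetic 403 before it ever reaches the backend.
    if (req.http.User-Agent ~ "(?i)nutch") {
        return (synth(403, "Forbidden"));
    }
}
```

A rule like this would explain why the same URL works from a browser but returns 403 to Nutch, and why changing http.redirect.max alone doesn't help when the block happens before any redirect.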

On Fri, May 1, 2015 at 12:28 PM, Jorge Luis Betancourt González <
[email protected]> wrote:

> Try tweaking your user agent. If I do a parsechecker run using Nutch I can
> replicate the problem; apparently that site has implemented some sort of
> user-agent filtering. Check in your hadoop.log which UA you're using. I did a
> little experiment: in Chrome, spoofing the same UA as I've configured in
> Nutch, I get the same behavior. Apparently the Varnish server they have
> configured as a reverse proxy has some kind of rule to block every request
> whose UA matches Nutch.
>
> This is the relevant output of my logs/hadoop.log file:
>
> 2015-05-01 02:51:17,034 INFO  parse.ParserChecker - fetching:
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
> 2015-05-01 02:51:17,643 INFO  http.Http - http.proxy.host = 127.0.0.1
> 2015-05-01 02:51:17,643 INFO  http.Http - http.proxy.port = 3128
> 2015-05-01 02:51:17,643 INFO  http.Http - http.timeout = 10000
> 2015-05-01 02:51:17,643 INFO  http.Http - http.content.limit = 65536
> 2015-05-01 02:51:17,643 INFO  http.Http - http.agent =
> orion/Nutch-1.10-SNAPSHOT
> 2015-05-01 02:51:17,643 INFO  http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-05-01 02:51:17,643 INFO  http.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=
>
> Hope it helps,
>
> ----- Original Message -----
> From: "Ankit Goel" <[email protected]>
> To: [email protected]
> Sent: Friday, May 1, 2015 2:16:43 AM
> Subject: [MASSMAIL]Nutch 1.9 Error 403 : Failed fetch
>
> Hi, I'm using Nutch 1.9 with Solr 4.9.1 on OS X. I am trying to extract news
> articles. Nutch works well for some sites, but for others I get error 403
> (failed fetch).
>
> This is the output when I run parsechecker:
>
> dumpText
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> fetching:
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> Fetch failed with protocol status: exception(16), lastModified=0: Http
> code=403, url=
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> When I run bin/crawl I get the following :
>
> fetch of
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
> failed
> with: Http code=403, url=
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> The regex filter for the site is
> +^http://([a-z0-9]*\.)*dnaindia.com
>
> nutch-default.xml has the default value
>
> <property>
>   <name>http.robots.403.allow</name>
>   <value>true</value>
>   <description>Some servers return HTTP status 403 (Forbidden) if
>   /robots.txt doesn't exist. This should probably mean that we are
>   allowed to crawl the site nonetheless. If this is set to false,
>   then such sites will be treated as forbidden.</description>
> </property>
>
> Am I missing something? Why am I getting a failed fetch?
>
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel
>



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
