Hi Ankit,

No problem at all! Keep in mind that the redirect HTTP status code is not a 
problem in itself, although in version 1.9 there is a bug with the 
"http.redirect.max" parameter; search JIRA for a workaround. The temporary-moved 
status is basically the server's way of telling the bot/browser that the URL is 
not where you expected it to be, and where to find it instead. Nutch is able to 
follow this suggestion and fetch the resource from its new location. This is not 
the same as the initial 403 error: a 403 means you don't have permission to 
request that resource, and from what I could see it was generated by Varnish.

Perhaps you can set your User-Agent in Nutch to resemble a browser and see if 
that works? Keep in mind that this is not very polite; you should contact the 
webmaster and request permission to crawl the site. One more test I can think of 
is setting the Nutch UA to be the same as Googlebot's :) and seeing if that 
works. I don't recommend this as a usual practice, but it could confirm that the 
problem is, as it seems, the UA filtering.
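
For reference, both settings I mentioned can be overridden in conf/nutch-site.xml. 
A minimal sketch (the agent string here is only an example; pick one that honestly 
identifies your crawler):

<property>
  <name>http.agent.name</name>
  <value>mycrawler</value>
  <description>Example agent name; sent as part of the User-Agent header.</description>
</property>
<property>
  <name>http.redirect.max</name>
  <value>5</value>
  <description>Maximum number of redirects to follow per fetch.</description>
</property>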

Regards,

----- Original Message -----
From: "Ankit Goel" <[email protected]>
To: [email protected]
Sent: Monday, May 4, 2015 9:22:59 AM
Subject: Re: [MASSMAIL]Nutch 1.9 Error 403 : Failed fetch

Hi Jorge,
Thanks for that. But I'm not sure exactly what changes I should make to the
UA to "fool" the server into thinking it's not a Nutch bot. I'm still a noob
at this, so I'm not sure what to tweak.

I also looked up a couple of other companies that also deploy Varnish. So
far only one of them has responded. NYT is one of the companies listed as
using Varnish. When I tried to run a parsechecker for a link, I got the
temp_moved(13) error, which I found to be a redirect error. I changed
http.redirect.max to 5, but I still got the same temp_moved error. Just
wondering if Varnish servers have a thing for Nutch :)

On Fri, May 1, 2015 at 12:28 PM, Jorge Luis Betancourt González <
[email protected]> wrote:

> Try tweaking your user agent. If I do a parsechecker using Nutch I can
> replicate the problem; apparently that site has implemented some sort of
> user-agent filtering. Check in your hadoop.log which UA you're using. I did
> a little experiment: spoofing in Chrome the same UA as I've configured in
> Nutch, I get the same behavior. Apparently the Varnish server they have
> configured as a reverse proxy has some kind of rule to block every request
> whose UA matches Nutch.
>
> This is the relevant output of my logs/hadoop.log file:
>
> 2015-05-01 02:51:17,034 INFO  parse.ParserChecker - fetching:
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
> 2015-05-01 02:51:17,643 INFO  http.Http - http.proxy.host = 127.0.0.1
> 2015-05-01 02:51:17,643 INFO  http.Http - http.proxy.port = 3128
> 2015-05-01 02:51:17,643 INFO  http.Http - http.timeout = 10000
> 2015-05-01 02:51:17,643 INFO  http.Http - http.content.limit = 65536
> 2015-05-01 02:51:17,643 INFO  http.Http - http.agent =
> orion/Nutch-1.10-SNAPSHOT
> 2015-05-01 02:51:17,643 INFO  http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2015-05-01 02:51:17,643 INFO  http.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=
>
> Hope it helps,
>
> ----- Original Message -----
> From: "Ankit Goel" <[email protected]>
> To: [email protected]
> Sent: Friday, May 1, 2015 2:16:43 AM
> Subject: [MASSMAIL]Nutch 1.9 Error 403 : Failed fetch
>
> Hi, I'm using Nutch 1.9 with Solr 4.9.1 on OSX. I am trying to extract news
> articles. Nutch works well for some sites, but for others I get error 403
> failed fetch.
>
> This is the output when I run parsechecker:
>
> dumpText
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> fetching:
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> Fetch failed with protocol status: exception(16), lastModified=0: Http
> code=403, url=
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> When I run bin/crawl I get the following :
>
> fetch of
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
> failed
> with: Http code=403, url=
>
> http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
>
> The regex filter for the site is
> +^http://([a-z0-9]*\.)*dnaindia.com
>
> nutch-default.xml has the default value
>
> <property>
>   <name>http.robots.403.allow</name>
>   <value>true</value>
>   <description>Some servers return HTTP status 403 (Forbidden) if
>   /robots.txt doesn't exist. This should probably mean that we are
>   allowed to crawl the site nonetheless. If this is set to false,
>   then such sites will be treated as forbidden.</description>
> </property>
>
> Am I missing something? Why am I getting a failed fetch?
>
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel
>



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
