Hey Lewis,

I was able to fetch it:

MacBookPro2014:crawls2 almohsin$ NUTCH parsechecker "http://www.nature.com/";

fetching: http://www.nature.com/

parsing: http://www.nature.com/

contentType: text/html

signature: 6cee25dd58e27e7cb0394a1325f3df6e

---------

Url

---------------


http://www.nature.com/

---------

ParseData

---------


Version: 5

Status: success(1,0)

Title:

Outlinks: 120

  outlink: toUrl: http://www.nature.com/#content anchor:
Jump to main content

..........

  outlink: toUrl: http://static.chartbeat.com/js/chartbeat.js anchor:

Content Metadata: Vary=Accept-Encoding Date=Fri, 27 Feb 2015 18:01:59 GMT
P3P=CP="CAO DSP LAW IVA IVD HIS OUR UNR STP UNI COM" Expires=Thu,
01-Jan-1970 00:00:00 GMT nutch.crawl.score=0.0 Content-Encoding=gzip
webserver=npgj2ee16.nature.com
Set-Cookie=JSESSIONID=1mxz9o0ewy9dwk18aqzndtyiq;Path=/oa;Domain=.nature.com
Connection=close Content-Type=text/html; charset=utf-8 Server=Jetty(6.1.26)

Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
language=en

Best regards,
Mohammad Al-Mohsin

On Fri, Feb 27, 2015 at 9:55 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Folks,
> I was getting a 500 Internal Server Error using Nutch trunk when attempting
> to fetch content from this domain:
> http://www.nature.com
> For context, Nature.com is a catalogue of journals and science
> resources, including the journal *Nature*. It publishes science news and
> articles across a wide range of scientific fields, so there is nothing
> malicious, sensitive, or offensive about its content.
> Can anyone else fetch this URL?
> I can get it with curl and wget but not Nutch.
> Thanks
> Lewis
>
>
> --
> *Lewis*
>
