Re: Retrieve HTTP Status code from crawl

Markus Jelsma Mon, 21 Nov 2011 08:55:55 -0800

AFAIK Nutch doesn't store the raw HTTP code at all. Instead, it encodes it as a 
single status byte. You can check the CrawlDatum class for the status codes and 
their meaning.
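
For reference, here is a minimal sketch of how those status bytes map to
names. The values are as I recall them from the Nutch 1.x CrawlDatum source,
so please check them against the version you run. The class name
CrawlStatusNames and the describe helper are made up for illustration and are
not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: maps a CrawlDatum status byte
// to a readable name. Values recalled from Nutch 1.x; verify before use.
final class CrawlStatusNames {
    private static final Map<Integer, String> NAMES = new LinkedHashMap<>();
    static {
        // CrawlDb statuses
        NAMES.put(0x01, "db_unfetched");
        NAMES.put(0x02, "db_fetched");
        NAMES.put(0x03, "db_gone");
        NAMES.put(0x04, "db_redir_temp");
        NAMES.put(0x05, "db_redir_perm");
        NAMES.put(0x06, "db_notmodified");
        // Fetch statuses
        NAMES.put(0x21, "fetch_success");
        NAMES.put(0x22, "fetch_retry");
        NAMES.put(0x23, "fetch_redir_temp");
        NAMES.put(0x24, "fetch_redir_perm");
        NAMES.put(0x25, "fetch_gone");
        NAMES.put(0x26, "fetch_notmodified");
        // Other markers
        NAMES.put(0x41, "signature");
        NAMES.put(0x42, "injected");
        NAMES.put(0x43, "linked");
    }

    // Formats a status the same way as the "67 (linked)" value quoted below.
    static String describe(int status) {
        String name = NAMES.get(status);
        return status + " (" + (name != null ? name : "unknown") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(67)); // prints "67 (linked)"
    }
}
```

So 67 is 0x43, which only marks a datum created from an outlink. It says
nothing about the HTTP response for that page.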


However, if you must, you can modify the Fetcher to store the ProtocolStatus 
value in the CrawlDatum metadata.

On Monday 21 November 2011 17:43:28 Tim Fletcher wrote:
> Hi All,
> 
> I'm trying to get the status code associated with each page, but I can't
> find a way to do this.
> 
> I have tried getting the status from CrawlDatum.PARSE_DIR_NAME, however this
> gives me values such as "Status: 67 (linked)".
> 
> Also, is it possible to extract data regarding things like 301/302
> redirects? For example, I would like to trace the redirect path from page 1
> to page 2 (i.e. all the intermediate pages followed).
> 
> Any help on how to get the "raw" HTTP status codes would be
> much appreciated.
> 
> Regards,
> Tim

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
