Hi,

> If nutch fetches a page and get a HTTP status which is not 200(e.g. 203
> 307 404 ...), what will it do?

First, HTTP status codes are "abstracted" to a protocol status:
 - HTTP codes with similar semantics (e.g., 302, 303, 307)
   are mapped to one protocol status, TEMP_MOVED
 - in addition, there are protocol statuses not covered by HTTP,
   e.g., ROBOTS_DENIED
For details, see HttpBase.java and ProtocolStatus.java
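
To illustrate the idea of this mapping, here is a rough, self-contained
sketch. It is not the code from HttpBase.java: the class StatusMappingSketch,
its Status enum and the method fromHttpCode are made up for this example.
Only the 302/303/307 -> TEMP_MOVED grouping and the existence of
ROBOTS_DENIED are taken from the explanation above; the other branches are
assumptions about how such a mapping could plausibly look.

```java
/** Illustrative only: a simplified stand-in, not Nutch's actual classes. */
final class StatusMappingSketch {

    /** Hypothetical, reduced set of protocol statuses. */
    enum Status {
        SUCCESS,
        MOVED,
        TEMP_MOVED,
        NOTFOUND,
        ROBOTS_DENIED, // not derived from any HTTP code, listed for completeness
        EXCEPTION
    }

    private StatusMappingSketch() {}

    /** Collapses an HTTP status code into a coarser protocol status. */
    static Status fromHttpCode(int code) {
        switch (code) {
            case 200:
                return Status.SUCCESS;
            case 301:
                return Status.MOVED;      // assumption: permanent redirect
            case 302:
            case 303:
            case 307:
                return Status.TEMP_MOVED; // grouping described in this mail
            case 404:
                return Status.NOTFOUND;   // assumption
            default:
                return Status.EXCEPTION;  // assumption: everything else
        }
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 302, 303, 307, 404, 500}) {
            System.out.println(code + " -> " + fromHttpCode(code));
        }
    }
}
```

The point is only that several HTTP codes end up as one status, so later
steps of the crawl need not know the exact code that was returned.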

Second, based on the protocol status and, possibly, the outcomes of previous
fetches of the page, a CrawlDatum state is assigned to the page (URL).
http://wiki.apache.org/nutch/CrawlDatumStates may help to
understand the flow of CrawlDatum states.

> For example, when we get a reply with 307, we
> may be requested to input a check code for verification, but how will the
> nutch do? Fetching it anyway and parse it like a correct one (that may not
> contain any links we want), or just missing it?
An HTTP 307 (temporary redirect) is mapped to the protocol status TEMP_MOVED,
which causes the state of the CrawlDatum in the CrawlDb to become
STATUS_DB_REDIR_TEMP.
Only the "Location" HTTP header field is utilized to determine the target
URL which is fetched later (or immediately, depending on the property
http.redirect.max). The content (if there is some) is ignored: not parsed,
no outlinks extracted.
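
The redirect handling described above could be sketched roughly as follows.
This is not the fetcher's actual code: RedirectSketch, Action, Decision and
onTempRedirect are hypothetical names, and the treatment of maxRedirects
(a value <= 0 means "record the target for a later fetch", exceeding the
limit means "give up") is my assumption about what http.redirect.max does.
Note that the response body is not even passed in, since it is ignored.

```java
import java.net.URI;
import java.util.Collections;
import java.util.Map;

/** Illustrative only: a simplified decision function, not Nutch's fetcher. */
final class RedirectSketch {

    enum Action { FOLLOW_NOW, RECORD_FOR_LATER, GIVE_UP }

    /** Result of handling one temporary redirect. */
    static final class Decision {
        final Action action;
        final String targetUrl; // null if there is no usable Location header

        Decision(Action action, String targetUrl) {
            this.action = action;
            this.targetUrl = targetUrl;
        }
    }

    private RedirectSketch() {}

    /**
     * Decides what to do with a temporary redirect using only the
     * "Location" header. Header lookup is case-sensitive here for brevity.
     */
    static Decision onTempRedirect(String sourceUrl, Map<String, String> headers,
                                   int redirectsSoFar, int maxRedirects) {
        String location = headers.get("Location");
        if (location == null || location.trim().isEmpty()) {
            return new Decision(Action.GIVE_UP, null);
        }
        // Resolve a relative Location against the URL that was fetched.
        String target = URI.create(sourceUrl).resolve(location.trim()).toString();
        if (maxRedirects <= 0) {
            // Assumption: do not follow now, fetch the target in a later cycle.
            return new Decision(Action.RECORD_FOR_LATER, target);
        }
        if (redirectsSoFar >= maxRedirects) {
            // Assumption: too many hops in a row, stop following.
            return new Decision(Action.GIVE_UP, target);
        }
        return new Decision(Action.FOLLOW_NOW, target);
    }

    public static void main(String[] args) {
        Map<String, String> headers = Collections.singletonMap("Location", "/login");
        Decision d = onTempRedirect("http://example.com/a/page.html", headers, 0, 0);
        System.out.println(d.action + " " + d.targetUrl);
    }
}
```

So a page answering with 307 plus a verification form contributes nothing
but its redirect target; whether that target is fetched in the same cycle
or a later one is a matter of configuration.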

> with 307, we may be requested to input a check code for verification
Really? Is it possible to let the user fill in a form "for verification" in
a redirect? I must admit that I've never seen a 307 "in the wild".

Sebastian

On 07/17/2012 05:21 AM, IT_ailen wrote:
> Hi,
>     If nutch fetches a page and get a HTTP status which is not 200(e.g. 203
> 307 404 ...), what will it do? For example, when we get a reply with 307, we
> may be requested to input a check code for verification, but how will the
> nutch do? Fetching it anyway and parse it like a correct one (that may not
> contain any links we want), or just missing it?
> 
> -----
> I'm what I am.
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-does-nutch-reflect-with-HTTP-status-not-200-tp3995444.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 

