Hello, weishenyun,
     As far as I know, when re-crawling a page, Nutch sends some additional 
information to the destination server (such as the time of the last fetch, 
e.g. in an If-Modified-Since header), which the server can use to decide 
whether to return status 304 (Not Modified) or to respond with the newly 
modified page. Status 404 means the page you are fetching is gone.
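
To make the exchange above concrete, here is a minimal sketch of a conditional request in Python. It is not Nutch's own code: the function names and the returned labels are hypothetical and only illustrate the general HTTP mechanism, and it says nothing about how Nutch itself records 301, 302 or 503.

```python
from email.utils import formatdate


def conditional_headers(last_fetch_epoch):
    """Build the extra request header from the time of the previous fetch.

    last_fetch_epoch is seconds since the Unix epoch (an assumption of
    this sketch; a crawler may store the time in another form).
    """
    # HTTP dates must be in GMT, hence usegmt=True.
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}


def interpret_status(status):
    """Map a response status to a hypothetical label for the re-crawl."""
    if status == 304:
        return "unchanged"  # server says the stored copy is still current
    if status == 200:
        return "modified"   # server sent a fresh copy of the page
    if status == 404:
        return "gone"       # the page no longer exists
    return "other"          # redirects, server errors, etc. are not covered


headers = conditional_headers(0)
print(headers["If-Modified-Since"])
print(interpret_status(304), interpret_status(404))
```

The headers would be attached to the GET request for the page, and the status of the response then tells the crawler whether its stored copy is still valid.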

Best regards,
     Ailen
On 2012-08-21 17:44, weishenyun [via Lucene] wrote:
> Hi everyone here,
>        I would like to know how Nutch updates a page after a recrawl. 
> For example, suppose a page was fetched successfully and stored in the 
> DB or file system by the last crawl command, but it returns 404 when 
> the same page is recrawled. Will Nutch use the information from the 404 
> response to update the previously stored page information? What about 
> other cases, such as 301, 302 or 503?
>       Thanks in advance.





-----
I'm what I am.