Hello, weishenyun,
As far as I know, when re-crawling a page, Nutch sends some additional
parameters to the destination server (such as the time of the last
update), with which the server can decide either to return status 304
(unchanged) or to respond with the newly modified page. Status 404 means
the page you are fetching is gone.
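
To illustrate the idea, here is a minimal Python sketch of such a conditional request. It is not Nutch's own code, only a generic HTTP example using the standard If-Modified-Since header; the function names (build_conditional_request, interpret_status, recrawl) and the labels they return are made up for this illustration.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET request that lets the server skip the body if unchanged."""
    request = urllib.request.Request(url)
    # If-Modified-Since carries the time of the previous fetch as an HTTP date.
    request.add_header(
        "If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True)
    )
    return request


def interpret_status(status):
    """Map an HTTP status code to a label (labels are illustrative only)."""
    if status == 304:
        return "unchanged"  # server says the stored copy is still current
    if status == 200:
        return "modified"   # server returned a fresh copy of the page
    if status == 404:
        return "gone"       # page no longer exists at this URL
    return "other"          # 301, 302, 503, ... are not covered by this sketch


def recrawl(url, last_fetch_epoch):
    """Send the conditional request and report what the server answered."""
    request = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(request) as response:
            return interpret_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib reports 304 and 4xx/5xx responses as HTTPError.
        return interpret_status(err.code)
```

Calling recrawl(url, time_of_last_fetch) would then return "unchanged" for a 304 and "gone" for a 404. What a crawler does with each outcome afterwards is a separate decision that this sketch does not cover.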
Best regards,
Ailen
On 2012-08-21 17:44, weishenyun [via Lucene] wrote:
> Hi everyone here,
> I want to know how Nutch updates a page after a recrawl. For
> example, a page was fetched successfully and stored in the DB or file
> system by the last crawl command, but it returns 404 when the same
> page is recrawled. Will Nutch use the 404 response to update the
> previously stored page information? What about other cases: 301, 302, 503?
> Thanks in advance.
-----
I'm what I am.
--
View this message in context:
http://lucene.472066.n3.nabble.com/What-is-the-Nutch-page-update-mechanism-after-recrawl-tp4002366p4002369.html
Sent from the Nutch - User mailing list archive at Nabble.com.