[ https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930588#action_12930588 ]
Sebastian Nagel commented on NUTCH-933:
---------------------------------------

The modifiedTime stored in a CrawlDatum record is not the "Last-Modified" time sent by the responding server (or the time stamp of the file, when protocol-file is used) but the time the document was fetched. Is there a reason for this?

Determining the "Last-Modified" time is somewhat difficult, since it may be specified either in the HTTP header or in the HTML as <META HTTP-EQUIV="Last-Modified" ...>. Still, it would be nice-to-have information.

In addition, the index-more indexing filter, which provides the field "lastModified", does not do the job very well: it should take the value from the content metadata (which seems to be mostly correct) and not from the parse metadata.

As an aside: re-crawling with If-Modified-Since is not affected. Sending the time of the last fetch instead makes no difference, because a document only has to be re-fetched if it has been modified since the last fetch.

> Fetcher does not save a pages Last-Modified value in CrawlDatum
> ---------------------------------------------------------------
>
>                 Key: NUTCH-933
>                 URL: https://issues.apache.org/jira/browse/NUTCH-933
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.2
>            Reporter: Joe Kemp
>
> I added the following code in the output method just after the
> if (content != null) statement.
>
>     String lastModified = metadata.get("Last-Modified");
>     if (lastModified != null && !lastModified.equals("")) {
>         try {
>             Date lastModifiedDate = DateUtil.parseDate(lastModified);
>             datum.setModifiedTime(lastModifiedDate.getTime());
>         } catch (DateParseException e) {
>             // unparsable date: leave modifiedTime unchanged
>         }
>     }
>
> I now get 304 for pages that haven't changed when I recrawl. Need to do
> further testing. Might also need a configuration parameter to turn off this
> behavior, allowing pages to be forced to be refreshed.
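
For illustration only, the header parsing discussed in this thread could be sketched with the standard java.time API rather than the DateUtil helper used in the quoted patch. The class and method names (LastModifiedParser, parseMillis) are hypothetical and not part of Nutch. The sketch assumes the header uses the RFC 1123 date form; HTTP also permits two obsolete date formats, which are not handled here and would simply yield an empty result.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

// Hypothetical helper, not part of Nutch: converts a Last-Modified header
// value into epoch milliseconds, the unit passed to setModifiedTime in the
// quoted patch.
final class LastModifiedParser {

    private LastModifiedParser() {
    }

    // Returns the parsed time, or an empty OptionalLong when the header is
    // missing, blank, or not a valid RFC 1123 date.
    static OptionalLong parseMillis(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            ZonedDateTime parsed = ZonedDateTime.parse(
                    headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return OptionalLong.of(parsed.toInstant().toEpochMilli());
        } catch (DateTimeParseException e) {
            // Unparsable value: the caller keeps whatever time it already has.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        // Prints OptionalLong[784111777000]
        System.out.println(parseMillis("Sun, 06 Nov 1994 08:49:37 GMT"));
        // Prints OptionalLong.empty
        System.out.println(parseMillis("not a date"));
    }
}
```

Returning an OptionalLong instead of swallowing the exception lets the caller decide whether to fall back to the fetch time, which would also leave room for the configuration switch suggested in the issue description.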