James Sullivan created NUTCH-1475:
-------------------------------------

             Summary: Nutch 2.1 Index-More Plugin -- A better fall back value for date field
                 Key: NUTCH-1475
                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
             Project: Nutch
          Issue Type: Bug
    Affects Versions: nutchgora, 2.1
         Environment: All
            Reporter: James Sullivan
            Priority: Minor
         Attachments: index-more-2x.patch

Among other fields, the index-more plugin for Nutch 2.x provides "last 
modified" and "date" fields for the Solr index. The "last modified" field 
holds the last modified date from the HTTP headers when available; otherwise 
it is left empty. Currently, the "date" field is the same as the "last 
modified" field unless that field is empty, in which case getFetchTime is used 
as a fallback. I think getFetchTime is a poor fallback: it is the next fetch 
time, often a month or more in the future, which doesn't make sense for the 
date field. Users do not expect web pages or documents with future dates. A 
more sensible fallback would be the current date at the time of indexing. 

This is possible by simply changing line 97 of 
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
 from


time = page.getFetchTime(); // use fetch time

to

time = new Date().getTime();


Users interested in the getFetchTime value can still get it from the "tstamp" 
field.
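
For illustration only, the proposed fallback can be sketched outside of Nutch 
as below. The class DateFallbackSketch, the method resolveDate and the use of 
-1 as the "no last modified value" marker are assumptions made for this 
example; they are not taken from MoreIndexingFilter or from the attached 
patch.

```java
import java.util.Date;

public class DateFallbackSketch {

    // Hypothetical marker meaning "no last modified header was available".
    static final long UNKNOWN = -1L;

    /**
     * Picks the value for the "date" field.
     *
     * @param lastModifiedMillis last modified time in epoch millis, or UNKNOWN
     * @param indexTimeMillis    time at which the document is being indexed
     * @return the last modified time if known, otherwise the indexing time
     */
    static long resolveDate(long lastModifiedMillis, long indexTimeMillis) {
        if (lastModifiedMillis == UNKNOWN) {
            // Proposed behaviour: fall back to "now" rather than to the
            // next scheduled fetch time, which usually lies in the future.
            return indexTimeMillis;
        }
        return lastModifiedMillis;
    }

    public static void main(String[] args) {
        long now = new Date().getTime();

        // Page without a last modified header: the date becomes the index time.
        System.out.println(new Date(resolveDate(UNKNOWN, now)));

        // Page with a last modified header: that value is kept unchanged.
        long oneDayAgo = now - 24L * 60 * 60 * 1000;
        System.out.println(new Date(resolveDate(oneDayAgo, now)));
    }
}
```

With this arrangement the "date" field can never be later than the moment of 
indexing, unless the server itself reports a last modified date in the future.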



