Hi Jessica and Brooks,

On Fri, Jun 19, 2015 at 10:06 AM, <user-digest-h...@nutch.apache.org> wrote:
[snip]

> Notice the 'prevFetchTime' field has been updated to show the next
> date when this URL should be crawled (30 days from now - July 19). I
> assume this is exactly what SHOULD happen.

Correct.

> Note, the tstamp is a month from now.

ack

> I'm not sure if nutch relies on the data in elasticsearch to know when it
> should reindex (though I don't see why it would - that decision would be
> made based on when it needs to refetch and whether or not anything has
> changed, right?).

No, Nutch does not rely upon data in Elasticsearch. Crawling and indexing are separate, independent tasks.

> I would think that even IF Nutch needs to have the future date in
> Elasticsearch, it should send in the actual fetch time (i.e. the
> 'prevFetchTime' field).

Correct. There seems to be a bug here, which you have both identified.

> I've been looking through some of the source code and the problem
> does NOT appear to be in the Elasticsearch Indexer plugin as it simply
> iterates through all of the key/value pairs and inserts them.

Same here, and yes, I can confirm this is true. The bug is in BasicIndexingFilter:
https://github.com/apache/nutch/blob/2.x/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java#L127-L130

We need to check whether prevFetchTime is null: if it is, use fetchTime; otherwise use prevFetchTime.

Can one of you please open an issue and submit a patch for this? If not, I can create the issue and submit a patch myself. This is a trivial fix, and one we need to implement.

Good catch, folks.

By the way, this affects trunk as well:
https://github.com/apache/nutch/blob/trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java#L136

Lewis
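
P.S. For whoever picks this up, here is a minimal sketch of the null check described above. It is a standalone illustration, not the actual BasicIndexingFilter code: the class name IndexTimestampSketch and the method resolveTstamp are hypothetical, and it assumes both times are available as epoch milliseconds, with prevFetchTime possibly null.

```java
// Hypothetical helper illustrating the proposed check; names are made up
// for this sketch and do not exist in Nutch.
final class IndexTimestampSketch {

    private IndexTimestampSketch() {
        // static helper only
    }

    /**
     * Picks the timestamp to index.
     *
     * @param prevFetchTime time of the last completed fetch (epoch millis),
     *                      or null if it has not been set (assumption)
     * @param fetchTime     value of the fetchTime field (epoch millis)
     * @return prevFetchTime when it is set, otherwise fetchTime
     */
    static long resolveTstamp(Long prevFetchTime, long fetchTime) {
        // Only unbox prevFetchTime on the branch where it is non-null.
        return (prevFetchTime != null) ? prevFetchTime : fetchTime;
    }
}
```

In the filter, the value returned by something like resolveTstamp would presumably be what gets written to the tstamp field, so the index would carry the actual fetch time rather than the next scheduled one.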