Hello,

I have been trying to use Nutch for some time. My setup is MongoDB as the
storage backend plus Elasticsearch for indexing.

I am trying to set up some kind of incremental crawling to track
periodically published material, mostly in PDF format, so I am also using
Tika.

After crawling and updating the index, I use Elasticsearch to search the
published documents for matching keywords.

The search should be run only against newer items, so I need a mechanism
to check whether an indexed item is newer than the last search date. The
tstamp, last modified, or date fields seem appropriate, but:

1. Nutch 2.3 sets the timestamp to a month later, and date is 1970. I tried
the index-more plugin, but the last-modified date is still null. Looking at
the Elasticsearch mapping, the date and tstamp fields are defined there.

2. Nutch 2.4.x (trunk) does not even set tstamp; it comes out as 1970.
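
For reference, the kind of query I would like to run once one of these
fields is reliable looks roughly like the sketch below. It only builds the
request body and does not talk to a cluster. The helper name
build_new_items_query, the "content" text field, and the example date are
my own assumptions for illustration; it also assumes an Elasticsearch
version that accepts a bool query with a filter clause (older releases
would need the equivalent filtered query).

```python
import json
from datetime import datetime, timezone


def build_new_items_query(keywords, last_search,
                          text_field="content", time_field="tstamp"):
    """Build a search body: keyword match, restricted to items newer
    than last_search. Field names are assumptions, not verified against
    the mapping that Nutch actually creates."""
    # Treat naive datetimes as UTC so the comparison is unambiguous.
    if last_search.tzinfo is None:
        last_search = last_search.replace(tzinfo=timezone.utc)
    return {
        "query": {
            "bool": {
                # Full-text match on the extracted document text.
                "must": {"match": {text_field: " ".join(keywords)}},
                # Keep only documents stamped after the last search.
                "filter": {
                    "range": {time_field: {"gt": last_search.isoformat()}}
                },
            }
        }
    }


if __name__ == "__main__":
    # Arbitrary example date standing in for the stored last search time.
    last_search = datetime(2015, 6, 1, tzinfo=timezone.utc)
    body = build_new_items_query(["annual", "report"], last_search)
    print(json.dumps(body, indent=2))
```

This obviously only works if tstamp (or whichever field is used) holds a
real date, which is exactly what is failing in the two cases above.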

This is really important, as I will have to crawl manually if I cannot
obtain this property. Any ideas?

Thanks in advance,

Alp
