Hello,
I am using ManifoldCF 2.0.1 to import documents from a Windows Share into Solr.
I noticed a small problem with the 'last_modified' field that gets saved onto
each Solr document: for some documents it is present but for others it is
missing. For instance, some PDF documents contain this field while others
lack it.
I ran some tests to see what information ManifoldCF sends to Solr for these
documents; here is a snippet containing mainly the date fields that are
pushed to Solr. I also looked at the jcifs connector, and it reads the last
modified date of the files correctly, as the 'fileLastModified' values in the
text below confirm.
literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&
resource.resourceName=doc1.txt&
literal.fileLastModified=2015-07-14T14:42:34.771Z&
literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&
literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&
literal.fileCreatedOn=2015-07-14T07:32:11.647Z&

literal.Creation-Date=2015-06-17T14:47:00Z&
literal.Last-Modified=2015-07-14T14:42:00Z&
literal.resourceName=doc2.docx&
literal.modified=2015-07-14T14:42:00Z&
literal.lastModified=Tue+Jul+14+16:42:48+CEST+2015&
literal.date=2015-07-14T14:42:00Z&
literal.createdOn=Wed+Jun+17+16:47:49+CEST+2015&
literal.Last-Save-Date=2015-07-14T14:42:00Z&
literal.fileLastModified=2015-07-14T14:42:48.161Z&
literal.fileCreatedOn=2015-06-17T14:47:49.382Z}

literal.Creation-Date=2015-06-17T15:37:00Z&
literal.createdOn=Wed+Jun+17+17:37:47+CEST+2015&
literal.resourceName=doc3.rtf&
literal.fileLastModified=2015-07-14T14:42:21.463Z&
literal.lastModified=Tue+Jul+14+16:42:21+CEST+2015&
literal.fileCreatedOn=2015-06-17T15:37:47.107Z&

literal.Creation-Date=2015-07-14T14:42:15Z&
literal.Last-Modified=2015-07-14T14:42:15Z&
resource.resourceName=doc4.pdf&
literal.modified=2015-07-14T14:42:15Z&
literal.lastModified=Tue+Jul+14+16:42:15+CEST+2015&
literal.date=2015-07-14T14:42:15Z&
literal.createdOn=Tue+Jul+14+16:38:37+CEST+2015&
literal.Last-Save-Date=2015-07-14T14:42:15Z&
literal.fileLastModified=2015-07-14T14:42:15.083Z&
literal.fileCreatedOn=2015-07-14T14:38:37.632Z&
literal.created=Tue+Jul+14+16:42:15+CEST+2015
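For anyone who wants to repeat the comparison on their own dump, the check for the two fields in question can be sketched in Python roughly as follows. This is only an illustration and not part of ManifoldCF or Solr: the helper name `missing_date_fields` and the `EXPECTED` tuple are assumptions made up for this example, and the two parameter strings are copied from the doc1.txt and doc2.docx entries above (the stray trailing '}' is left out).

```python
from urllib.parse import parse_qs

# Parameters copied from the dump above for doc1.txt.
DOC1_TXT = (
    "literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&"
    "resource.resourceName=doc1.txt&"
    "literal.fileLastModified=2015-07-14T14:42:34.771Z&"
    "literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&"
    "literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&"
    "literal.fileCreatedOn=2015-07-14T07:32:11.647Z"
)

# Parameters copied from the dump above for doc2.docx.
DOC2_DOCX = (
    "literal.Creation-Date=2015-06-17T14:47:00Z&"
    "literal.Last-Modified=2015-07-14T14:42:00Z&"
    "literal.resourceName=doc2.docx&"
    "literal.modified=2015-07-14T14:42:00Z&"
    "literal.lastModified=Tue+Jul+14+16:42:48+CEST+2015&"
    "literal.date=2015-07-14T14:42:00Z&"
    "literal.createdOn=Wed+Jun+17+16:47:49+CEST+2015&"
    "literal.Last-Save-Date=2015-07-14T14:42:00Z&"
    "literal.fileLastModified=2015-07-14T14:42:48.161Z&"
    "literal.fileCreatedOn=2015-06-17T14:47:49.382Z"
)

# Hypothetical list of the fields this post is concerned with.
EXPECTED = ("literal.Last-Modified", "literal.modified")


def missing_date_fields(params):
    """Return the names from EXPECTED that are absent from a parameter string."""
    fields = parse_qs(params)  # maps each parameter name to a list of values
    return [name for name in EXPECTED if name not in fields]


if __name__ == "__main__":
    for label, params in (("doc1.txt", DOC1_TXT), ("doc2.docx", DOC2_DOCX)):
        print(label, "is missing:", missing_date_fields(params))
```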
The DOCX and PDF documents are fine, but the TXT and RTF ones are not: they
are missing the 'Last-Modified' and 'modified' fields. I tried to use the
'Move metadata' and 'Field mapping' tabs in the job definition to map, for
example, the 'fileLastModified' field to the 'last_modified' Solr field, but
for some of the documents I get the error 'multiple values encountered for non
multiValued field last_modified: [2015-03-11T12:32:00.000Z,
2015-03-11T12:32:30.468Z]'. I would rather not make the 'last_modified' Solr
field multivalued unless I have no other option.
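To make the desired behaviour concrete: what I am after is a single-valued fallback, where 'fileLastModified' is used only when no document-level date is available. A minimal sketch of that logic in Python is below. It is not ManifoldCF or Solr code; the function names, the `PRIORITY` order, and the choice to drop fractional seconds are all assumptions made for this illustration.

```python
from datetime import datetime, timezone

# Hypothetical priority order: prefer dates extracted from the document,
# fall back to the file system date reported by the connector.
PRIORITY = ("Last-Modified", "modified", "fileLastModified")

# Timestamp layouts seen in the dump above (with and without milliseconds).
FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def normalize(value):
    """Parse a UTC timestamp and return it as YYYY-MM-DDThh:mm:ssZ."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError("unrecognised timestamp: %r" % value)


def pick_last_modified(metadata):
    """Return exactly one value for last_modified, or None if no date exists."""
    for name in PRIORITY:
        value = metadata.get(name)
        if not value:
            continue
        # A field may carry several values; keep only the first one.
        if isinstance(value, (list, tuple)):
            value = value[0]
        return normalize(value)
    return None


if __name__ == "__main__":
    # doc1.txt from the dump: only the file system date is available.
    print(pick_last_modified({"fileLastModified": "2015-07-14T14:42:34.771Z"}))
    # doc2.docx from the dump: the document-level date wins.
    print(pick_last_modified({
        "Last-Modified": "2015-07-14T14:42:00Z",
        "fileLastModified": "2015-07-14T14:42:48.161Z",
    }))
```

With a rule like this, each document would end up with exactly one 'last_modified' value, so the Solr field could stay single-valued.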
So my question is: is this a bug, or is there a workaround for this
particular scenario?
Regards,
vigi