Hello,

I am using ManifoldCF 2.0.1 to import documents from a Windows share into Solr.
I noticed a small problem with the 'last_modified' field that gets saved onto
each Solr document: for some documents it is present, but for others it is
missing. For instance, some PDF documents contain this field while others
lack it.

I ran some tests to see what information ManifoldCF sends to Solr for these
documents. Below is a snippet containing mainly the date fields that are
pushed to Solr. I also looked at the jcifs connector, and it correctly reads
the last modified date of the files, as the values below confirm.

literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&
resource.resourceName=doc1.txt&
literal.fileLastModified=2015-07-14T14:42:34.771Z&
literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&
literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&
literal.fileCreatedOn=2015-07-14T07:32:11.647Z&

literal.Creation-Date=2015-06-17T14:47:00Z&
literal.Last-Modified=2015-07-14T14:42:00Z&
literal.resourceName=doc2.docx&
literal.modified=2015-07-14T14:42:00Z&
literal.lastModified=Tue+Jul+14+16:42:48+CEST+2015&
literal.date=2015-07-14T14:42:00Z&
literal.createdOn=Wed+Jun+17+16:47:49+CEST+2015&
literal.Last-Save-Date=2015-07-14T14:42:00Z&
literal.fileLastModified=2015-07-14T14:42:48.161Z&
literal.fileCreatedOn=2015-06-17T14:47:49.382Z}                     

literal.Creation-Date=2015-06-17T15:37:00Z&
literal.createdOn=Wed+Jun+17+17:37:47+CEST+2015&
literal.resourceName=doc3.rtf&
literal.fileLastModified=2015-07-14T14:42:21.463Z&
literal.lastModified=Tue+Jul+14+16:42:21+CEST+2015&
literal.fileCreatedOn=2015-06-17T15:37:47.107Z&

literal.Creation-Date=2015-07-14T14:42:15Z&
literal.Last-Modified=2015-07-14T14:42:15Z&
resource.resourceName=doc4.pdf&
literal.modified=2015-07-14T14:42:15Z&
literal.lastModified=Tue+Jul+14+16:42:15+CEST+2015&
literal.date=2015-07-14T14:42:15Z&
literal.createdOn=Tue+Jul+14+16:38:37+CEST+2015&
literal.Last-Save-Date=2015-07-14T14:42:15Z&
literal.fileLastModified=2015-07-14T14:42:15.083Z&
literal.fileCreatedOn=2015-07-14T14:38:37.632Z&
literal.created=Tue+Jul+14+16:42:15+CEST+2015
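
For anyone who wants to compare the dumps field by field, here is a minimal
Python sketch that decodes one of the parameter strings above (the doc1.txt
one) and checks which date fields are present. It is only a reading aid, not
ManifoldCF or Solr code; the helper name literal_fields is my own.

```python
from urllib.parse import parse_qsl

# Parameter string as logged for doc1.txt, copied from the dump above.
DOC1_PARAMS = (
    "literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&"
    "resource.resourceName=doc1.txt&"
    "literal.fileLastModified=2015-07-14T14:42:34.771Z&"
    "literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&"
    "literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&"
    "literal.fileCreatedOn=2015-07-14T07:32:11.647Z&"
)


def literal_fields(params):
    """Decode a logged parameter string and return only the literal.* fields.

    Hypothetical helper for inspecting the dump; '+' is decoded to a space.
    """
    prefix = "literal."
    fields = {}
    for key, value in parse_qsl(params):
        if key.startswith(prefix):
            fields[key[len(prefix):]] = value
    return fields


fields = literal_fields(DOC1_PARAMS)
for name in ("Last-Modified", "modified", "lastModified", "fileLastModified"):
    print(name, "->", fields.get(name, "<missing>"))
```

For the TXT document this reports 'Last-Modified' and 'modified' as missing,
while 'lastModified' and 'fileLastModified' are present.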

The DOCX and PDF documents are fine, but the TXT and RTF ones are not: they
are missing the 'Last-Modified' and 'modified' fields. I tried to use the
'Move metadata' and 'Field mapping' tabs on the job definition to map, for
example, the 'fileLastModified' field to the 'last_modified' Solr field, but
for some of the documents I get the error 'multiple values encountered for non
multiValued field last_modified: [2015-03-11T12:32:00.000Z,
2015-03-11T12:32:30.468Z]'. I would rather not make the 'last_modified' Solr
field multivalued unless I have no other option.
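
To make the behaviour I am after concrete, here is a minimal Python sketch of
the selection logic: exactly one value ends up in 'last_modified', taken from
the first source field that is present. This is only an illustration of the
desired outcome, not something ManifoldCF or Solr provides as far as I know;
the function name and the preference order are my own assumptions.

```python
# Hypothetical preference order: the value extracted from the document first,
# then the file system timestamp reported by the connector.
PREFERRED_SOURCES = ("Last-Modified", "modified", "fileLastModified")


def pick_last_modified(fields):
    """Return a single last-modified value, or None if no source has one."""
    for name in PREFERRED_SOURCES:
        value = fields.get(name)
        if value:
            return value
    return None


# Date fields taken from the doc2.docx and doc1.txt dumps.
docx_fields = {
    "Last-Modified": "2015-07-14T14:42:00Z",
    "modified": "2015-07-14T14:42:00Z",
    "fileLastModified": "2015-07-14T14:42:48.161Z",
}
txt_fields = {
    "fileLastModified": "2015-07-14T14:42:34.771Z",
}

print(pick_last_modified(docx_fields))
print(pick_last_modified(txt_fields))
```

With this rule the DOCX keeps its extracted date and the TXT falls back to
the file timestamp, so neither document would carry two values.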

So my question is: is this a bug, or is there a workaround for this
particular scenario?

Regards,
vigi


                                          
