Hello Karl,

Indeed, it's probably Tika: it works for some PDFs but not for all of them. What can I do to have the field available for all my documents? I cannot make the last_modified field in Solr multivalued because I have to sort on it.
Regards,
vigi

Hi vigi,

What I think is happening is that several different dates from different sources are floating around. There's the date found by the JCIFS connector, but there may also be a date extracted by the Tika transformer. The latter would only show up for PDFs and DOCXs, not for text and RTF. The reason you see a multiple-value error is that when you map both sources onto one field, that field may end up with two distinct values. Solr doesn't like that unless explicitly told it's OK.

Karl

On Tue, Jul 14, 2015 at 11:29 AM, Virgiliu R <[email protected]> wrote:
> Hello,
>
> I am using ManifoldCF 2.0.1 to import documents from a Windows Share into
> Solr. I noticed a small problem with the 'last_modified' field that gets
> saved onto each Solr document: for some documents it is present but for
> others it is missing. There are, for instance, some PDF documents that
> contain this field while others are lacking it.
>
> I did some tests to see what information ManifoldCF sends to Solr for
> these documents and here is a snippet containing mainly the date fields
> that are pushed to Solr. I even looked at the jcifs connector and it
> correctly reads the last modified date of the files, which is obvious
> anyway judging by the text below.
>
> literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&
> resource.resourceName=doc1.txt&
> literal.fileLastModified=2015-07-14T14:42:34.771Z&
> literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&
> literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&
> literal.fileCreatedOn=2015-07-14T07:32:11.647Z&
>
> literal.Creation-Date=2015-06-17T14:47:00Z&
> literal.Last-Modified=2015-07-14T14:42:00Z&
> literal.resourceName=doc2.docx&
> literal.modified=2015-07-14T14:42:00Z&
> literal.lastModified=Tue+Jul+14+16:42:48+CEST+2015&
> literal.date=2015-07-14T14:42:00Z&
> literal.createdOn=Wed+Jun+17+16:47:49+CEST+2015&
> literal.Last-Save-Date=2015-07-14T14:42:00Z&
> literal.fileLastModified=2015-07-14T14:42:48.161Z&
> literal.fileCreatedOn=2015-06-17T14:47:49.382Z}
>
> literal.Creation-Date=2015-06-17T15:37:00Z&
> literal.createdOn=Wed+Jun+17+17:37:47+CEST+2015&
> literal.resourceName=doc3.rtf&
> literal.fileLastModified=2015-07-14T14:42:21.463Z&
> literal.lastModified=Tue+Jul+14+16:42:21+CEST+2015&
> literal.fileCreatedOn=2015-06-17T15:37:47.107Z&
>
> literal.Creation-Date=2015-07-14T14:42:15Z&
> literal.Last-Modified=2015-07-14T14:42:15Z&
> resource.resourceName=doc4.pdf&
> literal.modified=2015-07-14T14:42:15Z&
> literal.lastModified=Tue+Jul+14+16:42:15+CEST+2015&
> literal.date=2015-07-14T14:42:15Z&
> literal.createdOn=Tue+Jul+14+16:38:37+CEST+2015&
> literal.Last-Save-Date=2015-07-14T14:42:15Z&
> literal.fileLastModified=2015-07-14T14:42:15.083Z&
> literal.fileCreatedOn=2015-07-14T14:38:37.632Z&
> literal.created=Tue+Jul+14+16:42:15+CEST+2015
>
> The DOCX and PDF documents are alright but the TXT and RTF are not. The
> latter ones seem to be missing the 'Last-Modified' and 'modified' fields.
> I tried to use the 'Move metadata' and 'Field mapping' tabs on the job
> definition to map, for example, the 'fileLastModified' field to the
> 'last_modified' Solr field but for some of the documents I get an error
> 'multiple values encountered for non multiValued field last_modified:
> [2015-03-11T12:32:00.000Z, 2015-03-11T12:32:30.468Z]' and I would not like
> to map the 'last_modified' Solr field as a multivalued field, unless I have
> no other options.
>
> So the question is if this is a bug or if there exists a workaround for
> this particular scenario?
>
> Regards,
> vigi
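
To illustrate the point about two date sources colliding on one field, here is a minimal Python sketch of the selection logic a workaround would need: from all the date parameters sent for a document, keep exactly one value for the single-valued field. It is not ManifoldCF or Solr code. The function name `pick_last_modified` and the preference order in `PREFERRED_DATE_FIELDS` are assumptions made for illustration; only the parameter names themselves are taken from the snippets quoted in this thread.

```python
from urllib.parse import parse_qsl

# Hypothetical preference order (an assumption, not a ManifoldCF setting):
# the file system date comes first because every document in the thread has it,
# while the Tika-extracted dates appear only for some formats.
PREFERRED_DATE_FIELDS = (
    "literal.fileLastModified",
    "literal.Last-Modified",
    "literal.modified",
)


def pick_last_modified(query_string):
    """Return a single date value for a document, or None if none is present."""
    # Parse the '&'-separated name=value pairs; empty segments are ignored.
    params = dict(parse_qsl(query_string))
    for name in PREFERRED_DATE_FIELDS:
        if name in params:
            return params[name]
    return None


# A text file carries only the file system date.
txt_params = (
    "resource.resourceName=doc1.txt&"
    "literal.fileLastModified=2015-07-14T14:42:34.771Z&"
    "literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&"
)

# A PDF carries both the Tika date and the file system date.
pdf_params = (
    "literal.Last-Modified=2015-07-14T14:42:15Z&"
    "resource.resourceName=doc4.pdf&"
    "literal.modified=2015-07-14T14:42:15Z&"
    "literal.fileLastModified=2015-07-14T14:42:15.083Z&"
)

print(pick_last_modified(txt_params))
print(pick_last_modified(pdf_params))
```

Whatever mechanism ends up doing this (job-level mapping or something on the Solr side), the key design choice is the same: only one source may feed last_modified, and the source that exists for every document type is the safer one to sort on.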
