Hi vigi,

Since the date in question comes from the document itself, you simply cannot
get it from documents that don't have it.

That is why I recommend you either use the date field from the repository, OR
have TWO date fields -- one that comes from the document, and one that comes
from the repository. I don't know precisely how you would sort it but at
least you can make an intelligent decision.

Karl

On Thu, Jul 23, 2015 at 4:53 AM, Virgiliu R <[email protected]> wrote:

> Hello Karl,
>
> Indeed, it's probably Tika; it works for some PDFs but not for all of
> them. What can I do in order to have the field available for all my
> documents? I cannot make the last_modified field in Solr multivalued
> because I have to sort on it.
>
> Regards,
> vigi
>
> ------------------------------
> Hi vigi,
>
> What I think is happening is that there are several different dates from
> different sources floating around. There's the date found by the JCIFS
> connector, but there is also a date maybe extracted by the Tika
> transformer. The latter would only show up for PDFs and for DOCXs, not for
> text and rtf.
>
> The reason you see a multiple value error is because when you map both
> sources onto one field, that field may have two distinct values. Solr
> doesn't like that, unless explicitly told it's OK.
>
> Karl
>
>
> On Tue, Jul 14, 2015 at 11:29 AM, Virgiliu R <[email protected]> wrote:
>
> > Hello,
> >
> > I am using ManifoldCF 2.0.1 to import documents from a Windows Share into
> > Solr. I noticed a small problem with the 'last_modified' field that gets
> > saved onto each Solr document: for some documents it is present but for
> > others it is missing. There are, for instance, some PDF documents that
> > contain this field while others are lacking it.
> >
> > I did some tests to see what information ManifoldCF sends to Solr for
> > these documents and here is a snippet containing mainly the date fields
> > that are pushed to Solr. I even looked at the jcifs connector and it
> > correctly reads the last modified date of the files, which is obvious
> > anyway judging by the text below.
> >
> > literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&
> > resource.resourceName=doc1.txt&
> > literal.fileLastModified=2015-07-14T14:42:34.771Z&
> > literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&
> > literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&
> > literal.fileCreatedOn=2015-07-14T07:32:11.647Z&
> >
> >
> > literal.Creation-Date=2015-06-17T14:47:00Z&
> > literal.Last-Modified=2015-07-14T14:42:00Z&
> > literal.resourceName=doc2.docx&
> > literal.modified=2015-07-14T14:42:00Z&
> > literal.lastModified=Tue+Jul+14+16:42:48+CEST+2015&
> > literal.date=2015-07-14T14:42:00Z&
> > literal.createdOn=Wed+Jun+17+16:47:49+CEST+2015&
> > literal.Last-Save-Date=2015-07-14T14:42:00Z&
> > literal.fileLastModified=2015-07-14T14:42:48.161Z&
> > literal.fileCreatedOn=2015-06-17T14:47:49.382Z}
> >
> > literal.Creation-Date=2015-06-17T15:37:00Z&
> > literal.createdOn=Wed+Jun+17+17:37:47+CEST+2015&
> > literal.resourceName=doc3.rtf&
> > literal.fileLastModified=2015-07-14T14:42:21.463Z&
> > literal.lastModified=Tue+Jul+14+16:42:21+CEST+2015&
> > literal.fileCreatedOn=2015-06-17T15:37:47.107Z&
> >
> > literal.Creation-Date=2015-07-14T14:42:15Z&
> > literal.Last-Modified=2015-07-14T14:42:15Z&
> > resource.resourceName=doc4.pdf&
> > literal.modified=2015-07-14T14:42:15Z&
> > literal.lastModified=Tue+Jul+14+16:42:15+CEST+2015&
> > literal.date=2015-07-14T14:42:15Z&
> > literal.createdOn=Tue+Jul+14+16:38:37+CEST+2015&
> > literal.Last-Save-Date=2015-07-14T14:42:15Z&
> > literal.fileLastModified=2015-07-14T14:42:15.083Z&
> > literal.fileCreatedOn=2015-07-14T14:38:37.632Z&
> > literal.created=Tue+Jul+14+16:42:15+CEST+2015
> >
> > The DOCX and PDF documents are alright but the TXT and RTF are not. The
> > latter ones seem to be missing the 'Last-Modified' and 'modified' fields.
> > I tried to use the 'Move metadata' and 'Field mapping' tabs on the job
> > definition to map, for example, the 'fileLastModified' field to the
> > 'last_modified' Solr field but for some of the documents I get an error
> > 'multiple values encountered for non multiValued field last_modified:
> > [2015-03-11T12:32:00.000Z, 2015-03-11T12:32:30.468Z]' and I would not
> > like to map the 'last_modified' Solr field as a multivalued field, unless
> > I have no other options.
> >
> > So the question is if this is a bug or if there exists a workaround for
> > this particular scenario?
> >
> > Regards,
> > vigi
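
The two-date-field suggestion at the top of this thread can be sketched
roughly as below. This is a standalone Python illustration, not ManifoldCF or
Solr configuration. The Solr field names doc_last_modified and
repo_last_modified, the function single_valued_dates and its prefer_document
switch are all hypothetical; only the literal.* parameter names and sample
values come from the snippet quoted above. It keeps the document date and the
repository date in separate single-valued fields and fills last_modified from
exactly one of them, so the field never receives two values.

```python
from urllib.parse import parse_qs

# Hypothetical Solr field names; the first two do not appear in the thread.
DOC_FIELD = "doc_last_modified"    # date extracted from the document (Tika)
REPO_FIELD = "repo_last_modified"  # date reported by the repository (JCIFS)
SORT_FIELD = "last_modified"       # single-valued field used for sorting


def single_valued_dates(query, prefer_document=False):
    """Map pushed parameters to single-valued date fields.

    `query` is a string of key=value pairs joined by '&', like the snippet
    in the thread. `prefer_document` is an assumed policy switch: when True,
    the sort field uses the document date if one exists.
    """
    params = parse_qs(query)
    # parse_qs returns a list per key; take the first value when present.
    doc = params.get("literal.Last-Modified", [None])[0]
    repo = params.get("literal.fileLastModified", [None])[0]

    fields = {}
    if doc is not None:
        fields[DOC_FIELD] = doc
    if repo is not None:
        fields[REPO_FIELD] = repo

    # Pick exactly one value for the sort field, falling back to the other
    # source only when the preferred one is missing.
    candidates = [doc, repo] if prefer_document else [repo, doc]
    chosen = next((c for c in candidates if c is not None), None)
    if chosen is not None:
        fields[SORT_FIELD] = chosen
    return fields


# Sample parameters taken from doc1.txt and doc2.docx in the thread.
TXT_PARAMS = ("literal.fileLastModified=2015-07-14T14:42:34.771Z&"
              "literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015")
DOCX_PARAMS = ("literal.Last-Modified=2015-07-14T14:42:00Z&"
               "literal.fileLastModified=2015-07-14T14:42:48.161Z")

if __name__ == "__main__":
    print(single_valued_dates(TXT_PARAMS))
    print(single_valued_dates(DOCX_PARAMS, prefer_document=True))
```

With the default policy the sort field always carries the repository date,
which every document in the snippet has. Switching prefer_document on mixes
the two sources, so the sort order then depends on whichever date each
document happens to provide.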
