Hi vigi,

Since the date in question comes from the document itself, you simply
cannot get it from documents that don't have it.

That is why I recommend you either use the date field from the repository,
OR have TWO date fields -- one that comes from the document, and one that
comes from the repository.  I don't know precisely how you would sort on
them, but at least you could then make an intelligent decision about which
one to use.
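
As a rough illustration of what such a decision could look like, here is a
minimal Python sketch. It is only an assumption about one possible rule
(prefer the document's own date, fall back to the repository date), not
something ManifoldCF or Solr does for you. The function names are
hypothetical; the keys 'Last-Modified' and 'fileLastModified' are taken from
the metadata dump quoted below, and where such logic would actually run is
left open.

```python
from datetime import datetime, timezone


def parse_iso_utc(value):
    """Parse UTC timestamps such as 2015-07-14T14:42:34.771Z or 2015-07-14T14:42:00Z."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError("unrecognized timestamp: %r" % (value,))


def pick_last_modified(metadata):
    """Return exactly one date for a single-valued sort field, or None.

    Assumed rule (hypothetical): use the document-derived 'Last-Modified'
    when it is present, otherwise fall back to the repository-derived
    'fileLastModified', which every file in the dump carries.
    """
    for key in ("Last-Modified", "fileLastModified"):
        value = metadata.get(key)
        if value:
            return parse_iso_utc(value)
    return None
```

Because the function returns at most one value per document, the result
could be mapped onto a non-multivalued field and sorted on. Swapping the
order of the two keys gives the "always trust the repository" variant.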

Karl


On Thu, Jul 23, 2015 at 4:53 AM, Virgiliu R <[email protected]> wrote:

> Hello Karl,
>
> Indeed, it's probably Tika, it works for some PDFs but not for all of
> them. What can I do in order to have the field available for all my
> documents? I cannot make the last_modified field in Solr multivalued
> because I have to sort on it.
>
> Regards,
> vigi
>
> ------------------------------
> Hi vigi,
>
> What I think is happening is that several dates from different sources are
> floating around.  There's the date found by the JCIFS connector, but there
> may also be a date extracted by the Tika transformer.  The latter would only
> show up for PDFs and DOCXs, not for text and RTF.
>
> The reason you see a multiple value error is because when you map both
> sources onto one field, that field may have two distinct values.  Solr
> doesn't like that, unless explicitly told it's OK.
>
> Karl
>
>
> On Tue, Jul 14, 2015 at 11:29 AM, Virgiliu R <[email protected]> wrote:
>
> > Hello,
> >
> > I am using ManifoldCF 2.0.1 to import documents from a Windows Share into
> > Solr. I noticed a small problem with the 'last_modified' field that gets
> > saved onto each Solr document: for some documents it is present but for
> > others it is missing. There are, for instance, some PDF documents that
> > contain this field while others are lacking it.
> >
> > I did some tests to see what information ManifoldCF sends to Solr for
> > these documents, and here is a snippet containing mainly the date fields
> > that are pushed to Solr. I even looked at the JCIFS connector and it
> > correctly reads the last modified date of the files, which is obvious
> > anyway judging by the text below.
> >
> > literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&
> > resource.resourceName=doc1.txt&
> > literal.fileLastModified=2015-07-14T14:42:34.771Z&
> > literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&
> > literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&
> > literal.fileCreatedOn=2015-07-14T07:32:11.647Z&
> >
> >
> > literal.Creation-Date=2015-06-17T14:47:00Z&
> > literal.Last-Modified=2015-07-14T14:42:00Z&
> > literal.resourceName=doc2.docx&
> > literal.modified=2015-07-14T14:42:00Z&
> > literal.lastModified=Tue+Jul+14+16:42:48+CEST+2015&
> > literal.date=2015-07-14T14:42:00Z&
> > literal.createdOn=Wed+Jun+17+16:47:49+CEST+2015&
> > literal.Last-Save-Date=2015-07-14T14:42:00Z&
> > literal.fileLastModified=2015-07-14T14:42:48.161Z&
> > literal.fileCreatedOn=2015-06-17T14:47:49.382Z}
> >
> > literal.Creation-Date=2015-06-17T15:37:00Z&
> > literal.createdOn=Wed+Jun+17+17:37:47+CEST+2015&
> > literal.resourceName=doc3.rtf&
> > literal.fileLastModified=2015-07-14T14:42:21.463Z&
> > literal.lastModified=Tue+Jul+14+16:42:21+CEST+2015&
> > literal.fileCreatedOn=2015-06-17T15:37:47.107Z&
> >
> > literal.Creation-Date=2015-07-14T14:42:15Z&
> > literal.Last-Modified=2015-07-14T14:42:15Z&
> > resource.resourceName=doc4.pdf&
> > literal.modified=2015-07-14T14:42:15Z&
> > literal.lastModified=Tue+Jul+14+16:42:15+CEST+2015&
> > literal.date=2015-07-14T14:42:15Z&
> > literal.createdOn=Tue+Jul+14+16:38:37+CEST+2015&
> > literal.Last-Save-Date=2015-07-14T14:42:15Z&
> > literal.fileLastModified=2015-07-14T14:42:15.083Z&
> > literal.fileCreatedOn=2015-07-14T14:38:37.632Z&
> > literal.created=Tue+Jul+14+16:42:15+CEST+2015
> >
> > The DOCX and PDF documents are alright but the TXT and RTF are not. The
> > latter ones seem to be missing the 'Last-Modified' and 'modified' fields.
> > I tried to use the 'Move metadata' and 'Field mapping' tabs on the job
> > definition to map, for example, the 'fileLastModified' field to the
> > 'last_modified' Solr field but for some of the documents I get an error
> > 'multiple values encountered for non multiValued field last_modified:
> > [2015-03-11T12:32:00.000Z, 2015-03-11T12:32:30.468Z]' and I would not like
> > to map the 'last_modified' Solr field as a multivalued field, unless I
> > have no other options.
> >
> > So the question is if this is a bug or if there exists a workaround for
> > this particular scenario?
> >
> > Regards,
> > vigi
> >
> >
> >
>
>
>
>
