Hi vigi,

What I think is happening is that several dates from different sources are
floating around.  There's the date found by the JCIFS connector, but there
may also be a date extracted by the Tika transformer.  The latter would only
show up for PDF and DOCX files, not for text and RTF.

You see the multiple-value error because, when both sources are mapped onto
one field, that field may end up with two distinct values.  Solr doesn't
like that unless the field is explicitly declared multivalued.
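
To make the collision concrete, here is a minimal Python sketch. It is not
ManifoldCF or Solr code: map_fields is a hypothetical helper, and the
assumption that 'Last-Modified' comes from Tika while 'fileLastModified'
comes from the JCIFS connector follows my guess above. The timestamps are
copied from the doc2.docx and doc1.txt snippets in your message.

```python
from collections import defaultdict


def map_fields(metadata, mapping):
    """Rename source fields to target fields, keeping distinct values per target.

    Hypothetical helper for illustration only; fields without a mapping
    keep their original name.
    """
    merged = defaultdict(list)
    for source, value in metadata.items():
        target = mapping.get(source, source)
        if value not in merged[target]:
            merged[target].append(value)
    return dict(merged)


# doc2.docx: one date from the connector, one (assumed) from Tika.
docx = {
    "fileLastModified": "2015-07-14T14:42:48.161Z",
    "Last-Modified": "2015-07-14T14:42:00Z",
}
# doc1.txt: only the connector date is present.
txt = {"fileLastModified": "2015-07-14T14:42:34.771Z"}

# Both source fields are mapped onto the same target field.
mapping = {
    "fileLastModified": "last_modified",
    "Last-Modified": "last_modified",
}

docx_fields = map_fields(docx, mapping)
txt_fields = map_fields(txt, mapping)

# The DOCX ends up with two distinct values for last_modified, which a
# single-valued field cannot hold; the TXT file has only one.
print(len(docx_fields["last_modified"]), len(txt_fields["last_modified"]))
```

The DOCX ends up with two values for last_modified and the text file with
one, which matches the pattern of only some documents triggering the error.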

Karl


On Tue, Jul 14, 2015 at 11:29 AM, Virgiliu R <[email protected]> wrote:

> Hello,
>
> I am using ManifoldCF 2.0.1 to import documents from a Windows share into
> Solr. I noticed a small problem with the 'last_modified' field that gets
> saved onto each Solr document: for some documents it is present but for
> others it is missing. There are, for instance, some PDF documents that
> contain this field while others lack it.
>
> I did some tests to see what information ManifoldCF sends to Solr for
> these documents, and here is a snippet containing mainly the date fields
> that are pushed to Solr. I even looked at the JCIFS connector and it
> correctly reads the last modified date of the files, which is obvious
> anyway judging by the text below.
>
> literal.createdOn=Tue+Jul+14+09:32:11+CEST+2015&
> resource.resourceName=doc1.txt&
> literal.fileLastModified=2015-07-14T14:42:34.771Z&
> literal.X-Parsed-By=org.apache.tika.parser.DefaultParser&
> literal.lastModified=Tue+Jul+14+16:42:34+CEST+2015&
> literal.fileCreatedOn=2015-07-14T07:32:11.647Z&
>
>
> literal.Creation-Date=2015-06-17T14:47:00Z&
> literal.Last-Modified=2015-07-14T14:42:00Z&
> literal.resourceName=doc2.docx&
> literal.modified=2015-07-14T14:42:00Z&
> literal.lastModified=Tue+Jul+14+16:42:48+CEST+2015&
> literal.date=2015-07-14T14:42:00Z&
> literal.createdOn=Wed+Jun+17+16:47:49+CEST+2015&
> literal.Last-Save-Date=2015-07-14T14:42:00Z&
> literal.fileLastModified=2015-07-14T14:42:48.161Z&
> literal.fileCreatedOn=2015-06-17T14:47:49.382Z}
>
> literal.Creation-Date=2015-06-17T15:37:00Z&
> literal.createdOn=Wed+Jun+17+17:37:47+CEST+2015&
> literal.resourceName=doc3.rtf&
> literal.fileLastModified=2015-07-14T14:42:21.463Z&
> literal.lastModified=Tue+Jul+14+16:42:21+CEST+2015&
> literal.fileCreatedOn=2015-06-17T15:37:47.107Z&
>
> literal.Creation-Date=2015-07-14T14:42:15Z&
> literal.Last-Modified=2015-07-14T14:42:15Z&
> resource.resourceName=doc4.pdf&
> literal.modified=2015-07-14T14:42:15Z&
> literal.lastModified=Tue+Jul+14+16:42:15+CEST+2015&
> literal.date=2015-07-14T14:42:15Z&
> literal.createdOn=Tue+Jul+14+16:38:37+CEST+2015&
> literal.Last-Save-Date=2015-07-14T14:42:15Z&
> literal.fileLastModified=2015-07-14T14:42:15.083Z&
> literal.fileCreatedOn=2015-07-14T14:38:37.632Z&
> literal.created=Tue+Jul+14+16:42:15+CEST+2015
>
> The DOCX and PDF documents are alright but the TXT and RTF are not. The
> latter ones seem to be missing the 'Last-Modified' and 'modified' fields. I
> tried to use the 'Move metadata' and 'Field mapping' tabs on the job
> definition to map, for example, the 'fileLastModified' field to the
> 'last_modified' Solr field but for some of the documents I get an error
> 'multiple values encountered for non multiValued field last_modified:
> [2015-03-11T12:32:00.000Z, 2015-03-11T12:32:30.468Z]' and I would not like
> to map the 'last_modified' Solr field as a multivalued field, unless I have
> no other options.
>
> So the question is whether this is a bug, or whether a workaround exists
> for this particular scenario.
>
> Regards,
> vigi
>
>
>
