I posted the pertinent question to the Solr dev list. Let's see what they say.

Thanks,
Karl

On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <[email protected]> wrote:

> Hi,
>
> The exception in the solr.log should be reported as a Solr bug. It is not
> emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>
> I wish there was an easy fix for this. The problem is *not* an empty
> stream; it's that Solr is attempting to do something with it that it
> shouldn't. MCF just gets back a 500 error from Solr, and we can't recover
> from that.
>
> >>>>>>
> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
> (500)
> <<<<<<
>
> Karl
>
> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> After configuring Solr to ignore Tika errors by adding a Tika transformer
>> to the job, the following behavior is observed:
>>
>> 1) ManifoldCF fetches a document from Documentum whose content is null
>>    and tries to push it to the output connector (Solr).
>>
>> 2) Solr cannot accept null as a value and throws a “Missing content
>>    stream” error.
>>
>> 3) Each agent thread in ManifoldCF gets held up on a different
>>    r_object_id that has no body content and keeps retrying the push to
>>    Solr after each failure, but Solr rejects the content and throws the
>>    same error.
>>
>> 4) Over time, the ManifoldCF job stops with the error thrown by Solr.
>>
>> Please let us know if there is any configuration change that can help us
>> resolve this issue.
>>
>> Please find attached the ManifoldCF error log, Solr error log, and agent
>> log.
>>
>> Regards,
>>
>> Tamizh Kumaran.
>>
>> *From:* Karl Wright [mailto:[email protected]]
>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>> *To:* [email protected]
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>> Hi Tamizh,
>>
>> The reported error is 'Error from server at
>> http://localhost:8983/solr/documentum_manifoldcf_stg: String index out
>> of range: -188'. The
>> message seemingly indicates that the error was *received* from the solr
>> server for one specific document. ManifoldCF does not recognize the error
>> as being innocuous and therefore it will retry for a while until it
>> eventually gives up and halts the job. However, I cannot find that exact
>> text anywhere in the Solr output connector code, so I wonder if you
>> transcribed it correctly?
>>
>> There should also be the following:
>>
>> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
>> trace attached to each one;
>>
>> (2) Simple history records for that document that are of the type
>> INGESTDOCUMENT.
>>
>> (3) Solr log entries that have a Solr stack trace.
>>
>> The last one is the one that would be the most helpful. It is possible
>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>> itself in this way. You can (and should) configure your Solr to ignore
>> Tika errors.
>>
>> Thanks,
>>
>> Karl
>>
>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>> [email protected]> wrote:
>>
>> Hi,
>>
>> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>> Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui
>> app is installed on the tomcat and startup script is pointed with the MF
>> properties.xml during server startup.
>> ManifoldCF, the bundled ZK, and Tomcat are running on the same host,
>> whose OS is Red Hat Enterprise Linux Server release 6.9 (Santiago).
>> The DB is running on a Windows box.
>>
>> The ZK is integrated with the DB through the properties.xml and
>> properties-global.xml.
>>
>> ZK and the Documentum-related processes (registry and server) are up,
>> and the two agents (start-agents.sh and start-agents-2.sh) are started;
>> they spawn multiple threads to index the Documentum contents into Solr
>> through ManifoldCF.
>>
>> The current numbers of connections configured in ManifoldCF are as below.
>>
>> SOLR Output max connection : 25
>>
>> Document repository Max Connections: 25
>>
>> Properties.xml:
>>
>> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>>
>> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>>
>> Total Documentum document count : 0.5 million
>>
>> After the job is started, it indexes some 20000+ documents and then
>> terminates with the error below on the ManifoldCF job.
>>
>> Error: Repeated service interruptions - failure processing document:
>> Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
>> String index out of range: -188
>>
>> Please find attached the ManifoldCF error log and agent log.
>>
>> Please let me know your observations on the cause of the issue and on the
>> configuration of the threads used for crawling. Please share your thoughts.
>>
>> Regards,
>>
>> Tamizh Kumaran
