I've attached a tentative patch to the ticket CONNECTORS-1434. Please confirm whether or not the patch works for you before I commit it to trunk.
Karl

On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Thanks Karl.

Please find below the steps to recreate the issue on a file system repository:

Output connector: Solr
Repository: File system
File name in repository: “dummy” file “name.pdf
Additional Solr parameter: expandMacros=false

On starting the job with the above configuration, we get a “missing content stream” error. Please find the attached file for the complete log trace.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, June 21, 2017 3:35 PM
To: [email protected]
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl

On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <[email protected]> wrote:

There is no good way to handle a case where Solr doesn't like the file name. About the only thing that could be done would be to encode the filename using something like URL encoding. This might have some effects on existing users, but more importantly, we really would need to know what characters were legal before adopting that solution.

I am not entirely sure how the file name is transmitted to Solr when using multipart forms, but how that is done is critical to knowing what to do.

Karl

On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Hi Karl,

Thanks for the update!

As per the response from the Solr team, expandMacros=false was added to the output connector as an additional parameter. After adding expandMacros=false, the indexing job completes, but with a “Missing content stream” error for a few of the documents, and those are not indexed into Solr.
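The “Missing content stream” failure reported above is consistent with the file name breaking the multipart encoding. A minimal sketch (hypothetical illustration, not ManifoldCF or Solr code) of how an unescaped double quote can prematurely terminate the filename field of a Content-Disposition header:

```java
public class ContentDispositionDemo {
    // Hypothetical helper: build the Content-Disposition line for a multipart
    // form part. A raw double quote in fileName closes the quoted filename
    // parameter early, so the server may fail to associate the part with a
    // content stream.
    static String dispositionHeader(String fileName) {
        return "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"";
    }

    public static void main(String[] args) {
        // The quote after "dummy" ends the filename parameter prematurely.
        System.out.println(dispositionHeader("dummy\" file \"name.pdf"));
    }
}
```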
As per our analysis, the file names of the PDF documents we are trying to index from Documentum contain whitespace and special characters such as double quotes, which make the file unreadable, and the “missing content stream” error is thrown.

If there is any workaround to overcome this issue, kindly share it with us.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, June 14, 2017 7:20 PM
To: [email protected]
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here: https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr? Is there some kind of “${…” being sent as a parameter? Just curious what’s getting you into this in the first place. But disabling probably is your most desired solution.

Erik
<<<<<<

Karl

On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <[email protected]> wrote:

Here's the question I posted:

>>>>>>
Hi all,

I've got a ManifoldCF user who is posting content to Solr using the MCF Solr output connector. This connector uses SolrJ under the covers -- a fairly recent version -- but also has overridden some classes to ensure that multipart form posts will be used for most content.
The problem is that, for a specific document, the user is getting a StringIndexOutOfBoundsException in Solr, as follows:

>>>>>>
2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} - java.lang.StringIndexOutOfBoundsException: String index out of range: -296
    at java.lang.String.substring(String.java:1911)
    at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
    at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
    at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
<<<<<<

It looks worrisome to me that there's now possibly some kind of "macro expansion" being triggered within parameters sent to Solr. Can anyone tell me either (a) how to disable this feature, or (b) how the MCF Solr output connector should escape parameters being posted so that Solr does not attempt any macro expansion? If the latter, I also need to know when this feature appeared, since obviously whether or not to do the escaping will depend on the precise version of the Solr instance involved.
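The failure mode in the trace above can be illustrated with a toy expander. This is a hypothetical sketch, not Solr's actual MacroExpander, but it shows how a scanner that assumes every "${" has a matching "}" ends up calling substring with a negative index when a parameter value contains an unterminated macro:

```java
public class MacroSketch {
    // Hypothetical sketch of a naive macro scanner: when the input contains
    // "${" with no closing "}", indexOf returns -1 and the substring call
    // receives a negative end index, throwing StringIndexOutOfBoundsException.
    static String naiveExpand(String s) {
        int start = s.indexOf("${");
        if (start < 0) return s;                        // no macro present
        int end = s.indexOf('}', start);                // -1 when unterminated
        String macroName = s.substring(start + 2, end); // throws when end == -1
        return s.substring(0, start) + "<" + macroName + ">" + s.substring(end + 1);
    }

    public static void main(String[] args) {
        System.out.println(naiveExpand("q=${query}"));  // well-formed macro
        try {
            naiveExpand("filename=${broken");           // unterminated macro
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("substring failed: " + e.getMessage());
        }
    }
}
```

This is only a plausible mechanism; the actual trigger inside Solr's expander was never pinned down in the thread, which is why expandMacros=false was the pragmatic fix.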
I'm also quite concerned that considerations of backwards compatibility may have been lost at some point with Solr, since heretofore I could count on older versions of SolrJ working with newer versions of Solr. Please clarify what the current policy is.

Thanks,
Karl
<<<<<<

On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <[email protected]> wrote:

I posted the pertinent question to the solr dev list. Let's see what they say.

Thanks,
Karl

On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <[email protected]> wrote:

Hi,

The exception in the solr.log should be reported as a Solr bug. It is not emanating from the Tika extractor (Solr Cell), but from Solr itself.

I wish there were an easy fix for this. The problem is *not* an empty stream; it's that Solr is attempting to do something with it that it shouldn't. MCF just gets back a 500 error from Solr, and we can't recover from that.

>>>>>>
https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)
<<<<<<

Karl

On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Hi Karl,

After configuring Solr to ignore Tika errors by adding the Tika transformer to the job, the behavior below is observed:

1) ManifoldCF fetches content from Documentum that contains null content and tries to push it to the output connector (Solr).
2) Solr cannot accept null as a value and throws a “Missing content stream” error.
3) Each agent thread in ManifoldCF is internally held up with a different r_object_id that has no body content, and keeps trying to push the content to Solr after each failure, but Solr cannot accept the content and throws the same error.
4) Over time, the ManifoldCF job stops with the error thrown by Solr.

Please let us know if there is any configuration change that can help us resolve this issue.

Please find the attached ManifoldCF error log, Solr error log, and agent log.

Regards,
Tamizh Kumaran.

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, June 13, 2017 2:23 PM
To: [email protected]
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Hi Tamizh,

The reported error is 'Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188'. The message seemingly indicates that the error was *received* from the Solr server for one specific document. ManifoldCF does not recognize the error as innocuous, and therefore it will retry for a while until it eventually gives up and halts the job. However, I cannot find that exact text anywhere in the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:

(1) A record of the attempts in the manifoldcf.log file, with an MCF stack trace attached to each one;
(2) Simple history records for that document that are of the type INGESTDOCUMENT;
(3) Solr log entries that have a Solr stack trace.

The last one would be the most helpful. It is possible that you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this way. You can (and should) configure your Solr to ignore Tika errors.

Thanks,
Karl

On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Hi,

ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper model and is integrated with PostgreSQL 9.3. The expected setup is to crawl the Documentum content and push it to the output, Solr 5.3.2.
The crawler-ui app is installed on Tomcat, and the startup script points at the MCF properties.xml during server startup. ManifoldCF, along with the bundled ZooKeeper and Tomcat, runs on the same host, with Red Hat Enterprise Linux Server release 6.9 (Santiago) as the OS. The DB is running on a Windows box.

ZooKeeper is integrated with the DB through properties.xml and properties-global.xml. ZooKeeper and the Documentum-related processes (registry and server) are up, and the two agents (start-agents.sh and start-agents-2.sh) are started, which produce multiple threads to index the Documentum content into Solr through ManifoldCF.

The current number of connections configured in MCF is as follows:

Solr output max connections: 25
Documentum repository max connections: 25

properties.xml:

<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
<property name="org.apache.manifoldcf.crawler.threads" value="25"/>

Total Documentum document count: 0.5 million

After the job is started, it indexes some 20,000+ documents and then terminates with the below error on the ManifoldCF job:

Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188

Please find the attached ManifoldCF error log and agent log.

Please let me know your observations on the cause of the issue and on the thread configuration used for crawling. Please share your thoughts.

Regards,
Tamizh Kumaran
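The URL-encoding workaround Karl floats at the top of the thread could look something like the sketch below. This is an assumed approach, not the actual CONNECTORS-1434 patch (which is attached to the ticket), and it assumes Java 10+ for the Charset overload of URLEncoder.encode:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeFileName {
    // Hypothetical workaround sketch: percent-encode the repository file name
    // before handing it to the output connector, so that double quotes and
    // whitespace cannot break the multipart transport to Solr.
    static String encode(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Quotes become %22 and spaces become +, both safe for transport.
        System.out.println(encode("dummy\" file \"name.pdf"));
        // prints dummy%22+file+%22name.pdf
    }
}
```

As Karl notes, the open question with any such encoding is which characters Solr actually considers legal, and whether encoding file names changes behavior for existing users.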
