I've attached a tentative patch to the ticket CONNECTORS-1434. Please confirm whether or not the patch works for you before I commit it to trunk.
Karl

On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Thanks Karl.

Please find below the steps to recreate the issue on a file system repository:

Output connector: Solr
Repository: File system
File name in repository: “dummy” file “name.pdf
Additional Solr parameter: expandMacros=false

On starting the job with the above configuration, we get a “missing content stream” error. Please find the attached file for the complete log trace.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, June 21, 2017 3:35 PM
To: [email protected]
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl

On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <[email protected]> wrote:

There is no good way to handle a case where Solr doesn't like the file name. About the only thing that could be done would be to encode the filename using something like URL encoding. This might have some effects on existing users, but more importantly, we really would need to know what characters were legal before adopting that solution.

I am not entirely sure how the file name is transmitted to Solr when using multipart forms, but how that is done is critical to knowing what to do.

Karl

On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Hi Karl,

Thanks for the update!

As per the response from the Solr team, expandMacros=false was added to the output connector as an additional parameter. After adding expandMacros=false, the indexing job completes, but with a “Missing content stream” error for a few of the documents, and those are not indexed into Solr.
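The “Missing content stream” failure reported above is consistent with the file name breaking the multipart encoding. A minimal sketch (hypothetical illustration, not ManifoldCF or Solr code) of how an unescaped double quote can prematurely terminate the filename field of a Content-Disposition header:

```java
public class ContentDispositionDemo {
    // Hypothetical helper: build the Content-Disposition line for a multipart
    // form part. A raw double quote in fileName closes the quoted filename
    // parameter early, so the server may fail to associate the part with a
    // content stream.
    static String dispositionHeader(String fileName) {
        return "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"";
    }

    public static void main(String[] args) {
        // The quote after "dummy" ends the filename parameter prematurely.
        System.out.println(dispositionHeader("dummy\" file \"name.pdf"));
    }
}
```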
As per our analysis, the file names of the PDF documents we are trying to index from Documentum contain whitespace and special characters such as double quotes, which make the file unreadable, and the “missing content stream” error is thrown.

If there is any workaround to overcome this issue, kindly share it with us.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, June 14, 2017 7:20 PM
To: [email protected]
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here: https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr? Is there some kind of “${…” being sent as a parameter? Just curious what’s getting you into this in the first place. But disabling probably is your most desired solution.

Erik
<<<<<<

Karl

On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <[email protected]> wrote:

Here's the question I posted:

>>>>>>
Hi all,

I've got a ManifoldCF user who is posting content to Solr using the MCF Solr output connector. This connector uses SolrJ under the covers -- a fairly recent version -- but also has overridden some classes to ensure that multipart form posts will be used for most content.
The problem is that, for a specific document, the user is getting a StringIndexOutOfBoundsException in Solr, as follows:

>>>>>>
2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} - java.lang.StringIndexOutOfBoundsException: String index out of range: -296
    at java.lang.String.substring(String.java:1911)
    at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
    at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
    at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
<<<<<<

It looks worrisome to me that there's now possibly some kind of "macro expansion" being triggered within parameters sent to Solr. Can anyone tell me either (a) how to disable this feature, or (b) how the MCF Solr output connector should escape parameters being posted so that Solr does not attempt any macro expansion? If the latter, I also need to know when this feature appeared, since obviously whether or not to do the escaping will depend on the precise version of the Solr instance involved.
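The failure mode in the trace above can be illustrated with a toy expander. This is a hypothetical sketch, not Solr's actual MacroExpander, but it shows how a scanner that assumes every "${" has a matching "}" ends up calling substring with a negative index when a parameter value contains an unterminated macro:

```java
public class MacroSketch {
    // Hypothetical sketch of a naive macro scanner: when the input contains
    // "${" with no closing "}", indexOf returns -1 and the substring call
    // receives a negative end index, throwing StringIndexOutOfBoundsException.
    static String naiveExpand(String s) {
        int start = s.indexOf("${");
        if (start < 0) return s;                        // no macro present
        int end = s.indexOf('}', start);                // -1 when unterminated
        String macroName = s.substring(start + 2, end); // throws when end == -1
        return s.substring(0, start) + "<" + macroName + ">" + s.substring(end + 1);
    }

    public static void main(String[] args) {
        System.out.println(naiveExpand("q=${query}"));  // well-formed macro
        try {
            naiveExpand("filename=${broken");           // unterminated macro
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("substring failed: " + e.getMessage());
        }
    }
}
```

This is only a plausible mechanism; the actual trigger inside Solr's expander was never pinned down in the thread, which is why expandMacros=false was the pragmatic fix.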
I'm also quite concerned that considerations of backwards compatibility may have been lost at some point with Solr, since heretofore I could count on older versions of SolrJ working with newer versions of Solr. Please clarify what the current policy is.

Thanks,
Karl
<<<<<<

On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <[email protected]> wrote:

I posted the pertinent question to the solr dev list. Let's see what they say.

Thanks,
Karl

On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <[email protected]> wrote:

Hi,

The exception in the solr.log should be reported as a Solr bug. It is not emanating from the Tika extractor (Solr Cell), but from Solr itself.

I wish there were an easy fix for this. The problem is *not* an empty stream; it's that Solr is attempting to do something with it that it shouldn't. MCF just gets back a 500 error from Solr, and we can't recover from that.

>>>>>>
https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)
<<<<<<

Karl

On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Hi Karl,

After configuring Solr to ignore Tika errors by adding the Tika transformer to the job, the behavior below is observed:

1) ManifoldCF fetches content from Documentum that contains null content and tries to push it to the output connector (Solr).
2) Solr cannot accept null as a value and throws a “Missing content stream” error.
3) Each agent thread in ManifoldCF is internally held up with a different r_object_id that has no body content, and keeps trying to push the content to Solr after each failure, but Solr cannot accept the content and throws the same error.
4) Over time, the ManifoldCF job stops with the error thrown by Solr.

Please let us know if there is any configuration change that can help us resolve this issue.

Please find the attached ManifoldCF error log, Solr error log, and agent log.

Regards,
Tamizh Kumaran.

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, June 13, 2017 2:23 PM
To: [email protected]
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Hi Tamizh,

The reported error is 'Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188'. The message seemingly indicates that the error was *received* from the Solr server for one specific document. ManifoldCF does not recognize the error as innocuous, and therefore it will retry for a while until it eventually gives up and halts the job. However, I cannot find that exact text anywhere in the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:

(1) A record of the attempts in the manifoldcf.log file, with an MCF stack trace attached to each one;
(2) Simple history records for that document that are of the type INGESTDOCUMENT;
(3) Solr log entries that have a Solr stack trace.

The last one would be the most helpful. It is possible that you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this way. You can (and should) configure your Solr to ignore Tika errors.

Thanks,
Karl

On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <[email protected]> wrote:

Hi,

ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper model and is integrated with PostgreSQL 9.3. The expected setup is to crawl the Documentum content and push it to the output, Solr 5.3.2.
The crawler-ui app is installed on Tomcat, and the startup script points at the MCF properties.xml during server startup. ManifoldCF, along with the bundled ZooKeeper and Tomcat, runs on the same host, with Red Hat Enterprise Linux Server release 6.9 (Santiago) as the OS. The DB is running on a Windows box.

ZooKeeper is integrated with the DB through properties.xml and properties-global.xml. ZooKeeper and the Documentum-related processes (registry and server) are up, and the two agents (start-agents.sh and start-agents-2.sh) are started, which produce multiple threads to index the Documentum content into Solr through ManifoldCF.

The current number of connections configured in MCF is as follows:

Solr output max connections: 25
Documentum repository max connections: 25

properties.xml:

<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
<property name="org.apache.manifoldcf.crawler.threads" value="25"/>

Total Documentum document count: 0.5 million

After the job is started, it indexes some 20,000+ documents and then terminates with the below error on the ManifoldCF job:

Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188

Please find the attached ManifoldCF error log and agent log.

Please let me know your observations on the cause of the issue and on the thread configuration used for crawling. Please share your thoughts.

Regards,
Tamizh Kumaran
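The URL-encoding workaround Karl floats at the top of the thread could look something like the sketch below. This is an assumed approach, not the actual CONNECTORS-1434 patch (which is attached to the ticket), and it assumes Java 10+ for the Charset overload of URLEncoder.encode:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeFileName {
    // Hypothetical workaround sketch: percent-encode the repository file name
    // before handing it to the output connector, so that double quotes and
    // whitespace cannot break the multipart transport to Solr.
    static String encode(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Quotes become %22 and spaces become +, both safe for transport.
        System.out.println(encode("dummy\" file \"name.pdf"));
        // prints dummy%22+file+%22name.pdf
    }
}
```

As Karl notes, the open question with any such encoding is which characters Solr actually considers legal, and whether encoding file names changes behavior for existing users.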
