Thanks Karl.
Please find below the steps to recreate the issue on a file system repository.
Output connector: Solr
Repository: File system
File name in repository: "dummy" file "name.pdf
Additional Solr parameter: expandMacros=false
On starting the job with the above configuration, we get a "missing content
stream" error.
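For reference, the repro file can be created like this (the directory path is hypothetical; any file system repository path will do):

```python
from pathlib import Path

# Create a file whose name contains spaces and double quotes,
# matching the repro file name above (directory is hypothetical).
repo = Path("/tmp/mcf-repo")
repo.mkdir(parents=True, exist_ok=True)
target = repo / '"dummy" file "name.pdf'
target.write_bytes(b"%PDF-1.4\n")  # minimal placeholder content
print(target.name)
```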
Please find the attached file for complete log trace.
Regards,
Tamizh Kumaran Thamizharasan
From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, June 21, 2017 3:35 PM
To: [email protected]
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue
I've created a ticket, CONNECTORS-1434, to look at the file name issues.
Karl
On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright
<[email protected]<mailto:[email protected]>> wrote:
There is no good way to handle a case where Solr doesn't like the file name.
About the only thing that could be done would be to encode the filename using
something like URL encoding. This might have some effects on existing users,
but more importantly, we really would need to know what characters were legal
before adopting that solution.
I am not entirely sure how the file name is transmitted to Solr when using
multipart forms, but knowing how that is done is critical to deciding what to do.
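As a rough illustration of that idea (a sketch only, not MCF code), percent-encoding a problematic file name before placing it in the form would look like:

```python
from urllib.parse import quote

# Sketch only: URL-encode the file name so quotes and spaces
# cannot break the multipart request.
raw_name = '"dummy" file "name.pdf'
safe_name = quote(raw_name, safe="")
print(safe_name)  # %22dummy%22%20file%20%22name.pdf
```

As noted above, adopting this would require knowing exactly which characters are legal, since it changes what existing users see.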
Karl
On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan
<[email protected]<mailto:[email protected]>>
wrote:
Hi Karl,
Thanks for the update!!!
As per the response from the Solr team, expandMacros=false was added to the
output connector as an additional parameter.
After adding expandMacros=false, the indexing job now runs to completion, but a
few of the documents fail with a "Missing content stream" error and are not
indexed into Solr.
As per our analysis, the file names of the PDF documents we are trying to index
from Documentum contain whitespace and special characters such as double quotes,
which makes the files unreadable, and the "missing content stream" error is thrown.
If there is any workaround to overcome this issue, kindly share it with us.
Regards,
Tamizh Kumaran Thamizharasan
From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, June 14, 2017 7:20 PM
To: [email protected]<mailto:[email protected]>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue
Here's the response:
>>>>>>
Karl -
There’s expandMacros=false, as covered here:
https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution
But… what exactly is being sent to Solr? Is there some kind of “${…” being
sent as a parameter? Just curious what’s getting you into this in the first
place. But disabling probably is your most desired solution.
Erik
<<<<<<
Karl
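To illustrate what expandMacros=false guards against (parameter names here are hypothetical), a value containing "${" would otherwise be fed through Solr's macro expansion; a sketch of a request that carries both the literal value and the flag:

```python
from urllib.parse import urlencode

# Sketch: a literal "${...}" in a field value, plus the flag that asks
# Solr to skip macro expansion for the request.
params = {
    "literal.title": "report ${draft}",  # literal text, not a macro
    "expandMacros": "false",
}
query = urlencode(params)
print(query)
```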
On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright
<[email protected]<mailto:[email protected]>> wrote:
Here's the question I posted:
>>>>>>
Hi all,
I've got a ManifoldCF user who is posting content to Solr using the MCF Solr
output connector. This connector uses SolrJ under the covers -- a fairly
recent version -- but also overrides some classes to ensure that multipart
form posts are used for most content.
The problem is that, for a specific document, the user is getting an
ArrayIndexOutOfBounds exception in Solr, as follows:
>>>>>>
2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] -
{collection=c:documentum_manifoldcf_stg,
core=x:documentum_manifoldcf_stg_shard1_replica1,
node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} -
java.lang.StringIndexOutOfBoundsException: String index out of range: -296
    at java.lang.String.substring(String.java:1911)
    at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
    at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
    at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
<<<<<<
It looks worrisome to me that there's now possibly some kind of "macro
expansion" that is being triggered within parameters being sent to Solr. Can
anyone tell me either how to (a) disable this feature, or (b) how the MCF Solr
output connector should escape parameters being posted so that Solr does not
attempt any macro expansion? If the latter, I also need to know when this
feature appeared, since obviously whether or not to do the escaping will depend
on the precise version of the Solr instance involved.
I'm also quite concerned that considerations of backwards compatibility may
have been lost at some point with Solr, since heretofore I could count on older
versions of SolrJ working with newer versions of Solr. Please clarify what the
current policy is....
Thanks,
Karl
<<<<<<
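Given that trace, one quick diagnostic (a hypothetical helper, not MCF code) is to scan outgoing parameter values for Solr's macro syntax before posting:

```python
import re

# Flag values containing "${...}", the syntax Solr's MacroExpander
# attempts to expand (an unterminated "${" is also suspicious).
MACRO_PATTERN = re.compile(r"\$\{[^}]*\}?")

def has_macro_syntax(value: str) -> bool:
    return bool(MACRO_PATTERN.search(value))

print(has_macro_syntax("plain file name.pdf"))    # False
print(has_macro_syntax("report ${CURRENT}.pdf"))  # True
```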
On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright
<[email protected]<mailto:[email protected]>> wrote:
I posted the pertinent question to the solr dev list. Let's see what they say.
Thanks,
Karl
On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright
<[email protected]<mailto:[email protected]>> wrote:
Hi,
The exception in the solr.log should be reported as a Solr bug. It is not
emanating from the Tika extractor (Solr Cell), but from Solr itself.
I wish there were an easy fix for this. The problem is *not* an empty stream;
it's that Solr is attempting to do something with it that it shouldn't. MCF
just gets back a 500 error from Solr, and we can't recover from that.
>>>>>>
https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
(500)
<<<<<<
Karl
On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan
<[email protected]<mailto:[email protected]>>
wrote:
Hi Karl,
After configuring Solr to ignore Tika errors by adding the Tika transformer to
the job, the below behavior is observed:
1) ManifoldCF fetches content from Documentum that has null content and tries
to push it to the output connector (Solr).
2) Solr cannot accept null as a value and throws a "Missing content stream"
error.
3) Each agent thread in ManifoldCF is internally held up with different
r_object_ids that have no body content; it keeps trying to push the content to
Solr after each failure, but Solr cannot accept the content and throws the same
error.
4) Over time, the ManifoldCF job stops with the error thrown by Solr.
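One possible pre-filter sketch for the behavior above (hypothetical, not a ManifoldCF feature) is to skip zero-length documents before they are posted, since Solr's ContentStreamHandlerBase rejects a request with no content stream:

```python
from typing import Optional

# Only send documents that actually have body content.
def should_index(content: Optional[bytes]) -> bool:
    return content is not None and len(content) > 0

print(should_index(None))      # False
print(should_index(b""))       # False
print(should_index(b"%PDF-"))  # True
```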
Please let us know if there is any configuration change that can help us
resolve this issue.
Please find attached the ManifoldCF error log, Solr error log, and agent log.
Regards,
Tamizh Kumaran.
From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, June 13, 2017 2:23 PM
To: [email protected]<mailto:[email protected]>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue
Hi Tamizh,
The reported error is 'Error from server at
http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of
range: -188'. The message seemingly indicates that the error was *received*
from the solr server for one specific document. ManifoldCF does not recognize
the error as being innocuous and therefore it will retry for a while until it
eventually gives up and halts the job. However, I cannot find that exact text
anywhere in the Solr output connector code, so I wonder if you transcribed it
correctly?
There should also be the following:
(1) A record of the attempts in the manifoldcf.log file, with an MCF stack
trace attached to each one;
(2) Simple history records for that document that are of the type
INGESTDOCUMENT;
(3) Solr log entries that have a Solr stack trace.
The last one is the one that would be the most helpful. It is possible that
you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this
way. You can (and should) configure your Solr to ignore Tika errors.
Thanks,
Karl
On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan
<[email protected]<mailto:[email protected]>>
wrote:
Hi,
ManifoldCF 2.7.1 is running in the multi-process ZooKeeper model, integrated
with PostgreSQL 9.3. The expected setup is to crawl the Documentum content and
push it to the output Solr 5.3.2. The crawler-ui app is installed on Tomcat,
and the startup script points to the ManifoldCF properties.xml during server
startup. ManifoldCF, along with the bundled ZooKeeper and Tomcat, runs on the
same host, whose OS is Red Hat Enterprise Linux Server release 6.9 (Santiago).
The database runs on a Windows box.
ZooKeeper is integrated with the database through properties.xml and
properties-global.xml.
ZooKeeper and the Documentum-related processes (registry and server) are up,
and the two agents (start-agents.sh and start-agents-2.sh) are started; these
produce multiple threads to index the Documentum content into Solr through
ManifoldCF.
The current number of connections configured in ManifoldCF is as below:
Solr output max connections: 25
Document repository max connections: 25
properties.xml:
<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
<property name="org.apache.manifoldcf.crawler.threads" value="25"/>
Total Documentum document count: 0.5 million
After the job is started, it indexes some 20,000+ documents and then terminates
with the below error on the ManifoldCF job:
Error: Repeated service interruptions - failure processing document: Error from
server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index
out of range: -188
Please find the attached ManifoldCF error log and agent log.
Please let me know your observations on the cause of the issue and on the
thread configuration used for crawling. Please share your thoughts.
Regards,
Tamizh Kumaran
org.apache.solr.common.SolrException: missing content stream
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:64)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)