You can also configure Solr to ignore Tika exceptions by adding the
following to the <requestHandler name="/update/extract" ...> section in
solrconfig.xml:

<bool name="ignoreTikaException">true</bool>

This will prevent the MCF job from stopping.
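For context, here is a sketch of how that handler section might look once
the flag is added. Only the ignoreTikaException line is the addition; the
handler name and class follow the standard Solr Cell setup, and the exact
surrounding defaults will vary with your configuration:

```xml
<!-- Extracting request handler with Tika exceptions ignored.
     The ignoreTikaException flag is the only addition here; the rest
     mirrors a typical Solr Cell configuration. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```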
For efficiency reasons, I would strongly recommend filtering out all
kinds of documents you're not interested in, such as media files.
Filtering out media files by filename extension may not be enough, so I
suggest filtering out MIME types as well by adding a few lines to the
"Excluded mime types" field of the Solr output connection.
Filtering out MP3s, for instance, might be done by adding this to the
"Exclude from crawl" field:
\.mp3$
- and the following to the "Excluded mime types" field:
audio/mpeg
audio/mpeg3
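If you want to double-check which MIME type a given extension usually maps
to before adding entries to the "Excluded mime types" field, a quick sketch
using Python's standard mimetypes module can help. Note this guesses from
the extension only, while Tika detects from content, so the two can differ
for mislabeled files:

```python
import mimetypes

# Guess the MIME type for a few filenames by extension, to help decide
# which entries belong in the "Excluded mime types" field.
for name in ("song.mp3", "clip.mp4", "report.pdf"):
    mime, _encoding = mimetypes.guess_type(name)
    print(f"{name}: {mime}")
```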
Erlend
On 9/16/13 11:40 AM, Karl Wright wrote:
I would second what Erlend said.
If you nevertheless want to index MP3s, I'd bring this up on the Solr
or Tika mailing lists.
Karl
On Mon, Sep 16, 2013 at 5:15 AM, Erlend Garåsen <[email protected]> wrote:
It seems that Tika is involved and tries to parse large files, i.e. MP3s.
Do you really need to index such files? If not, try to filter them out
by adding a rule in the "Exclude from crawl" field for the configured job.
Erlend
On 9/16/13 7:13 AM, Yossi Nachum wrote:
Hi,
I am trying to index the files on my Windows PC with ManifoldCF version
1.3 and Solr version 4.4.
I created an output connection and a repository connection, and started
a new job that scans my E: drive.
Everything seems to work OK, but after a few minutes Solr stops
receiving new documents; I can see that through the Tomcat log file.
In the ManifoldCF crawler UI the job still shows as running, but after
a few minutes I get the following error:
"Error: Repeated service interruptions - failure processing document:
Server at http://localhost:8080/solr/collection1 returned non ok
status:500, message:Internal Server Error"
I can also see that the Tomcat process constantly consumes 100% of one
CPU (I have two CPUs), even after I get the error message in the
ManifoldCF crawler UI.
I checked the thread dump in the Solr admin UI and saw that the
following thread takes the most CPU/user time:
"
http-8080-3 (32)
* java.io.FileInputStream.readBytes(Native Method)
* java.io.FileInputStream.read(FileInputStream.java:236)
* java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
* java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
* java.io.BufferedInputStream.read(BufferedInputStream.java:334)
* org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
* java.io.FilterInputStream.read(FilterInputStream.java:133)
* org.apache.tika.io.TailStream.read(TailStream.java:117)
* org.apache.tika.io.TailStream.skip(TailStream.java:140)
* org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
* org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
* org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
* org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
* org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
* org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
* org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
* org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
* org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
* org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
* org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
* org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
* org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
* org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
* org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
* org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
* org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
* org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
* org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
* org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
* org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
* org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
* org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
* org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
* org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
* org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
* java.lang.Thread.run(Thread.java:679)
"
Does anyone know what I can do, or how to debug this issue? I don't see
anything in the log files and I am stuck.
Thanks,
Yossi