You can also configure Solr to ignore TikaExceptions by adding the following to <requestHandler name="/update/extract" ...> in solrconfig.xml:
<bool name="ignoreTikaException">true</bool>

This will prevent the MCF job from stopping.
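
For context, here is a sketch of where that setting might sit in solrconfig.xml. It assumes the extracting handler is declared roughly as in the stock Solr 4.x example configuration, and that the flag goes in the handler's "defaults" list (the usual place for request parameter defaults). The handler attributes and the comment placeholder are assumptions, so compare against your own file rather than copying this verbatim:

```xml
<!-- Sketch only: attribute values are assumed from the stock example config. -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- ... keep whatever defaults your handler already defines ... -->

    <!-- Skip content extraction errors raised by Tika instead of failing the request. -->
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```

Solr has to reload the core (or be restarted) before the change takes effect.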

For efficiency reasons, I strongly recommend filtering out every kind of document you're not interested in, such as media files.

Filtering out media files by filename extension may not be enough, so I suggest filtering by MIME type as well, by adding a few lines to the "Excluded mime types" field of the Solr output connection.

Filtering out MP3s, for instance, can be done by adding this to the "Exclude from crawl" field:
\.mp3$

and the following to the "Excluded mime types" field:
audio/mpeg
audio/mpeg3
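
To check what these two rules would and would not catch, here is a minimal sketch in Python. It is not ManifoldCF code: `is_excluded` and the sample paths are hypothetical, and ManifoldCF evaluates its patterns with Java's regex engine and applies the two filters in different places (the job and the output connection). For a pattern this simple the two engines should behave the same, but treat that as an assumption.

```python
import re

# Pattern and MIME types copied from the message above.
EXCLUDE_PATTERN = re.compile(r"\.mp3$")
EXCLUDED_MIME_TYPES = {"audio/mpeg", "audio/mpeg3"}


def is_excluded(path, mime_type=None):
    """Hypothetical helper: True if the filename rule or the MIME rule rejects the document."""
    # search() matches anywhere in the string; the pattern itself anchors to the end.
    if EXCLUDE_PATTERN.search(path):
        return True
    return mime_type is not None and mime_type.lower() in EXCLUDED_MIME_TYPES


if __name__ == "__main__":
    samples = [
        ("E:/music/song.mp3", "audio/mpeg"),
        ("E:/music/SONG.MP3", "audio/mpeg"),   # extension rule alone misses this one
        ("E:/music/track.bin", "audio/mpeg3"),
        ("E:/docs/report.pdf", "application/pdf"),
    ]
    for path, mime in samples:
        print(path, mime, is_excluded(path, mime))
```

Note that the extension pattern is case-sensitive as written, so a file named SONG.MP3 is only caught by the MIME type rule. That is one reason to configure both filters.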

Erlend

On 9/16/13 11:40 AM, Karl Wright wrote:
I would second what Erlend said.

If you nevertheless want to index MP3s, I'd bring this up on the Solr
or Tika boards.

Karl



On Mon, Sep 16, 2013 at 5:15 AM, Erlend Garåsen wrote:


    It seems that Tika is involved and is trying to parse large files,
    e.g. MP3s.

    Do you really need to index such files? If not, try to filter them
    out by adding a rule in the "exclude from crawl" field for the
    configured job.

    Erlend


    On 9/16/13 7:13 AM, Yossi Nachum wrote:

        Hi,

        I am trying to index the files on my Windows PC with ManifoldCF
        version 1.3 and Solr version 4.4.

        I created an output connection and a repository connection and
        started a new job that scans my E drive.

        Everything seemed to work, but after a few minutes Solr stopped
        receiving new documents; I can see that in the Tomcat log file.

        In the ManifoldCF crawler UI the job still shows as running, but
        after a few minutes I get the following error:
        "Error: Repeated service interruptions - failure processing
        document: Server at http://localhost:8080/solr/collection1
        returned non ok status:500, message:Internal Server Error"

        The Tomcat process constantly consumes 100% of one CPU (I have
        two CPUs), even after I get the error message from the ManifoldCF
        crawler UI.

        I checked the thread dump in the Solr admin UI and saw that the
        following threads take the most CPU/user time:
        "
        http-8080-3 (32)

           * java.io.FileInputStream.readBytes(Native Method)
           * java.io.FileInputStream.read(FileInputStream.java:236)
           * java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
           * java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
           * java.io.BufferedInputStream.read(BufferedInputStream.java:334)
           * org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
           * java.io.FilterInputStream.read(FilterInputStream.java:133)
           * org.apache.tika.io.TailStream.read(TailStream.java:117)
           * org.apache.tika.io.TailStream.skip(TailStream.java:140)
           * org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
           * org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
           * org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
           * org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
           * org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
           * org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
           * org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
           * org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
           * org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
           * org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
           * org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
           * org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
           * org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
           * org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
           * org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
           * org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
           * org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
           * org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
           * org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
           * org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
           * org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
           * org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
           * org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
           * org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
           * org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
           * org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
           * java.lang.Thread.run(Thread.java:679)


        "

        Does anyone know what I can do, or how to debug this issue? I
        don't see anything in the log files and I am stuck.
        Thanks,
        Yossi





