Hi Luca,

There is a Solr setting that configures Solr Cell to ignore Tika errors. I don't remember what it is offhand, but you will want to set it so that Tika errors are ignored.
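
For reference, the Solr Cell parameter that appears to match this description is `ignoreTikaException`. When it is true, the extracting handler skips Tika parse failures instead of returning an error, and still indexes whatever metadata it has. Below is a minimal sketch of where it could go in solrconfig.xml. The handler name `/update/extract` and the `startup="lazy"` attribute are assumptions based on the stock Solr example configuration, not taken from this setup. Any defaults already present in the handler should be kept alongside the new entry.

```xml
<!-- Sketch only: handler name and attributes are assumed from the stock
     Solr example config; adapt them to the actual solrconfig.xml. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- Skip documents that Tika cannot parse instead of failing the request -->
    <str name="ignoreTikaException">true</str>
  </lst>
</requestHandler>
```

Solr has to reload the core (or be restarted) before a change to solrconfig.xml takes effect.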
Thanks,
Karl

On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca <[email protected]> wrote:

> Hi Karl,
>
> we’re using the Web repository connector in conjunction with the Solr
> output connector to crawl a number of web portals (MCF vers. 1.6.1).
> Unfortunately the crawl job often stops, giving the following error:
>
> “Repeated service interruptions – failure processing documents: Server at
> http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500, message:
> Internal Server Error”.
>
> From the MCF and SOLR logs (which we report hereafter) it seems that the
> problem arises from Tika and applies to various types of documents (.rtf,
> .pdf, etc.).
>
> How can we fix it?
>
> Thank you.
>
> Best regards,
>
> Luca
>
> MCF log:
>
> WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption reported for job 1412340881687 connection 'Webcrawler': Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
>
> WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption reported for job 1412252016695 connection 'Webcrawler': Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
>
> SOLR log:
>
> 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>     at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
>     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     ... 20 more
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 23 more
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
>     at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
>     at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
>     ... 27 more
>
> 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
>     at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
>     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     ... 20 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
>     at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
>     at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
>     at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
>     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
>     at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 23 more
