Hi Shinichiro,
we found the right configuration just before your suggestion.
Thank you!

Nevertheless, applying "ignoreTikaException" reduces the problem somewhat but 
doesn't resolve it completely.
Specifically, the problem still persists for some PDF files (not only for 
scanned PDFs and/or PDFs converted from MS Office documents).
Given that the Tika project is not resolving this issue, we suggest that the 
problem could be bypassed at the MCF job or output connector level, 
by means of a specific flag telling the MCF web crawler to skip documents that 
fail with "non ok status: 500, message: Internal Server Error" and keep on crawling.
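
To make the request concrete, here is a minimal sketch of the behaviour we have 
in mind. It is not MCF code: the class, the enum and the skipOnServerError flag 
are hypothetical names invented for illustration, and the HTTP statuses are 
simulated rather than returned by a real Solr server.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipOn500Sketch {

    /** Hypothetical result of handing one document to the output connector. */
    enum Outcome { INDEXED, SKIPPED, ABORT_JOB }

    /**
     * Maps the HTTP status returned by the index server to an outcome.
     * skipOnServerError is the hypothetical flag proposed in this message.
     */
    static Outcome handle(int httpStatus, boolean skipOnServerError) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.INDEXED;
        }
        if (httpStatus == 500 && skipOnServerError) {
            // Give up on this document only and let the job continue.
            return Outcome.SKIPPED;
        }
        // Any other failure stops the job in this simplified model.
        return Outcome.ABORT_JOB;
    }

    /** Processes simulated statuses in order and stops at the first abort. */
    static List<Outcome> crawl(int[] statuses, boolean skipOnServerError) {
        List<Outcome> outcomes = new ArrayList<>();
        for (int status : statuses) {
            Outcome outcome = handle(status, skipOnServerError);
            outcomes.add(outcome);
            if (outcome == Outcome.ABORT_JOB) {
                break;
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        int[] statuses = {200, 500, 200};
        System.out.println(crawl(statuses, true));   // [INDEXED, SKIPPED, INDEXED]
        System.out.println(crawl(statuses, false));  // [INDEXED, ABORT_JOB]
    }
}
```

With the flag off, the second document stops the whole job, which is what we 
observe today; with the flag on, only that document is lost.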

Dear Karl, could you add this option to the next MCF release?
Thanks a lot, as always.

Luca


-----Original Message-----
From: Shinichiro Abe [mailto:[email protected]] 
Sent: Tuesday, 7 October 2014 03:21
To: [email protected]
Cc: [email protected]
Subject: Re: Internal server error (500) causing a crawl interruption

Hi Luca,

Please try to configure ignoreTikaException=true.

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                  startup="lazy">
    <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
    </lst>
  </requestHandler>
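
The same Solr Cell parameter can also be sent with an individual extract 
request instead of being set as a handler default. The sketch below only 
builds such a request URL and sends nothing; the class name, the helper method 
and the sample document id are assumptions made for illustration, while the 
core URL is the one from Luca's log.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ExtractUrlSketch {

    /**
     * Builds an /update/extract URL that carries ignoreTikaException as a
     * request parameter. coreUrl and docId are supplied by the caller.
     */
    static String extractUrl(String coreUrl, String docId, boolean ignoreTikaException) {
        return coreUrl + "/update/extract"
                + "?literal.id=" + URLEncoder.encode(docId, StandardCharsets.UTF_8)
                + "&ignoreTikaException=" + ignoreTikaException;
    }

    public static void main(String[] args) {
        // "doc 1" is a made-up document id used only for this example.
        System.out.println(extractUrl("http://vm97lnx:9474/solr/rerweb5", "doc 1", true));
    }
}
```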

Regards,
Shinichiro Abe

On 2014/10/06, at 20:15, Karl Wright <[email protected]> wrote:

> Hi Luca,
> 
> There is a Solr setting which configures Solr Cell to ignore Tika errors.  I 
> don't remember what it is offhand, but you will want to set it so that 
> Tika errors are ignored.
> 
> Thanks,
> Karl
> 
> 
> On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca <[email protected]> 
> wrote:
> Hi Karl,
> 
> we’re using the Web repository connector in conjunction with the Solr output 
> connector to crawl a number of web portals (MCF version 1.6.1). Unfortunately, 
> the crawl job often stops with the following error:
> 
> “Repeated service interruptions – failure processing documents: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500, message: 
> Internal Server Error”.
> 
> From the MCF and Solr logs (included below), the problem seems to 
> originate in Tika and applies to various types of documents (.rtf, 
> .pdf, etc.).
> 
> How can we fix it?
> 
> Thank you.
> 
>  
> 
> Best regards,
> 
> Luca
> 
>  
> 
> MCF log:
> 
>  
> 
> WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during 
> indexing 
> http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
>  (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok 
> status:500, message:Internal Server Error
> 
> org.apache.solr.common.SolrException: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
> WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption 
> reported for job 1412340881687 connection 'Webcrawler': Solr exception during 
> indexing 
> http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf
>  (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok 
> status:500, message:Internal Server Error
> 
> ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: 
> Repeated service interruptions - failure processing document: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service 
> interruptions - failure processing document: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
> Caused by: org.apache.solr.common.SolrException: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
>  
> 
> WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during 
> indexing 
> http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
>  (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok 
> status:500, message:Internal Server Error
> 
> org.apache.solr.common.SolrException: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
> WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption 
> reported for job 1412252016695 connection 'Webcrawler': Solr exception during 
> indexing 
> http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf
>  (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok 
> status:500, message:Internal Server Error
> 
> ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: 
> Repeated service interruptions - failure processing document: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service 
> interruptions - failure processing document: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
> Caused by: org.apache.solr.common.SolrException: Server at 
> http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal 
> Server Error
> 
>  
> 
> SOLR log:
> 
>  
> 
> 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] 
> (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.pdf.PDFParser@6533a82a
> 
>        at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> 
>         at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 
>         at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 
>         at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> 
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> 
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> 
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> 
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> 
>         at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> 
>         at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> 
>         at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> 
>         at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> 
>         at 
> org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> 
>         at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> 
>         at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 
>         at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 
>         at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> 
>         at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> 
>         at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> 
>         at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> 
>         at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 
>         at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> 
>         ... 20 more
> 
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
> 
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
> 
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
> 
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
> 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         ... 23 more
> 
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
> range: 2047
> 
>         at 
> java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
> 
>         at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
> 
>         at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
> 
>         at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> 
>         at 
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
> 
>         at 
> org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
> 
>         ... 27 more
> 
>  
> 
> 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] 
> (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.rtf.RTFParser@73361285
> 
>         at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> 
>         at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 
>         at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 
>         at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> 
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> 
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> 
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> 
>         at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> 
>         at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> 
>         at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> 
>         at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> 
>         at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> 
>         at 
> org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> 
>         at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> 
>         at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 
>         at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 
>         at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> 
>         at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> 
>         at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> 
>         at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> 
>         at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 
>         at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> 
>         ... 20 more
> 
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
> 
>         at 
> org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
> 
>         at 
> org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
> 
>         at 
> org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
> 
>         at 
> org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
> 
>         at 
> org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
> 
>         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
> 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         ... 23 more
> 
> 
