http://pastebin.com/AWkgVeUh
K

On Mon, Oct 20, 2014 at 12:13:42PM -0400, Karl Wright wrote:
> Can you provide the Solr exception, from the Solr log?
> Karl
>
> On Mon, Oct 20, 2014 at 12:11 PM, Kamil Żyta <[email protected]> wrote:
>
> Hi,
> I have some bad files too and get 500 errors from Solr, tested on
> Solr stable and trunk (Tika 1.5, 1.6). The ManifoldCF job hangs and
> never ends.
> ManifoldCF has 'Transformation Connections', where I added the Tika
> extractor.
> How does this work? Is it only metadata extraction or MIME detection?
> If ManifoldCF had complete Tika extraction, it could handle Tika
> errors better.
>
> Regards,
> KŻ
>
> On Mon, Oct 20, 2014 at 06:15:52AM -0400, Karl Wright wrote:
> > Hi Luca,
> > I am sorry, but we only get back a 500 error from Solr, and that is
> > not enough information to determine that Tika failed. Having a
> > general policy of ignoring 500 errors, which occur when *any* Solr
> > exception is thrown, seems like a bad idea to me. Indeed, I am
> > concerned that it is not a Tika failure that you are seeing, but
> > rather something like Solr running out of memory, which should
> > definitely never be ignored.
> > You can tell what the underlying cause is by looking at the actual
> > exception that Solr logs.
> > Thanks,
> > Karl
> >
> > On Mon, Oct 20, 2014 at 5:00 AM, Basso Luca <[email protected]> wrote:
> >
> > Hi Shinichiro,
> > we found the right configuration just before your suggestion.
> > Thank you!
> >
> > Nevertheless, applying "ignoreTikaException" reduces the problem
> > somewhat but doesn't resolve it completely.
> > Specifically, the problem still persists for some PDF files (not
> > only scanned PDFs and/or PDFs converted from MS Office documents).
> > Given that the Tika project is not resolving this issue, we suggest
> > that the problem could be bypassed at the MCF job or output
> > connector level, by means of a specific flag telling the MCF
> > webcrawler to skip "non ok status: 500, message: Internal Server
> > Error" and keep on crawling.
> >
> > Dear Karl, could you add this option in the next MCF release?
> > Thanks a lot, as ever.
> >
> > Luca
> >
> > -----Original Message-----
> > From: Shinichiro Abe [mailto:[email protected]]
> > Sent: Tuesday, 7 October 2014 03:21
> > To: [email protected]
> > Cc: [email protected]
> > Subject: Re: Internal server error (500) causing a crawl interruption
> >
> > Hi Luca,
> >
> > Please try to configure ignoreTikaException=true.
> >
> > <requestHandler name="/update/extract"
> >                 class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> >                 startup="lazy">
> >   <lst name="defaults">
> >     <str name="fmap.content">text</str>
> >     <str name="lowernames">true</str>
> >     <bool name="ignoreTikaException">true</bool>
> >     <str name="uprefix">ignored_</str>
> >     <str name="captureAttr">true</str>
> >   </lst>
> > </requestHandler>
> >
> > Regards,
> > Shinichiro Abe
> >
> > On 2014/10/06, at 20:15, Karl Wright <[email protected]> wrote:
> >
> > > Hi Luca,
> > >
> > > There is a Solr setting which configures Solr Cell to ignore Tika
> > > errors. I don't remember what it is offhand, but you will want to
> > > set it properly to disable Tika errors.
> > >
> > > Thanks,
> > > Karl
> > >
> > > On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca <[email protected]> wrote:
> > > Hi Karl,
> > >
> > > we're using the Web repository connector in conjunction with the
> > > Solr output connector to crawl a number of web portals (MCF vers.
> > > 1.6.1).
> > > Unfortunately the crawl job often stops giving the following error:
> > >
> > > “Repeated service interruptions – failure processing documents:
> > > Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:
> > > 500, message: Internal Server Error”.
> > >
> > > From the MCF and SOLR logs (which we report hereafter) it seems
> > > that the problem is arising from Tika and apply to various types of
> > > documents (.rtf, .pdf, etc.).
> > >
> > > How can we fix it?
> > >
> > > Thank you.
> > >
> > > Best regards,
> > >
> > > Luca
> > >
> > > MCF log:
> > >
> > > WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption reported for job 1412340881687 connection 'Webcrawler': Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > >
> > > WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption reported for job 1412252016695 connection 'Webcrawler': Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > > Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> > >
> > > SOLR log:
> > >
> > > 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> > >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> > >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > >     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> > >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> > >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> > >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> > >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> > >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> > >     at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> > >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> > >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> > >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> > >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> > >     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> > >     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> > >     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> > >     at java.lang.Thread.run(Thread.java:745)
> > > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> > >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> > >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> > >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> > >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> > >     ... 20 more
> > > Caused by: org.apache.pdfbox.exceptions.WrappedIOException
> > >     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
> > >     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
> > >     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
> > >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
> > >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> > >     ... 23 more
> > > Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
> > >     at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
> > >     at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
> > >     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
> > >     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> > >     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
> > >     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
> > >     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
> > >     ... 27 more
> > >
> > > 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> > >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> > >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > >     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> > >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> > >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> > >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> > >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> > >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> > >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> > >     at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> > >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> > >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> > >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> > >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> > >     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> > >     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> > >     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> > >     at java.lang.Thread.run(Thread.java:745)
> > > Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> > >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> > >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> > >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> > >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> > >     ... 20 more
> > > Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
> > >     at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
> > >     at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
> > >     at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
> > >     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
> > >     at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
> > >     at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
> > >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> > >     ... 23 more
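
Karl's point in this thread is that a bare HTTP 500 does not say whether Tika failed; only the exception chain in the Solr log does. As a rough illustration, the sketch below pulls the "Caused by:" chain out of a Solr stack trace and checks whether any link comes from a document-parsing package. The function names and the package-prefix list are assumptions made for this example; they are not part of ManifoldCF or Solr.

```python
import re

# Packages treated as "document parsing" code. This list is an assumption
# for illustration, based on the exceptions quoted in this thread.
PARSER_PREFIXES = ("org.apache.tika.", "org.apache.pdfbox.")

# Captures the exception class name that follows each "Caused by:" marker.
_CAUSE = re.compile(r"Caused by:\s*([\w.$]+)")


def cause_chain(trace):
    """Return the exception class names from 'Caused by:' lines, outermost first."""
    return _CAUSE.findall(trace)


def looks_like_parser_failure(trace):
    """True if any 'Caused by:' exception belongs to a parsing package."""
    return any(name.startswith(PARSER_PREFIXES) for name in cause_chain(trace))


if __name__ == "__main__":
    sample = (
        "null:org.apache.solr.common.SolrException: ...\n"
        "Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException\n"
        "Caused by: org.apache.pdfbox.exceptions.WrappedIOException\n"
        "Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047\n"
    )
    print(cause_chain(sample))
    print(looks_like_parser_failure(sample))
```

A trace whose only cause is something like java.lang.OutOfMemoryError would not match, which is the case Karl argues should never be silently skipped.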
