[jira] [Resolved] (CONNECTORS-1307) Tika extractor infinite loop on error
[ https://issues.apache.org/jira/browse/CONNECTORS-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1307.
-------------------------------------
    Resolution: Fixed

The fix for CONNECTORS-1308 should also have resolved this.

> Tika extractor infinite loop on error
> -------------------------------------
>
>                 Key: CONNECTORS-1307
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1307
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.4
>         Environment: windows 64bit, java version "1.8.0_77",
>                      pdfbox-1.8.10.jar, tika-parsers-1.10.jar
>            Reporter: Konstantin Avdeev
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
> The Tika extractor gets stuck (it tries to parse the same document again and
> again) on the following error:
> {code}
> FATAL 2016-04-29 10:55:45,505 (Worker thread '41') - Error tossed: null
> java.lang.StackOverflowError
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:296)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:348)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>     at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
> {code}
> -Xss is left at the default, which I believe is 512k.
> We can increase the stack size, but I think this error should not lead
> to such a situation.
> Thanks a lot!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
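The failure mode described above (a StackOverflowError escaping the extractor, so the same document is attempted again) can be illustrated with a small sketch. This is not the ManifoldCF or Tika code: `StackOverflowGuard`, `runGuarded`, `recurse` and the `Outcome` enum are hypothetical names made up for the illustration, and `recurse` only stands in for a deeply recursive parse such as the repeated `extractImages` frames in the trace. It assumes the work can be run on a dedicated thread, uses the standard `Thread` constructor that takes a stack size (which the JVM treats as a hint), and turns the error into a "skip" result so the caller has a way to stop retrying.

```java
import java.util.concurrent.atomic.AtomicReference;

public class StackOverflowGuard {

    /** Outcome of one extraction attempt (hypothetical, for illustration only). */
    public enum Outcome { OK, SKIPPED }

    /** Hypothetical stand-in for a deeply recursive parser call. */
    public static int recurse(int depth) {
        return depth == 0 ? 0 : 1 + recurse(depth - 1);
    }

    /**
     * Runs the task on a dedicated thread with the requested stack size and
     * converts a StackOverflowError into SKIPPED instead of letting it escape.
     */
    public static Outcome runGuarded(Runnable task, long stackBytes) {
        AtomicReference<Outcome> result = new AtomicReference<>(Outcome.OK);
        Thread worker = new Thread(null, () -> {
            try {
                task.run();
            } catch (StackOverflowError e) {
                // Record a permanent failure so the caller does not requeue the document.
                result.set(Outcome.SKIPPED);
            }
        }, "guarded-extract", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for extraction", e);
        }
        return result.get();
    }

    public static void main(String[] args) {
        // Shallow recursion completes; unbounded recursion is reported as skipped.
        System.out.println(runGuarded(() -> recurse(100), 512 * 1024));
        System.out.println(runGuarded(() -> recurse(Integer.MAX_VALUE), 512 * 1024));
    }
}
```

Passing a larger `stackBytes` corresponds to raising -Xss for that one thread only; the catch block is what keeps a document that still overflows from being retried indefinitely.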
[jira] [Resolved] (CONNECTORS-1308) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1308.
-------------------------------------
    Resolution: Fixed

r1741790

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: CONNECTORS-1308
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1308
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.5
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
> Need to upgrade to Tika 1.12.
[jira] [Commented] (CONNECTORS-1305) Windows Share connector: SmbException tossed: 0xC0000205
[ https://issues.apache.org/jira/browse/CONNECTORS-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265421#comment-15265421 ]

Konstantin Avdeev commented on CONNECTORS-1305:
-----------------------------------------------

Seems to be working!

Before:
{code}
ERROR 2016-04-30 20:35:17,664 (Worker thread '11') - JCIFS: SmbException tossed processing smb://localhost/share/longDir-0/longDir-1/longDir-2/longDir-3/longDir-4/longDir-5/longDir-6/longDir-7/longDir-8/longDir-9/longDir-10/longDir-11/longDir-12/longDir-13/longDir-14/longDir-15/longDir-16/longDir-17/longDir-18/longDir-19/longDir-20/longDir-21/longDir-22/longDir-23/longDir-24/longDir-25/longDir-26/longDir-27/longDir-28/longDir-29/longDir-30/longDir-31/longDir-32/longDir-33/longDir-34/longDir-35/longDir-36/longDir-37/longDir-38/longDir-39/longDir-40/longDir-41/longDir-42/longDir-43/longDir-44/longDir-45/
jcifs.smb.SmbException: 0xC0000205
INFO 2016-04-30 20:35:17,671 (Worker thread '11') - Aborting job 1460647853267 due to error 'SmbException tossed: 0xC0000205'
{code}

After:
{code}
WARN 2016-04-30 20:38:45,769 (Worker thread '86') - JCIFS: Out of resources exception reading document/directory smb://localhost/share/longDir-0/longDir-1/longDir-2/longDir-3/longDir-4/longDir-5/longDir-6/longDir-7/longDir-8/longDir-9/longDir-10/longDir-11/longDir-12/longDir-13/longDir-14/longDir-15/longDir-16/longDir-17/longDir-18/longDir-19/longDir-20/longDir-21/longDir-22/longDir-23/longDir-24/longDir-25/longDir-26/longDir-27/longDir-28/longDir-29/longDir-30/longDir-31/longDir-32/longDir-33/longDir-34/longDir-35/longDir-36/longDir-37/longDir-38/longDir-39/longDir-40/longDir-41/longDir-42/longDir-43/longDir-44/longDir-45/ - skipping
{code}

Thank you for the patch!

> Windows Share connector: SmbException tossed: 0xC0000205
> --------------------------------------------------------
>
>                 Key: CONNECTORS-1305
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1305
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: JCIFS connector
>    Affects Versions: ManifoldCF 2.4
>         Environment: Windows server 2012
>            Reporter: Konstantin Avdeev
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
>         Attachments: CONNECTORS-1305.patch
>
> Windows share jobs stop when encountering an [Insufficient server resources
> exist to complete the
> request|https://msdn.microsoft.com/en-us/library/cc704588.aspx] server reply
> (0xC0000205 - STATUS_INSUFF_SERVER_RESOURCES).
> Is it possible to catch that exception as well?
> Thank you!
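The behaviour change reported in this comment (the job aborts before the patch, the path is skipped with a warning after it) amounts to classifying one NT status as a per-document condition. The sketch below is only an illustration of that idea and is not the contents of CONNECTORS-1305.patch: `SmbStatusHandling`, `FakeSmbException`, `Action` and `classify` are hypothetical names, and `FakeSmbException` is a stand-in for the SMB library's exception type so the example compiles on its own. The only value taken from the issue is the status code 0xC0000205 (STATUS_INSUFF_SERVER_RESOURCES).

```java
public class SmbStatusHandling {

    /** NT status for STATUS_INSUFF_SERVER_RESOURCES, as quoted in the issue. */
    public static final int STATUS_INSUFF_SERVER_RESOURCES = 0xC0000205;

    /** Hypothetical stand-in for the SMB library's exception; it only carries the NT status. */
    public static class FakeSmbException extends Exception {
        private final int ntStatus;

        public FakeSmbException(int ntStatus) {
            // %08X prints the 32-bit status as unsigned hex, e.g. 0xC0000205.
            super(String.format("0x%08X", ntStatus));
            this.ntStatus = ntStatus;
        }

        public int getNtStatus() {
            return ntStatus;
        }
    }

    /** What the crawler does with the failing path (hypothetical). */
    public enum Action { SKIP_DOCUMENT, ABORT_JOB }

    /** Treats "out of server resources" as a per-document problem; everything else still aborts. */
    public static Action classify(FakeSmbException e) {
        if (e.getNtStatus() == STATUS_INSUFF_SERVER_RESOURCES) {
            return Action.SKIP_DOCUMENT;
        }
        return Action.ABORT_JOB;
    }

    public static void main(String[] args) {
        FakeSmbException outOfResources = new FakeSmbException(0xC0000205);
        System.out.println(outOfResources.getMessage() + " -> " + classify(outOfResources));
    }
}
```

Keeping the default branch as an abort means unknown statuses keep their old behaviour, so only the one status named in the issue is downgraded to a skip.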
[jira] [Commented] (CONNECTORS-1308) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265392#comment-15265392 ]

Karl Wright commented on CONNECTORS-1308:
-----------------------------------------

Well, the Searchblox connector tests fail:

{code}
run-tests:
    [junit] Testsuite: org.apache.manifoldcf.agents.output.searchblox.SearchBloxDocumentTest
    [junit] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.311 sec
    [junit]
    [junit] Testcase: updateJsonString(org.apache.manifoldcf.agents.output.searchblox.SearchBloxDocumentTest): FAILED
    [junit] expected:but was:
    [junit] junit.framework.AssertionFailedError: expected: but was:
    [junit]     at org.apache.manifoldcf.agents.output.searchblox.SearchBloxDocumentTest.updateJsonString(SearchBloxDocumentTest.java:204)
    [junit]
    [junit]
{code}

I doubt, however, that this is due to the Tika changes.

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: CONNECTORS-1308
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1308
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.5
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
> Need to upgrade to Tika 1.12.
[jira] [Commented] (CONNECTORS-1308) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265377#comment-15265377 ]

Karl Wright commented on CONNECTORS-1308:
-----------------------------------------

Turns out that the 2.0.1 spec is fine and is consistent with the SearchBlox connector. Somehow, though, we're pulling in a javax.ws.rs.core.Response class that is incompatible. Still trying to figure out where that might be coming from.

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: CONNECTORS-1308
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1308
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.5
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
> Need to upgrade to Tika 1.12.
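One generic way to find out where an unexpected class such as javax.ws.rs.core.Response is coming from is to ask the JVM for the class's code source. The sketch below is a general Java technique, not something taken from the ManifoldCF build: `WhichJar` and `locationOf` are hypothetical names, and the example defaults to inspecting itself so that it runs without any extra jars on the classpath. Run with the suspect class name as the argument, on the same classpath as the failing build or test, it would print the jar or directory that class was loaded from.

```java
import java.security.CodeSource;

public class WhichJar {

    /** Returns the jar or directory a class was loaded from, or a marker for JDK runtime classes. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes that belong to the JDK runtime itself typically have no code source.
        if (src == null || src.getLocation() == null) {
            return "(bootstrap/JDK runtime)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, e.g. javax.ws.rs.core.Response, as the first argument.
        String name = args.length > 0 ? args[0] : WhichJar.class.getName();
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

If two jars ship the same class, the location printed is the one that won on the classpath, which is usually the piece of information needed to decide which dependency to exclude or upgrade.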
[jira] [Commented] (CONNECTORS-1308) Upgrade to Tika 1.12
[ https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265368#comment-15265368 ]

Karl Wright commented on CONNECTORS-1308:
-----------------------------------------

Having some trouble with resteasy in the searchblox connector. I need to upgrade resteasy to 3.0.16.Final so that its javax.ws.rs-api jar is compatible with what Tika needs, but with that version the connector no longer compiles.

I've created a branch to work on this: branches/CONNECTORS-1308

> Upgrade to Tika 1.12
> --------------------
>
>                 Key: CONNECTORS-1308
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1308
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.5
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
>
> Need to upgrade to Tika 1.12.
Searchblox connector conflict with latest Tika
Hi all,

The current Searchblox connector uses Resteasy, which bundles its own (not package-forked) copy of javax.ws.rs-api in one of its jars (jaxrs-api-3.0.8.Final.jar). This basically breaks the build, since Tika 1.12 has a subdependency on javax.ws.rs-api.jar version 2.0.1, which is apparently not compatible with whatever javax API code Resteasy packaged up.

Because of this, we're going to need to redevelop the Searchblox connector, or at least perform major surgery on it. It's possible that later versions of Resteasy use a more modern API spec, but we will still need to work on the connector.

The work so far is in branches/CONNECTORS-1308. You can reproduce the problem with:

    ant make-core-deps
    ant build

Any ideas?

Karl
[jira] [Updated] (CONNECTORS-1307) Tika extractor infinite loop on error
[ https://issues.apache.org/jira/browse/CONNECTORS-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1307:
------------------------------------
    Fix Version/s: ManifoldCF 2.5

> Tika extractor infinite loop on error
> -------------------------------------
>
>                 Key: CONNECTORS-1307
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1307
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.4
>         Environment: windows 64bit, java version "1.8.0_77",
>                      pdfbox-1.8.10.jar, tika-parsers-1.10.jar
>            Reporter: Konstantin Avdeev
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.5
[jira] [Assigned] (CONNECTORS-1307) Tika extractor infinite loop on error
[ https://issues.apache.org/jira/browse/CONNECTORS-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-1307:
---------------------------------------
    Assignee: Karl Wright

> Tika extractor infinite loop on error
> -------------------------------------
>
>                 Key: CONNECTORS-1307
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1307
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.4
>         Environment: windows 64bit, java version "1.8.0_77",
>                      pdfbox-1.8.10.jar, tika-parsers-1.10.jar
>            Reporter: Konstantin Avdeev
>            Assignee: Karl Wright
[jira] [Created] (CONNECTORS-1308) Upgrade to Tika 1.12
Karl Wright created CONNECTORS-1308:
------------------------------------
             Summary: Upgrade to Tika 1.12
                 Key: CONNECTORS-1308
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1308
             Project: ManifoldCF
          Issue Type: Task
          Components: Tika extractor
    Affects Versions: ManifoldCF 2.5
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 2.5

Need to upgrade to Tika 1.12.