[jira] [Resolved] (CONNECTORS-1307) Tika extractor infinite loop on error

2016-04-30 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1307.
-
Resolution: Fixed

The fix for CONNECTORS-1308 should also have resolved this.


> Tika extractor infinite loop on error
> -
>
> Key: CONNECTORS-1307
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1307
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.4
> Environment: windows 64bit, java version "1.8.0_77", 
> pdfbox-1.8.10.jar, tika-parsers-1.10.jar
>Reporter: Konstantin Avdeev
>Assignee: Karl Wright
> Fix For: ManifoldCF 2.5
>
>
> The Tika extractor gets stuck (it keeps trying to parse the same document again 
> and again) on the following error:
> {code}
> FATAL 2016-04-29 10:55:45,505 (Worker thread '41') - Error tossed: null
> java.lang.StackOverflowError
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:296)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:348)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
>   at 
> org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
> {code}
> -Xss is left at its default, which is, I believe, 512k.
> We can increase the stack size, but I think this error should not lead 
> to such a situation.
> Thanks a lot!
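
For illustration only: the resolution recorded above is the Tika upgrade, not a code change like the following. This is a minimal Java sketch of the behaviour the reporter asks for, where a `StackOverflowError` raised while parsing one document is caught and the document is skipped instead of being retried. The class, the `walk` method and the document names are hypothetical stand-ins and are not ManifoldCF or Tika code.

```java
import java.util.ArrayList;
import java.util.List;

class SkipOnStackOverflow {

    // Hypothetical stand-in for a recursive extractor: one stack frame per nesting level.
    static int walk(int depth, int nesting) {
        return depth >= nesting ? depth : walk(depth + 1, nesting);
    }

    // Processes one document; returns false and records it as skipped if the stack overflows.
    static boolean processOnce(String docId, int nesting, List<String> skipped) {
        try {
            walk(0, nesting);
            return true;
        } catch (StackOverflowError e) {
            // The stack has unwound to this frame, so a small amount of work is safe here.
            skipped.add(docId);
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        processOnce("shallow.pdf", 10, skipped);
        processOnce("deeply-nested.pdf", 50_000_000, skipped);
        System.out.println("skipped: " + skipped);
    }
}
```

Catching an `Error` is normally discouraged; the sketch does so only at the per-document boundary, where the alternative is reprocessing the same input indefinitely.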



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CONNECTORS-1308) Upgrade to Tika 1.12

2016-04-30 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1308.
-
Resolution: Fixed

r1741790


> Upgrade to Tika 1.12
> 
>
> Key: CONNECTORS-1308
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1308
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Tika extractor
>Affects Versions: ManifoldCF 2.5
>Reporter: Karl Wright
>Assignee: Karl Wright
> Fix For: ManifoldCF 2.5
>
>
> Need to upgrade to Tika 1.12.





[jira] [Commented] (CONNECTORS-1305) Windows Share connector: SmbException tossed: 0xC0000205

2016-04-30 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265421#comment-15265421
 ] 

Konstantin Avdeev commented on CONNECTORS-1305:
---

Seems to be working!
before:
{code}
ERROR 2016-04-30 20:35:17,664 (Worker thread '11') - JCIFS: SmbException tossed 
processing 
smb://localhost/share/longDir-0/longDir-1/longDir-2/longDir-3/longDir-4/longDir-5/longDir-6/longDir-7/longDir-8/longDir-9/longDir-10/longDir-11/longDir-12/longDir-13/longDir-14/longDir-15/longDir-16/longDir-17/longDir-18/longDir-19/longDir-20/longDir-21/longDir-22/longDir-23/longDir-24/longDir-25/longDir-26/longDir-27/longDir-28/longDir-29/longDir-30/longDir-31/longDir-32/longDir-33/longDir-34/longDir-35/longDir-36/longDir-37/longDir-38/longDir-39/longDir-40/longDir-41/longDir-42/longDir-43/longDir-44/longDir-45/
jcifs.smb.SmbException: 0xC0000205
 INFO 2016-04-30 20:35:17,671 (Worker thread '11') - Aborting job 1460647853267 
due to error 'SmbException tossed: 0xC0000205'
{code}
after:
{code}
 WARN 2016-04-30 20:38:45,769 (Worker thread '86') - JCIFS: Out of resources 
exception reading document/directory 
smb://localhost/share/longDir-0/longDir-1/longDir-2/longDir-3/longDir-4/longDir-5/longDir-6/longDir-7/longDir-8/longDir-9/longDir-10/longDir-11/longDir-12/longDir-13/longDir-14/longDir-15/longDir-16/longDir-17/longDir-18/longDir-19/longDir-20/longDir-21/longDir-22/longDir-23/longDir-24/longDir-25/longDir-26/longDir-27/longDir-28/longDir-29/longDir-30/longDir-31/longDir-32/longDir-33/longDir-34/longDir-35/longDir-36/longDir-37/longDir-38/longDir-39/longDir-40/longDir-41/longDir-42/longDir-43/longDir-44/longDir-45/
 - skipping
{code}

Thank you for the patch!

> Windows Share connector: SmbException tossed: 0xC0000205
> 
>
> Key: CONNECTORS-1305
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1305
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JCIFS connector
>Affects Versions: ManifoldCF 2.4
> Environment: Windows server 2012
>Reporter: Konstantin Avdeev
>Assignee: Karl Wright
> Fix For: ManifoldCF 2.5
>
> Attachments: CONNECTORS-1305.patch
>
>
> Windows share jobs stop when encountering an [Insufficient server resources 
> exist to complete the 
> request|https://msdn.microsoft.com/en-us/library/cc704588.aspx] server reply 
> (0xC0000205 - STATUS_INSUFF_SERVER_RESOURCES).
> Is it possible to catch that exception as well?
> Thank you!
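
As a rough sketch of the before/after behaviour in the logs above (and not the contents of CONNECTORS-1305.patch, which is not shown in this thread): the Java below treats one specific NT status code as a per-item condition to skip, and rethrows everything else so the job still aborts on unknown errors. `ShareException` and `Reader` are hypothetical stand-ins for the real JCIFS types.

```java
import java.util.ArrayList;
import java.util.List;

class SkipOnServerResources {

    // NTSTATUS value for STATUS_INSUFF_SERVER_RESOURCES, as cited in the issue.
    static final int STATUS_INSUFF_SERVER_RESOURCES = 0xC0000205;

    // Hypothetical stand-in for an SMB exception that carries an NT status code.
    static class ShareException extends Exception {
        final int ntStatus;

        ShareException(int ntStatus) {
            super(String.format("0x%08X", ntStatus));
            this.ntStatus = ntStatus;
        }
    }

    // Hypothetical reader for a single document or directory.
    interface Reader {
        void read(String path) throws ShareException;
    }

    // Reads every path; returns the paths skipped because the server was out of resources.
    static List<String> crawl(List<String> paths, Reader reader) throws ShareException {
        List<String> skipped = new ArrayList<>();
        for (String path : paths) {
            try {
                reader.read(path);
            } catch (ShareException e) {
                if (e.ntStatus != STATUS_INSUFF_SERVER_RESOURCES) {
                    throw e; // unknown failure: let the caller abort the job
                }
                skipped.add(path); // known per-item failure: skip and continue
            }
        }
        return skipped;
    }
}
```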





[jira] [Commented] (CONNECTORS-1308) Upgrade to Tika 1.12

2016-04-30 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265392#comment-15265392
 ] 

Karl Wright commented on CONNECTORS-1308:
-

Well, the Searchblox connector tests fail:

{code}
run-tests:
[junit] Testsuite: 
org.apache.manifoldcf.agents.output.searchblox.SearchBloxDocumentTest
[junit] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.311 sec
[junit]
[junit] Testcase: 
updateJsonString(org.apache.manifoldcf.agents.output.searchblox.SearchBloxDocumentTest):
  FAILED
[junit] expected: but was:
[junit] junit.framework.AssertionFailedError: expected: but 
was:
[junit] at 
org.apache.manifoldcf.agents.output.searchblox.SearchBloxDocumentTest.updateJsonString(SearchBloxDocumentTest.java:204)
[junit]
[junit]
{code}

I doubt however this is due to the Tika changes.






[jira] [Commented] (CONNECTORS-1308) Upgrade to Tika 1.12

2016-04-30 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265377#comment-15265377
 ] 

Karl Wright commented on CONNECTORS-1308:
-

Turns out that the 2.0.1 spec is fine and is consistent with the SearchBlox 
connector.  Somehow, though, we're pulling in a javax.ws.rs.core.Response class 
that is incompatible.  Still trying to figure out where that might be coming 
from.
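
One generic way to see where a class is being loaded from is to ask the JVM for its code source. The Java sketch below is a diagnostic aid under assumptions, not something from the ManifoldCF build: `WhereIsClass` is a hypothetical name, and it has to be run with the same classpath as the failing code for the answer to be meaningful.

```java
import java.security.CodeSource;

class WhereIsClass {

    // Returns the jar or directory a class was loaded from, without initializing the class.
    static String locate(String className) throws ClassNotFoundException {
        Class<?> c = Class.forName(className, false, WhereIsClass.class.getClassLoader());
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader have no code source.
        return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "javax.ws.rs.core.Response";
        try {
            System.out.println(name + " <- " + locate(name));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```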







[jira] [Commented] (CONNECTORS-1308) Upgrade to Tika 1.12

2016-04-30 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265368#comment-15265368
 ] 

Karl Wright commented on CONNECTORS-1308:
-

Having some trouble with resteasy in the searchblox connector.  I need to 
upgrade resteasy to 3.0.16.Final in order to get a javax.ws.rs-api jar that is 
compatible with what Tika needs.  But with that upgrade the connector no longer 
compiles.

I've created a branch to work on this: branches/CONNECTORS-1308







Searchblox connector conflict with latest Tika

2016-04-30 Thread Karl Wright
Hi all,

The current Searchblox connector uses Resteasy, which bundles its own (not
package-forked) copy of javax.ws.rs-api inside one of its jars
(jaxrs-api-3.0.8.Final.jar).  This basically breaks the build, since Tika
1.12 has a subdependency on javax.ws.rs-api.jar version 2.0.1, which is
apparently not compatible with whatever javax API code Resteasy packaged
up.

Because of this, we're going to need to redevelop the Searchblox connector,
or at least perform major surgery on it.  It's possible that later versions
of Resteasy use a more modern api spec -- but still we will need to work on
the connector.

The work so far is in branches/CONNECTORS-1308.  You can reproduce the
problem with:

ant make-core-deps
ant build

Any ideas?
Karl
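
A conflict of this kind, where two jars each carry the same class, can usually be confirmed by listing every copy of the class file visible on the classpath. The following Java sketch shows one way to do that with the standard `ClassLoader.getResources` call. `DuplicateClassFinder` is a hypothetical helper, not part of ManifoldCF, and it only reports what the system class loader can see, so it needs to be run with the connector's classpath.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

class DuplicateClassFinder {

    // Converts a class name to the resource path of its .class file.
    static String resourceName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns the URL of every copy of the class file visible to the system class loader.
    static List<URL> copiesOf(String className) throws IOException {
        Enumeration<URL> found =
            ClassLoader.getSystemClassLoader().getResources(resourceName(className));
        List<URL> urls = new ArrayList<>();
        while (found.hasMoreElements()) {
            urls.add(found.nextElement());
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        String name = args.length > 0 ? args[0] : "javax.ws.rs.core.Response";
        List<URL> copies = copiesOf(name);
        System.out.println(copies.size() + " copies of " + name);
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

More than one URL in the output means the class is provided by several jars, and which copy wins depends on classpath order.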


[jira] [Updated] (CONNECTORS-1307) Tika extractor infinite loop on error

2016-04-30 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1307:

Fix Version/s: ManifoldCF 2.5






[jira] [Assigned] (CONNECTORS-1307) Tika extractor infinite loop on error

2016-04-30 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1307:
---

Assignee: Karl Wright






[jira] [Created] (CONNECTORS-1308) Upgrade to Tika 1.12

2016-04-30 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1308:
---

 Summary: Upgrade to Tika 1.12
 Key: CONNECTORS-1308
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1308
 Project: ManifoldCF
  Issue Type: Task
  Components: Tika extractor
Affects Versions: ManifoldCF 2.5
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 2.5


Need to upgrade to Tika 1.12.



