I opened TIKA-2384 for this. Let’s move discussion there.

From: Luís Filipe Nassif [mailto:[email protected]]
Sent: Friday, June 2, 2017 9:00 AM
To: [email protected]
Subject: RE: "Stream closed" error when extracting text using Tika Server

I think resources should be closed where they are opened, like the parser.parse() API contract, no?

Luis

On Jun 2, 2017 8:27 AM, "Allison, Timothy B." <[email protected]> wrote:

Haris is correct. The static "parse()" closes the InputStream, so we shouldn't wrap the call to parse in an autoclosing try:

    try (InputStream is = xyz) { TikaResource.parse(...) }

Once I remove the autoclosing try, the test passes.

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Friday, June 2, 2017 7:20 AM
To: [email protected]
Subject: Re: "Stream closed" error when extracting text using Tika Server

Hi Tim,

sorry, I'm not sure now what I was planning to fix :-), I've looked at the source again and it is not a case of InputStream returned directly from the method...

try/catch will most likely work better, though maybe it would hide some issue to do with some of the parsers closing the stream early somewhere...

Thanks, Sergey

On 02/06/17 12:13, Allison, Timothy B. wrote:
> Thank you for sharing this with us.
>
> Oddly, I’m able to reproduce this with our 2pic.docx test file, but
> not with our “test_recursive_embedded.docx”.
>
> Please open a ticket on our JIRA.
>
> *From:* Haris Osmanagic [mailto:[email protected]]
> *Sent:* Friday, June 2, 2017 6:28 AM
> *To:* [email protected]
> *Subject:* "Stream closed" error when extracting text using Tika Server
>
> Hi everyone!
>
> I am using Tika Server, and I have faced a weird thing when extracting
> text and requiring a plain text response.
> Tests can be found here:
> https://github.com/hariso/tika/commit/2a0dc37a4427070360c7ebe147712d9c873a4e7b
>
> *Version used*: 1.15
>
> *File used*: Any I tried (MS Word, DOCX, PDF)
>
> *Method used*: Multipart upload, using Accept: text/plain
>
> *Expected result*: extracted text
>
> *Actual result*: extracted text PLUS an error saying
>
> <ns1:XMLFault xmlns:ns1="http://cxf.apache.org/bindings/xformat"><ns1:faultstring xmlns:ns1="http://cxf.apache.org/bindings/xformat">java.io.IOException: Stream Closed</ns1:faultstring></ns1:XMLFault>
>
> Looking at the code, it seems like the method used for producing text
> is using try-with-resources
> <https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L408-L411>,
> and the used input stream has already been closed. The method used for
> producing XML doesn't do it
> <https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L476>.
>
> In my use case, the parsed text is processed in an additional step, where
> using XML/HTML is not really desired, hence I cannot use it as a
> workaround (at least not now).
>
> Any help or comments are appreciated!
>
> Haris

--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/
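
The double close discussed in this thread (a parse method that closes its InputStream, called from inside a try-with-resources block that closes it again) can be illustrated with a minimal, self-contained Java sketch. Nothing here is Tika code: `DoubleCloseDemo`, `StrictStream`, `parse`, `wrapped`, and `unwrapped` are hypothetical names, and `StrictStream` is an assumption standing in for a stream that rejects a second `close()`. Most JDK streams tolerate a repeated close, so whether the stream inside Tika Server behaves this way is not confirmed by the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class DoubleCloseDemo {

    /** Hypothetical stream that throws if close() is called a second time. */
    static final class StrictStream extends FilterInputStream {
        private boolean closed;

        StrictStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            if (closed) {
                throw new IOException("Stream Closed");
            }
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a parse method that closes the stream it is given. */
    static void parse(InputStream in, StringBuilder out) throws IOException {
        try {
            // readAllBytes() requires Java 9 or later.
            out.append(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } finally {
            in.close(); // first close
        }
    }

    /** Caller also closes the stream via try-with-resources: second close. */
    static String wrapped(byte[] data) {
        StringBuilder out = new StringBuilder();
        try (InputStream in = new StrictStream(new ByteArrayInputStream(data))) {
            parse(in, out);
        } catch (IOException e) {
            // The text was already produced; the error arrives afterwards.
            out.append(" [").append(e.getMessage()).append("]");
        }
        return out.toString();
    }

    /** Caller leaves closing to parse(): the stream is closed exactly once. */
    static String unwrapped(byte[] data) {
        StringBuilder out = new StringBuilder();
        try {
            parse(new StrictStream(new ByteArrayInputStream(data)), out);
        } catch (IOException e) {
            out.append(" [").append(e.getMessage()).append("]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(wrapped(data));   // hello [Stream Closed]
        System.out.println(unwrapped(data)); // hello
    }
}
```

In the sketch, the wrapped variant yields the extracted text followed by an error, which resembles the reported symptom. The unwrapped variant corresponds to removing the autoclosing try. The opposite design, where the method that opens the stream is the only one that closes it, would require `parse` to stop closing its argument.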
