What I meant was that, while working on issue 1, you now see Tika reporting a Zip Bomb error. This is different from the issue you were facing initially, which was an out-of-memory (OOM) error.

Thus, we can assume that the 1st option for submitting a massive file (where you submit it as multipart/form-data) works, except that you now see a Zip Bomb error, which is about the Zip content being problematic. This is what I suggested you investigate in a different thread. One thing I can suggest: with an HTTP client sending a massive payload, it makes sense to set the connect and receive timeouts on the client side to some large values; check the HttpClient docs on how to do it. The stack trace was saying something about the connection being aborted. Low receive/connect timeouts might have caused that, which in turn might have led Tika to mistakenly report a Zip Bomb...

So, try the 1st option again with the timeouts set on the client side. If you still see a Zip Bomb error, investigate it separately, and also continue looking at the other options suggested in this thread...

HTH, Sergey


On 29/02/16 14:41, raghu vittal wrote:
Thanks for your reply.

I actually started this thread to find a way to extract content from a
chunked portion of a file.

Does Tika support extracting content from a chunk of a file?

Regards,
Raghu.
________________________________________
From: Sergey Beryozkin <[email protected]>
Sent: Monday, February 29, 2016 7:23 PM
To: [email protected]
Subject: Re: Unable to extract content from chunked portion of large file

Well, it is a different issue now, the server is processing a 250MB
payload and throws an error:

org.apache.tika.exception.TikaException: Zip bomb detected!

So maybe you need to start a new thread...

Cheers, Sergey
On 29/02/16 13:49, raghu vittal wrote:
It is working, thanks.

I have tried sending a 250 MB file using multipart/form-data, and it gives an
exception.

ERROR:
Feb 29, 2016 7:07:27 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika/form (autodetecting type)
Feb 29, 2016 7:09:02 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika/form: Text extraction failed
org.apache.tika.exception.TikaException: Zip bomb detected!
          at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
          ... 31 more
Feb 29, 2016 7:09:02 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 7:09:02 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
          ... 24 more
Caused by: java.io.IOException: An established connection was aborted by the software in your host machine
          ... 35 more


I have also tried taking a chunk of the file data and passing it to Tika using
multipart/form-data, which also gives an exception.

ERROR:
Feb 29, 2016 7:02:43 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika/form (autodetecting type)
Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika/form: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@41530372
Feb 29, 2016 7:04:30 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain


We are stuck handling these scenarios. In production we have documents
of this size, and we need to handle them.

please help us.

Regards,
Raghu.

________________________________________
From: Sergey Beryozkin <[email protected]>
Sent: Monday, February 29, 2016 6:50 PM
To: [email protected]
Subject: Re: Unable to extract content from chunked portion of large file

Hi

In the first case it should be

http://localhost:9998/tika/form

Sergey
On 29/02/16 13:09, raghu vittal wrote:
Hi ken,


these are my observations ..


scenario -1


Tika Url : http://localhost:9998/tika


I have tried the multipart/form-data approach suggested by Sergey. I am getting
the error below (we are using the Tika 1.11 server).

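// Loads the whole file into memory before sending it.
var data = File.ReadAllBytes(filename);
using (var client = new HttpClient())
{
    using (var content = new MultipartFormDataContent())
    {
        // Wrap the file bytes as a single part of the multipart/form-data body.
        ByteArrayContent byteArrayContent = new ByteArrayContent(data);
        byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
        content.Add(byteArrayContent);
        // Blocks until the server responds, then reads the response body as text.
        var str = client.PutAsync(tikaServerUrl, content).Result.Content.ReadAsStringAsync().Result;
    }
}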

*ERROR*:

Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")

Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@36b1a1ec
           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
       ................................................
           at java.lang.Thread.run(Thread.java:745)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
           at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:116)
           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
           ... 32 more

Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain

I think TIKA does not support POST request.


*Passing a 240 MB file to Tika for content extraction gives me the
following errors.*

*Scenario -2*


Tika Url : http://localhost:9998/unpack/all


Rather than ReadAsStringAsync() I have used ReadAsStreamAsync() and
captured the output stream in a "ZipArchive".


*ERROR:*

Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class java.util.HashMap, ContentType: application/zip
Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault
           at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteException(JAXRSOutInterceptor.java:363)
           ... 41 more

Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
           at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
Caused by: com.ctc.wstx.exc.WstxIOException: null
           at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
           at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
           ... 26 more

*Scenario -3*

Tika url : http://localhost:9998/tika


*ERROR:*
Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (autodetecting type)
Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Zip bomb detected!
           at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
           ... 31 more

Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
           at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
           ... 31 more

Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
           at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
           at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
....


I was able to extract the content from an 80 MB document.

If I split the large file into chunks and pass them to Tika, it gives me
exceptions.

I am building the solution in .NET.

Regards,
Raghu.


------------------------------------------------------------------------
*From:* Ken Krugler <[email protected]>
*Sent:* Saturday, February 27, 2016 6:22 AM
*To:* [email protected]
*Subject:* RE: Unable to extract content from chunked portion of large file
Hi Raghu,

Previously you'd said

"sending very large files to Tika will cause out of memory exception"

and

"sending that large file to Tika will causing timeout issues"

I assume these are two different issues, as the second one seems related
to how you're connecting to the Tika server via HTTP, correct?

For out of memory issues, I'd suggested creating an input stream that
can read from a chunked file *stored on disk*, thus alleviating at least
part of the memory usage constraint. If the problem is that the
resulting extracted text is also too big for memory, and you need to
send it as a single document to Elasticsearch, then that's a separate
(non-Tika) issue.

For the timeout when sending the file to the Tika server, Sergey has
already mentioned that you should be able to send it
as multipart/form-data. And that will construct a temp file on disk from
the chunks, and (I assume) stream it to Tika, so that also would take
care of the same memory issue on the input side.

Given the above, it seems like you've got enough ideas to try to solve
this issue, yes?

Regards,

-- Ken

------------------------------------------------------------------------

*From:* raghu vittal

*Sent:* February 24, 2016 10:50:29pm PST

*To:* [email protected] <mailto:[email protected]>

*Subject:* Re: Unable to extract content from chunked portion of large
file


Hi Ken,

Thanks for the reply.
i understood your point.

what i have tried.

   byte[] srcBytes = File.ReadAllBytes(filePath);

Then I take a 1 MB chunk out of srcBytes.

When I pass this 1 MB chunk to Tika, it gives me the error.

As per the wiki, Tika needs the entire file to extract content.

This is where I am stuck. I don't want to pass the entire file to Tika.

correct me if i am wrong.
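
A minimal Java sketch of why a leading chunk of a .docx or .xlsx file is unlikely to parse on its own. These formats are ZIP packages, and a ZIP archive keeps its index (the central directory) at the end of the file, so a chunk cut from the front has no index to read. The class and method names below (TruncatedZipDemo, writeSampleZip, firstHalf, opensAsZip) are made up for illustration, and the code uses only java.util.zip. It says nothing about how Tika's OOXML parser works internally.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Tika.
final class TruncatedZipDemo {
    private TruncatedZipDemo() {}

    /** Writes a small ZIP with one entry, standing in for a .docx/.xlsx package. */
    static Path writeSampleZip() throws IOException {
        Path zip = Files.createTempFile("sample", ".zip");
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            zos.write("<w:t>some text</w:t>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return zip;
    }

    /** Copies only the first half of the file, like taking a leading chunk. */
    static Path firstHalf(Path source) throws IOException {
        byte[] all = Files.readAllBytes(source);
        Path chunk = Files.createTempFile("chunk", ".zip");
        Files.write(chunk, Arrays.copyOf(all, all.length / 2));
        return chunk;
    }

    /** Returns true if the file can be opened as a ZIP archive with at least one entry. */
    static boolean opensAsZip(Path file) throws IOException {
        try (ZipFile zf = new ZipFile(file.toFile())) {
            return zf.size() > 0;
        } catch (ZipException e) {
            // Typically raised because the end-of-archive record is missing.
            return false;
        }
    }
}
```

With these helpers, the complete sample archive opens while its first half is rejected, which is consistent with a parser needing the whole file rather than an arbitrary slice of bytes.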

--Raghu.

------------------------------------------------------------------------
*From:*Ken Krugler <[email protected]
<mailto:[email protected]>>
*Sent:*Wednesday, February 24, 2016 9:07 PM
*To:*[email protected] <mailto:[email protected]>
*Subject:*RE: Unable to extract content from chunked portion of large
file
Hi Raghu,

I don't think you understood what I was proposing.

I suggested creating a service that could receive chunks of the file
(persisted to local disk). Then this service could implement an input
stream class that would read sequentially from these pieces. This
input stream would be passed to Tika, thus giving Tika a single
continuous stream of data to the entire file content.
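
A rough Java sketch of the input stream part, assuming the service has already saved the chunks to local disk as separate files and knows their order. ChunkedFileStream and open are hypothetical names, and the standard java.io.SequenceInputStream is just one way to do it, not something Tika requires.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class ChunkedFileStream {
    private ChunkedFileStream() {}

    /** Returns one stream that reads the chunk files back to back, in list order. */
    static InputStream open(List<Path> chunksInOrder) {
        Iterator<Path> it = chunksInOrder.iterator();
        // Each chunk file is opened only when the previous one is exhausted,
        // so at most one chunk file is open at any time.
        Enumeration<InputStream> streams = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    return Files.newInputStream(it.next());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
        return new SequenceInputStream(streams);
    }
}
```

The returned stream could then be handed to the parser in place of a stream over a single file. The parser still sees the complete file content, only read from several pieces on disk instead of one byte array in memory.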

-- Ken

------------------------------------------------------------------------
*From:*raghu vittal
*Sent:*February 24, 2016 4:32:01am PST
*To:*[email protected] <mailto:[email protected]>
*Subject:*Re: Unable to extract content from chunked portion of large
file

Thanks for your reply.

In our application users can upload large files. Our intention is to
extract the content of a large file and index it in Elastic for
content-based search.
We have .xlsx and .doc files larger than 300 MB. Sending such a large file
to Tika causes timeout issues.

I tried taking a chunk of the file and passing it to Tika. Tika gave me an
invalid data exception.

I think we need to pass Tika the entire file at once to extract content.

Raghu.

------------------------------------------------------------------------
*From:*Ken Krugler <[email protected]
<mailto:[email protected]>>
*Sent:*Friday, February 19, 2016 8:22 PM
*To:*[email protected] <mailto:[email protected]>
*Subject:*RE: Unable to extract content from chunked portion of large
file
One option is to create your own RESTful API that lets you send
chunks of the file, and then you can provide an input stream that
provides the seamless data view of the chunks to Tika (which is what
it needs).

-- Ken

------------------------------------------------------------------------
*From:*raghu vittal
*Sent:*February 19, 2016 1:37:49am PST
*To:*[email protected] <mailto:[email protected]>
*Subject:*Unable to extract content from chunked portion of large file

Hi All

We have very large PDF, .docx, and .xlsx files. We are using Tika to extract
content and index the data in Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content
extraction. When we passed a chunked portion of a file to Tika, it
returned empty text.
I assume Tika relies on the file structure, which is why it is not returning
any content.

We are using Tika Server (REST API) in our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/


