Thanks for your reply. I actually started this thread to find a way to extract content from a chunked portion of a file.
Does Tika support extracting content from a file chunk?

Regards,
Raghu.
________________________________________
From: Sergey Beryozkin <[email protected]>
Sent: Monday, February 29, 2016 7:23 PM
To: [email protected]
Subject: Re: Unable to extract content from chunked portion of large file

Well, it is a different issue now; the server is processing a 250MB payload and throws an error:
org.apache.tika.exception.TikaException: Zip bomb detected!
So maybe you need to start a new thread...

Cheers, Sergey

On 29/02/16 13:49, raghu vittal wrote:
> It is working, thanks.
>
> I have tried sending a 250MB file using multipart/form-data, and it is giving an exception.
>
> ERROR:
> Feb 29, 2016 7:07:27 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika/form (autodetecting type)
> Feb 29, 2016 7:09:02 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika/form: Text extraction failed
> org.apache.tika.exception.TikaException: Zip bomb detected!
> at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
> ... 31 more
> Feb 29, 2016 7:09:02 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
> Feb 29, 2016 7:09:02 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: Could not send Message.
> ... 24 more
> Caused by: java.io.IOException: An established connection was aborted by the software in your host machine
> ... 35 more
>
> I have also tried to get a chunk of the file data and pass it to Tika using multipart/form-data, and I get an exception.
>
> ERROR:
> Feb 29, 2016 7:02:43 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
> Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika/form (autodetecting type)
> Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika/form: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@41530372
> Feb 29, 2016 7:04:30 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>
> We are stuck handling these scenarios. In our production we have documents of this size, and we need to handle them.
>
> Please help us.
>
> Regards,
> Raghu.
>
> ________________________________________
> From: Sergey Beryozkin <[email protected]>
> Sent: Monday, February 29, 2016 6:50 PM
> To: [email protected]
> Subject: Re: Unable to extract content from chunked portion of large file
>
> Hi
>
> In the first case it should be
>
> http://localhost:9998/tika/form
>
> Sergey
> On 29/02/16 13:09, raghu vittal wrote:
>> Hi Ken,
>>
>> These are my observations.
>>
>> Scenario 1
>>
>> Tika URL: http://localhost:9998/tika
>>
>> I have tried the multipart/form-data approach suggested by Sergey.
>> I am getting the below error (we are using the Tika 1.11 server):
>>
>> var data = File.ReadAllBytes(filename);
>> using (var client = new HttpClient())
>> {
>>     using (var content = new MultipartFormDataContent())
>>     {
>>         ByteArrayContent byteArrayContent = new ByteArrayContent(data);
>>         byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
>>         content.Add(byteArrayContent);
>>         var str = client.PutAsync(tikaServerUrl, content).Result.Content.ReadAsStringAsync().Result;
>>     }
>> }
>>
>> ERROR:
>>
>> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource logRequest
>> INFO: tika (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")
>> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
>> WARNING: tika: Text extraction failed
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@36b1a1ec
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>> ................................................
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
>> at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:116)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ... 32 more
>>
>> Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>>
>> I think Tika does not support POST requests.
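As Sergey noted, multipart bodies belong on /tika/form; the plain /tika endpoint expects the raw document bytes, which is consistent with it answering a multipart request with HTTP 415. For reference, here is a sketch (Python, purely illustrative of the wire format; the part name "file" is an assumption, not something verified against the Tika 1.11 source) of what a well-formed multipart/form-data body looks like, including the Content-Disposition header that names the part:

```python
# Build a multipart/form-data body by hand to show the headers each part
# carries. The part name "file" is an assumed example, not taken from Tika.
import uuid

def build_multipart(field_name, filename, payload):
    boundary = uuid.uuid4().hex
    body = (
        b"--" + boundary.encode() + b"\r\n"
        + b'Content-Disposition: form-data; name="' + field_name.encode()
        + b'"; filename="' + filename.encode() + b'"\r\n'
        + b"Content-Type: application/octet-stream\r\n\r\n"
        + payload + b"\r\n"
        + b"--" + boundary.encode() + b"--\r\n"
    )
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, body

# The Content-Disposition line is how a server locates the part; a bare
# ByteArrayContent added without a name may not produce one.
content_type, body = build_multipart("file", "report.xlsx", b"PK\x03\x04")
```

A C# `MultipartFormDataContent.Add(content, "file", filename)` overload would produce the equivalent headers on the .NET side.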
>>
>> Passing a 240 MB file to Tika for content extraction gives me the errors below.
>>
>> Scenario 2
>>
>> Tika URL: http://localhost:9998/unpack/all
>>
>> Rather than ReadAsStringAsync() I have used ReadAsStreamAsync() and captured the output stream into a ZipArchive.
>>
>> ERROR:
>>
>> Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class java.util.HashMap, ContentType: application/zip
>> Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault
>> at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteException(JAXRSOutInterceptor.java:363)
>> ... 41 more
>>
>> Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>> at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>> Caused by: com.ctc.wstx.exc.WstxIOException: null
>> at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
>> at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
>> ... 26 more
>>
>> Scenario 3
>>
>> Tika URL: http://localhost:9998/tika
>>
>> ERROR:
>>
>> Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource logRequest
>> INFO: tika (autodetecting type)
>> Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
>> WARNING: tika: Text extraction failed
>> org.apache.tika.exception.TikaException: Zip bomb detected!
>> at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>> ... 31 more
>>
>> Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault: Could not send Message.
>> at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
>> ... 31 more
>>
>> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>> at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>> at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>> ....
>>
>> I was able to extract the content from an 80 MB document.
>>
>> If I split the large file into chunks and pass them to Tika, it gives me exceptions.
>>
>> I am building the solution in .NET.
>>
>> Regards,
>> Raghu.
>>
>> ------------------------------------------------------------------------
>> *From:* Ken Krugler <[email protected]>
>> *Sent:* Saturday, February 27, 2016 6:22 AM
>> *To:* [email protected]
>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>
>> Hi Raghu,
>>
>> Previously you'd said
>>
>> "sending very large files to Tika will cause out of memory exception"
>>
>> and
>>
>> "sending that large file to Tika will causing timeout issues"
>>
>> I assume these are two different issues, as the second one seems related to how you're connecting to the Tika server via HTTP, correct?
>>
>> For out of memory issues, I'd suggested creating an input stream that can read from a chunked file *stored on disk*, thus alleviating at least part of the memory usage constraint. If the problem is that the resulting extracted text is also too big for memory, and you need to send it as a single document to Elasticsearch, then that's a separate (non-Tika) issue.
>>
>> For the timeout when sending the file to the Tika server, Sergey has already mentioned that you should be able to send it as multipart/form-data. And that will construct a temp file on disk from the chunks, and (I assume) stream it to Tika, so that also would take care of the same memory issue on the input side.
>>
>> Given the above, it seems like you've got enough ideas to try to solve this issue, yes?
>>
>> Regards,
>>
>> -- Ken
>>
>>> ------------------------------------------------------------------------
>>> *From:* raghu vittal
>>> *Sent:* February 24, 2016 10:50:29pm PST
>>> *To:* [email protected]
>>> *Subject:* Re: Unable to extract content from chunked portion of large file
>>>
>>> Hi Ken,
>>>
>>> Thanks for the reply.
>>> I understood your point.
>>>
>>> What I have tried:
>>>
>>>> byte[] srcBytes = File.ReadAllBytes(filePath);
>>>> Get a chunk of 1 MB out of srcBytes.
>>>> When I pass this 1 MB chunk to Tika, it gives me the error.
>>>> As per the wiki, Tika needs the entire file to extract content.
>>>
>>> This is where I am stuck. I don't want to pass the entire file to Tika.
>>>
>>> Correct me if I am wrong.
>>>
>>> --Raghu.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Ken Krugler <[email protected]>
>>> *Sent:* Wednesday, February 24, 2016 9:07 PM
>>> *To:* [email protected]
>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>>
>>> Hi Raghu,
>>>
>>> I don't think you understood what I was proposing.
>>>
>>> I suggested creating a service that could receive chunks of the file (persisted to local disk). Then this service could implement an input stream class that would read sequentially from these pieces. This input stream would be passed to Tika, thus giving Tika a single continuous stream of data to the entire file content.
>>>
>>> -- Ken
>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* raghu vittal
>>>> *Sent:* February 24, 2016 4:32:01am PST
>>>> *To:* [email protected]
>>>> *Subject:* Re: Unable to extract content from chunked portion of large file
>>>>
>>>> Thanks for your reply.
>>>>
>>>> In our application users can upload large files. Our intention is to extract the content out of large files and dump it in Elasticsearch for content-based search.
>>>> We have > 300 MB .xlsx and .doc files. Sending such a large file to Tika causes timeout issues.
>>>>
>>>> I tried getting a chunk of the file and passing it to Tika. Tika gave me an invalid data exception.
>>>>
>>>> I think for Tika we need to pass the entire file at once to extract content.
>>>>
>>>> Raghu.
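Ken's proposal above — persist the uploaded chunks to disk, then expose them to the parser as one continuous stream — can be sketched as follows. This is an illustrative Python sketch (the thread's code is C#); the chunk file names and layout are assumptions, not from the thread:

```python
# Sketch of a stream that reads sequentially across an ordered list of
# chunk files, presenting them as one continuous byte stream.
import io
import os
import tempfile

class ChunkedFileStream(io.RawIOBase):
    """Read sequentially across an ordered list of chunk files."""

    def __init__(self, chunk_paths):
        super().__init__()
        self._paths = list(chunk_paths)
        self._index = 0
        self._current = open(self._paths[0], "rb") if self._paths else None

    def readable(self):
        return True

    def readinto(self, b):
        while self._current is not None:
            n = self._current.readinto(b)
            if n:
                return n
            # Current chunk exhausted: close it and move to the next one.
            self._current.close()
            self._index += 1
            if self._index < len(self._paths):
                self._current = open(self._paths[self._index], "rb")
            else:
                self._current = None
        return 0  # EOF after the last chunk

# Usage: write three chunk files, then read them back as one stream.
tmp = tempfile.mkdtemp()
paths = []
for i, part in enumerate([b"PK\x03\x04", b"-middle-", b"-end"]):
    p = os.path.join(tmp, "chunk-%03d" % i)
    with open(p, "wb") as f:
        f.write(part)
    paths.append(p)

data = io.BufferedReader(ChunkedFileStream(paths)).read()
```

The same idea in .NET would be a custom `Stream` whose `Read` advances through the chunk files; the parser then sees the complete document without the whole file ever being held in memory.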
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Ken Krugler <[email protected]>
>>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>>> *To:* [email protected]
>>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>>>
>>>> One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).
>>>>
>>>> -- Ken
>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* raghu vittal
>>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>>> *To:* [email protected]
>>>>> *Subject:* Unable to extract content from chunked portion of large file
>>>>>
>>>>> Hi All,
>>>>>
>>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data in Elasticsearch for full-text search. Sending very large files to Tika causes an out of memory exception.
>>>>>
>>>>> We want to chunk the file and send it to Tika for content extraction. When we passed a chunked portion of a file to Tika, it gave empty text.
>>>>> I assume Tika relies on the file structure; that is why it is not giving any content.
>>>>>
>>>>> We are using the Tika Server (REST API) in our .NET application.
>>>>>
>>>>> Please suggest a better approach for this scenario.
>>>>>
>>>>> Regards,
>>>>> Raghu.
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>
> --
> Sergey Beryozkin
>
> Talend Community Coders
> http://coders.talend.com/

--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
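To make the "Tika needs the whole file" point concrete: .xlsx and .docx are ZIP containers, and a ZIP cannot be opened from a leading chunk because its central directory lives at the end of the file. A quick illustration (Python stdlib here, but the same applies on the .NET side):

```python
# Why a raw 1 MB slice of a .xlsx fails: OOXML files are ZIP containers,
# and ZIP readers need the central directory stored at the END of the
# file, so a leading chunk is structurally incomplete.
import io
import zipfile

# Build a tiny ZIP in memory as a stand-in for a .xlsx.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("sheet1.xml", "<data>hello</data>" * 100)
whole = buf.getvalue()

# The complete file parses fine...
assert zipfile.ZipFile(io.BytesIO(whole)).namelist() == ["sheet1.xml"]

# ...but the first "chunk" alone does not: the end-of-central-directory
# record was cut off, so the reader rejects it outright.
chunk = whole[: len(whole) // 2]
try:
    zipfile.ZipFile(io.BytesIO(chunk))
    broken = False
except zipfile.BadZipFile:
    broken = True
assert broken
```

This is why any chunking scheme has to reassemble the complete byte sequence (on disk or as a concatenating stream) before handing it to the parser.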
