It is working, thanks.
I have tried sending a 250 MB file using multipart/form-data and it throws an
exception.
ERROR:
Feb 29, 2016 7:07:27 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika/form (autodetecting type)
Feb 29, 2016 7:09:02 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika/form: Text extraction failed
org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
        ... 31 more
Feb 29, 2016 7:09:02 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 7:09:02 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
        ... 24 more
Caused by: java.io.IOException: An established connection was aborted by the software in your host machine
        ... 35 more
I have also tried getting a chunk of the file data and passing it to Tika
using multipart/form-data, and I get an exception.
ERROR:
Feb 29, 2016 7:02:43 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika/form (autodetecting type)
Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika/form: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@41530372
Feb 29, 2016 7:04:30 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
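The chunk failure above is expected: .docx and .xlsx files are ZIP containers, so an arbitrary slice of the file is not a parseable document on its own. A quick illustration of this (Python stdlib only, used here purely for demonstration; the same holds for any client language):

```python
import io
import zipfile

# Build a small in-memory ZIP, standing in for a .docx/.xlsx container.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>" * 1000)
whole = buf.getvalue()

# The complete archive opens fine.
with zipfile.ZipFile(io.BytesIO(whole)) as zf:
    assert zf.namelist() == ["word/document.xml"]

# A leading slice of the same bytes is not a valid archive at all --
# essentially what Tika sees when handed a fragment of a large file.
try:
    zipfile.ZipFile(io.BytesIO(whole[: len(whole) // 2]))
    chunk_is_valid = True
except zipfile.BadZipFile:
    chunk_is_valid = False

print("chunk is a valid ZIP:", chunk_is_valid)  # → chunk is a valid ZIP: False
```

This is why a parser needs the whole file (or a stream over the whole file), not independent fragments.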
We are stuck handling these scenarios. In production we have documents of
this size, and we need to handle them.
Please help us.
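For what it is worth, Ken's earlier suggestion in this thread (persist the uploaded chunks to disk, then hand Tika one continuous stream over them) can be sketched roughly as below. Python is used only for illustration and every name is made up; the same pattern works as a custom Stream in .NET or an InputStream in Java:

```python
import io
import os
import tempfile

class ChunkedFileStream(io.RawIOBase):
    """Read a sequence of chunk files on disk as one continuous stream."""

    def __init__(self, chunk_paths):
        self.chunk_paths = list(chunk_paths)
        self.index = 0
        self.current = open(self.chunk_paths[0], "rb") if self.chunk_paths else None

    def readable(self):
        return True

    def readinto(self, b):
        while self.current is not None:
            n = self.current.readinto(b)
            if n:
                return n
            # Current chunk exhausted: advance to the next chunk file.
            self.current.close()
            self.index += 1
            if self.index < len(self.chunk_paths):
                self.current = open(self.chunk_paths[self.index], "rb")
            else:
                self.current = None
        return 0  # end of all chunks

# Demo: split some bytes into chunk files, then read them back as one stream.
payload = os.urandom(1_000_000)
chunk_size = 64 * 1024
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(0, len(payload), chunk_size):
    path = os.path.join(tmpdir, f"chunk-{i:08d}")
    with open(path, "wb") as f:
        f.write(payload[i : i + chunk_size])
    paths.append(path)

stream = io.BufferedReader(ChunkedFileStream(paths))
assert stream.read() == payload  # Tika would consume this stream instead
```

The point is that Tika (or a service in front of it) sees a single seamless byte stream, while the client never needs the whole file in memory at once.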
Regards,
Raghu.
________________________________________
From: Sergey Beryozkin <[email protected]>
Sent: Monday, February 29, 2016 6:50 PM
To: [email protected]
Subject: Re: Unable to extract content from chunked portion of large file
Hi
In the first case it should be
http://localhost:9998/tika/form
Sergey
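To make that concrete: a multipart/form-data body with a named file part can be built with nothing but the standard library, roughly as below. The endpoint comes from Sergey's note; the helper name, part name, and file name are illustrative, and the actual request is left commented out since it needs a running tika-server:

```python
import uuid

def build_multipart(field_name, filename, payload):
    """Build a multipart/form-data body and its Content-Type header value."""
    boundary = uuid.uuid4().hex
    body = (
        b"--" + boundary.encode() + b"\r\n"
        b'Content-Disposition: form-data; name="' + field_name.encode()
        + b'"; filename="' + filename.encode() + b'"\r\n'
        b"Content-Type: application/octet-stream\r\n"
        b"\r\n"
        + payload + b"\r\n"
        b"--" + boundary.encode() + b"--\r\n"
    )
    return body, "multipart/form-data; boundary=" + boundary

body, content_type = build_multipart("file", "big.docx", b"...file bytes...")

# With a running tika-server this could then be sent, e.g.:
# import urllib.request
# req = urllib.request.Request("http://localhost:9998/tika/form", data=body,
#                              headers={"Content-Type": content_type},
#                              method="POST")
# text = urllib.request.urlopen(req).read().decode("utf-8")
```

In .NET, HttpClient's MultipartFormDataContent produces the same wire format; the essential part is that the file goes in as a form part with its own Content-Type, addressed to /tika/form rather than /tika.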
On 29/02/16 13:09, raghu vittal wrote:
> Hi Ken,
>
> These are my observations:
>
> Scenario 1
>
> Tika URL: http://localhost:9998/tika
>
> I have tried the multipart/form-data approach suggested by Sergey. I am
> getting the error below (we are using Tika server 1.11):
>
> var data = File.ReadAllBytes(filename);
> using (var client = new HttpClient())
> {
>     using (var content = new MultipartFormDataContent())
>     {
>         var byteArrayContent = new ByteArrayContent(data);
>         byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
>         content.Add(byteArrayContent);
>         var str = client.PutAsync(tikaServerUrl, content)
>                         .Result.Content.ReadAsStringAsync().Result;
>     }
> }
>
> *ERROR*:
>
> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")
>
> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@36b1a1ec
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         ................................................
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
>         at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:116)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         ... 32 more
>
> Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>
> I think the plain /tika endpoint does not accept multipart/form-data requests.
>
>
> *Passing a 240 MB file to Tika for content extraction gives me the
> errors below.*
>
> *Scenario 2*
>
> Tika URL: http://localhost:9998/unpack/all
>
> Rather than ReadAsStringAsync() I have used ReadAsStreamAsync() and
> captured the output stream into a ZipArchive.
>
>
> *ERROR:*
>
> Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class java.util.HashMap, ContentType: application/zip
> Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault
>         at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteException(JAXRSOutInterceptor.java:363)
>         ... 41 more
>
> Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
> Caused by: com.ctc.wstx.exc.WstxIOException: null
>         at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
>         ... 26 more
>
> *Scenario 3*
>
> Tika URL: http://localhost:9998/tika
>
> *ERROR:*
> Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika (autodetecting type)
> Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika: Text extraction failed
> org.apache.tika.exception.TikaException: Zip bomb detected!
>         at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         ... 31 more
>
> Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: Could not send Message.
>         at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
>         ... 31 more
>
> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>         at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>         ....
>
>
> I was able to extract the content from an 80 MB document.
>
> If I split the large file into chunks and pass them to Tika, it gives
> me exceptions.
>
> I am building the solution in .NET.
>
> Regards,
> Raghu.
>
>
> ------------------------------------------------------------------------
> *From:* Ken Krugler <[email protected]>
> *Sent:* Saturday, February 27, 2016 6:22 AM
> *To:* [email protected]
> *Subject:* RE: Unable to extract content from chunked portion of large file
> Hi Raghu,
>
> Previously you'd said
>
> "sending very large files to Tika will cause out of memory exception"
>
> and
>
> "sending that large file to Tika will causing timeout issues"
>
> I assume these are two different issues, as the second one seems related
> to how you're connecting to the Tika server via HTTP, correct?
>
> For out of memory issues, I'd suggested creating an input stream that
> can read from a chunked file *stored on disk*, thus alleviating at least
> part of the memory usage constraint. If the problem is that the
> resulting extracted text is also too big for memory, and you need to
> send it as a single document to Elasticsearch, then that's a separate
> (non-Tika) issue.
>
> For the timeout when sending the file to the Tika server, Sergey has
> already mentioned that you should be able to send it
> as multipart/form-data. And that will construct a temp file on disk from
> the chunks, and (I assume) stream it to Tika, so that also would take
> care of the same memory issue on the input side.
>
> Given the above, it seems like you've got enough ideas to try to solve
> this issue, yes?
>
> Regards,
>
> -- Ken
>
>> ------------------------------------------------------------------------
>>
>> *From:* raghu vittal
>>
>> *Sent:* February 24, 2016 10:50:29pm PST
>>
>> *To:* [email protected] <mailto:[email protected]>
>>
>> *Subject:* Re: Unable to extract content from chunked portion of large
>> file
>>
>>
>> Hi Ken,
>>
>> Thanks for the reply.
>> I understood your point.
>>
>> What I have tried:
>>
>> > byte[] srcBytes = File.ReadAllBytes(filePath);
>>
>> > get a chunk of 1 MB out of srcBytes
>>
>> > when I pass this 1 MB chunk to Tika it gives me the error.
>>
>> > As per the wiki, Tika needs the entire file to extract content.
>>
>> This is where I am stuck. I don't want to pass the entire file to Tika.
>>
>> Correct me if I am wrong.
>>
>> --Raghu.
>>
>> ------------------------------------------------------------------------
>> *From:*Ken Krugler <[email protected]
>> <mailto:[email protected]>>
>> *Sent:*Wednesday, February 24, 2016 9:07 PM
>> *To:*[email protected] <mailto:[email protected]>
>> *Subject:*RE: Unable to extract content from chunked portion of large
>> file
>> Hi Raghu,
>>
>> I don't think you understood what I was proposing.
>>
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). Then this service could implement an input
>> stream class that would read sequentially from these pieces. This
>> input stream would be passed to Tika, thus giving Tika a single
>> continuous stream of data to the entire file content.
>>
>> -- Ken
>>
>>> ------------------------------------------------------------------------
>>> *From:*raghu vittal
>>> *Sent:*February 24, 2016 4:32:01am PST
>>> *To:*[email protected] <mailto:[email protected]>
>>> *Subject:*Re: Unable to extract content from chunked portion of large
>>> file
>>>
>>> Thanks for your reply.
>>>
>>> In our application users can upload large files. Our intention is to
>>> extract the content out of large files and dump it into Elasticsearch
>>> for content-based search.
>>> We have .xlsx and .doc files larger than 300 MB. Sending such a large
>>> file to Tika causes timeout issues.
>>>
>>> I tried getting a chunk of the file and passing it to Tika. Tika gave
>>> me an invalid-data exception.
>>>
>>> I think for Tika we need to pass the entire file at once to extract
>>> the content.
>>>
>>> Raghu.
>>>
>>> ------------------------------------------------------------------------
>>> *From:*Ken Krugler <[email protected]
>>> <mailto:[email protected]>>
>>> *Sent:*Friday, February 19, 2016 8:22 PM
>>> *To:*[email protected] <mailto:[email protected]>
>>> *Subject:*RE: Unable to extract content from chunked portion of large
>>> file
>>> One option is to create your own RESTful API that lets you send
>>> chunks of the file, and then you can provide an input stream that
>>> provides the seamless data view of the chunks to Tika (which is what
>>> it needs).
>>>
>>> -- Ken
>>>
>>>> ------------------------------------------------------------------------
>>>> *From:*raghu vittal
>>>> *Sent:*February 19, 2016 1:37:49am PST
>>>> *To:*[email protected] <mailto:[email protected]>
>>>> *Subject:*Unable to extract content from chunked portion of large file
>>>>
>>>> Hi All,
>>>>
>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>> extract content and dump the data into Elasticsearch for full-text
>>>> search. Sending very large files to Tika causes an out-of-memory
>>>> exception.
>>>>
>>>> We want to chunk the file and send it to Tika for content extraction.
>>>> When we passed a chunked portion of a file to Tika it gave empty text.
>>>> I assume Tika relies on the file structure; that is why it is not
>>>> giving any content.
>>>>
>>>> We are using the Tika Server (REST API) in our .NET application.
>>>>
>>>> Please suggest a better approach for this scenario.
>>>>
>>>> Regards,
>>>> Raghu.
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/