Re: Unable to extract content from chunked portion of large file

raghu vittal Mon, 29 Feb 2016 05:10:21 -0800

Hi Ken,


These are my observations.


Scenario 1


Tika URL: http://localhost:9998/tika


I tried the multipart/form-data approach suggested by Sergey and am getting the 
error below (we are using Tika Server 1.11).

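// Needs System.IO and System.Net.Http; filename and tikaServerUrl are defined elsewhere.
var data = File.ReadAllBytes(filename); // reads the whole file into memory
using (var client = new HttpClient())
{
    using (var content = new MultipartFormDataContent())
    {
        ByteArrayContent byteArrayContent = new ByteArrayContent(data);
        byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
        content.Add(byteArrayContent);
        // Sends the multipart body with HTTP PUT and blocks until the response text arrives.
        var str = client.PutAsync(tikaServerUrl, content).Result.Content.ReadAsStringAsync().Result;
    }
}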


ERROR:

Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")

Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@36b1a1ec
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    ................................................
        at java.lang.Thread.run(Thread.java:745)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
        at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:116)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 32 more

Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain

I think Tika does not support POST requests.



Passing a 240 MB file to Tika for content extraction gives me these errors.

Scenario 2


Tika URL: http://localhost:9998/unpack/all


Rather than ReadAsStringAsync(), I used ReadAsStreamAsync() and captured the 
output stream in a "ZipArchive".


ERROR:

Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class java.util.HashMap, ContentType: application/zip
Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault
        at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteException(JAXRSOutInterceptor.java:363)

        ... 41 more

Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
        at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)

Caused by: com.ctc.wstx.exc.WstxIOException: null
        at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
        at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
        ... 26 more


Scenario 3

Tika URL: http://localhost:9998/tika


ERROR:
Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: tika (autodetecting type)
Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        ... 31 more

Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
        at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
        ... 31 more

Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
        at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
....


I was able to extract the content from an 80 MB document.

If I split the large file into chunks and pass them to Tika, it gives me 
exceptions.

I am building the solution in .NET.

Regards,
Raghu.


________________________________
From: Ken Krugler <[email protected]>
Sent: Saturday, February 27, 2016 6:22 AM
To: [email protected]
Subject: RE: Unable to extract content from chunked portion of large file

Hi Raghu,

Previously you'd said

"sending very large files to Tika will cause out of memory exception"

and

"sending that large file to Tika will causing timeout issues"

I assume these are two different issues, as the second one seems related to how 
you're connecting to the Tika server via HTTP, correct?

For out of memory issues, I'd suggested creating an input stream that can read 
from a chunked file *stored on disk*, thus alleviating at least part of the 
memory usage constraint. If the problem is that the resulting extracted text is 
also too big for memory, and you need to send it as a single document to 
Elasticsearch, then that's a separate (non-Tika) issue.

For the timeout when sending the file to the Tika server, Sergey has already 
mentioned that you should be able to send it as multipart/form-data. And that 
will construct a temp file on disk from the chunks, and (I assume) stream it to 
Tika, so that also would take care of the same memory issue on the input side.
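
A related way to keep the sender from holding the whole file in memory is to stream the request body straight from disk. The sketch below shows that idea in Java with JDK classes only (Java 9 or later). It is a plain PUT with chunked transfer encoding, not the multipart/form-data approach mentioned above. The class name StreamingPut, the method putFile, the 64 KB chunk size, the content type and the timeout values are all illustrative assumptions, and whether a particular Tika server version accepts a chunked request body should be checked against your own setup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: streams a file to an HTTP endpoint without buffering it in memory.
final class StreamingPut {

    /**
     * PUTs the bytes of source to endpoint and copies the response body to target.
     * Returns the number of response bytes written.
     */
    static long putFile(URI endpoint, Path source, Path target, int readTimeoutMillis)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.toURL().openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            // Send the body in chunks as it is read, instead of buffering all of it first.
            conn.setChunkedStreamingMode(64 * 1024);
            // Assumption: a generic content type, leaving type detection to the server.
            conn.setRequestProperty("Content-Type", "application/octet-stream");
            conn.setRequestProperty("Accept", "text/plain");
            conn.setConnectTimeout(10_000);
            // Large documents can take minutes to parse, so the caller picks this value.
            conn.setReadTimeout(readTimeoutMillis);

            try (InputStream in = Files.newInputStream(source);
                 OutputStream out = conn.getOutputStream()) {
                in.transferTo(out);
            }

            int status = conn.getResponseCode();
            if (status / 100 != 2) {
                throw new IOException("Server returned HTTP " + status);
            }
            // Write the response to disk too, so large extracted text is not held in memory.
            try (InputStream body = conn.getInputStream()) {
                return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.out.println("usage: StreamingPut <endpoint-url> <source-file> <target-file>");
            return;
        }
        long written = putFile(URI.create(args[0]), Paths.get(args[1]), Paths.get(args[2]),
                10 * 60 * 1000);
        System.out.println("wrote " + written + " bytes of response to " + args[2]);
    }
}
```

With the URL from this thread, a call would look like putFile(URI.create("http://localhost:9998/tika"), source, target, timeout). The same pattern should carry over to other HTTP clients that can send a request body from a stream rather than from a byte array.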

Given the above, it seems like you've got enough ideas to try to solve this 
issue, yes?

Regards,

-- Ken

________________________________

From: raghu vittal

Sent: February 24, 2016 10:50:29pm PST

To: [email protected]<mailto:[email protected]>

Subject: Re: Unable to extract content from chunked portion of large file

Hi Ken,

Thanks for the reply.
I understood your point.

This is what I have tried:

>  byte[] srcBytes = File.ReadAllBytes(filePath);

> Get a 1 MB chunk out of srcBytes.


> When I pass this 1 MB chunk to Tika, it gives me the error.

> According to the wiki, Tika needs the entire file to extract content.

This is where I am stuck. I don't want to pass the entire file to Tika.

Correct me if I am wrong.

--Raghu.

________________________________
From: Ken Krugler 
<[email protected]<mailto:[email protected]>>
Sent: Wednesday, February 24, 2016 9:07 PM
To: [email protected]<mailto:[email protected]>
Subject: RE: Unable to extract content from chunked portion of large file

Hi Raghu,

I don't think you understood what I was proposing.

I suggested creating a service that could receive chunks of the file (persisted 
to local disk). Then this service could implement an input stream class that 
would read sequentially from these pieces. This input stream would be passed to 
Tika, thus giving Tika a single continuous stream of data to the entire file 
content.
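
A minimal sketch of such an input stream, in Java, assuming the chunks have already been written to disk and are listed in their original order. It uses the JDK's SequenceInputStream to present the pieces as one continuous stream. The class name ChunkedFileStream and its open method are hypothetical, and how the chunks reach the disk (the receiving service) is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: exposes ordered chunk files on disk as a single InputStream.
final class ChunkedFileStream {

    /** Returns one stream that reads the given chunk files back to back, in list order. */
    static InputStream open(List<Path> chunksInOrder) {
        Iterator<Path> remaining = chunksInOrder.iterator();
        Enumeration<InputStream> streams = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return remaining.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    // Each chunk is opened only when the previous one has been fully read.
                    return Files.newInputStream(remaining.next());
                } catch (IOException e) {
                    // Enumeration cannot throw checked exceptions, so wrap it.
                    throw new UncheckedIOException(e);
                }
            }
        };
        // SequenceInputStream closes each chunk stream once it reaches its end.
        return new SequenceInputStream(streams);
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = new ArrayList<>();
        for (String arg : args) {
            chunks.add(Paths.get(arg));
        }
        long total = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = open(chunks)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        System.out.println("read " + total + " bytes from " + chunks.size() + " chunk(s)");
    }
}
```

Only one chunk file is open at a time, so memory use stays at the size of the read buffer regardless of the total file size. The returned stream can be handed to any consumer that expects a plain InputStream.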

-- Ken

________________________________
From: raghu vittal
Sent: February 24, 2016 4:32:01am PST
To: [email protected]<mailto:[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file

Thanks for your reply.

In our application, users can upload large files. Our intention is to extract the 
content from large files and dump it into Elasticsearch for content-based search.
We have .xlsx and .doc files larger than 300 MB. Sending files that large to Tika 
causes timeout issues.

I tried taking a chunk of the file and passing it to Tika. Tika gave me an invalid 
data exception.

I think we need to pass the entire file to Tika at once to extract content.

Raghu.

________________________________
From: Ken Krugler 
<[email protected]<mailto:[email protected]>>
Sent: Friday, February 19, 2016 8:22 PM
To: [email protected]<mailto:[email protected]>
Subject: RE: Unable to extract content from chunked portion of large file

One option is to create your own RESTful API that lets you send chunks of the 
file, and then you can provide an input stream that provides the seamless data 
view of the chunks to Tika (which is what it needs).

-- Ken

________________________________
From: raghu vittal
Sent: February 19, 2016 1:37:49am PST
To: [email protected]<mailto:[email protected]>
Subject: Unable to extract content from chunked portion of large file


Hi all,

We have very large PDF, .docx and .xlsx files. We are using Tika to extract content and 
dump the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content extraction. When we 
passed a chunked portion of a file to Tika, it returned empty text.
I assume Tika relies on the file structure, which is why it is not returning any content.
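
One way to see why a fragment yields nothing: .docx and .xlsx files are ZIP containers, and a ZIP reader locates the entries through a directory stored at the end of the file. The Java sketch below does not call Tika. It builds a small ZIP as a stand-in for an Office document, cuts off the second half, and checks whether the JDK's ZipFile can still open each version. The class and method names (TruncatedZipDemo, writeSampleZip, firstHalf, canOpen) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical demo: a ZIP-based document cannot be opened from a leading fragment.
final class TruncatedZipDemo {

    /** Writes a small three-entry ZIP to a temp file, standing in for a .docx or .xlsx. */
    static Path writeSampleZip() throws IOException {
        Path zip = Files.createTempFile("sample", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (int i = 0; i < 3; i++) {
                out.putNextEntry(new ZipEntry("part" + i + ".xml"));
                out.write(("<data>" + i + "</data>").getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return zip;
    }

    /** Copies the first half of the file's bytes to a new temp file (a "chunk"). */
    static Path firstHalf(Path file) throws IOException {
        byte[] all = Files.readAllBytes(file);
        Path chunk = Files.createTempFile("chunk", ".zip");
        Files.write(chunk, Arrays.copyOf(all, all.length / 2));
        return chunk;
    }

    /** Returns true if ZipFile can open the file and finds at least one entry. */
    static boolean canOpen(Path file) throws IOException {
        try (ZipFile zipFile = new ZipFile(file.toFile())) {
            return zipFile.size() > 0;
        } catch (ZipException e) {
            // Thrown when the end-of-file directory is missing or damaged.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = writeSampleZip();
        System.out.println("full file opens:  " + canOpen(zip));
        System.out.println("first half opens: " + canOpen(firstHalf(zip)));
    }
}
```

The complete file opens, while the leading fragment is rejected because its directory is gone. Other formats depend on their overall structure in their own ways, so the same caution applies when cutting them at arbitrary byte offsets.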

We are using Tika Server (REST API) in our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
