Thanks for your reply. I actually started this thread to find a way to extract content from a chunked portion of a file.
Does Tika support extracting content from a file chunk?

Regards,
Raghu.
________________________________________
From: Sergey Beryozkin <[email protected]>
Sent: Monday, February 29, 2016 7:23 PM
To: [email protected]
Subject: Re: Unable to extract content from chunked portion of large file

Well, it is a different issue now; the server is processing a 250MB payload and throws an error:
org.apache.tika.exception.TikaException: Zip bomb detected!
So maybe you need to start a new thread...

Cheers, Sergey

On 29/02/16 13:49, raghu vittal wrote:
> It is working, thanks.
>
> I have tried sending a 250MB file using multipart/form-data, and it is giving an exception.
>
> ERROR:
> Feb 29, 2016 7:07:27 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika/form (autodetecting type)
> Feb 29, 2016 7:09:02 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika/form: Text extraction failed
> org.apache.tika.exception.TikaException: Zip bomb detected!
> at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
> ... 31 more
> Feb 29, 2016 7:09:02 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
> Feb 29, 2016 7:09:02 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
> org.apache.cxf.interceptor.Fault: Could not send Message.
> ... 24 more
> Caused by: java.io.IOException: An established connection was aborted by the software in your host machine
> ... 35 more
>
> I have also tried to get a chunk of the file data and pass it to Tika using multipart/form-data, and I get an exception.
>
> ERROR:
> Feb 29, 2016 7:02:43 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
> Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource logRequest
> INFO: tika/form (autodetecting type)
> Feb 29, 2016 7:04:30 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: tika/form: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@41530372
> Feb 29, 2016 7:04:30 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>
> We are stuck handling these scenarios. In our production we have documents of this size, and we need to handle them.
>
> Please help us.
>
> Regards,
> Raghu.
>
> ________________________________________
> From: Sergey Beryozkin <[email protected]>
> Sent: Monday, February 29, 2016 6:50 PM
> To: [email protected]
> Subject: Re: Unable to extract content from chunked portion of large file
>
> Hi
>
> In the first case it should be
>
> http://localhost:9998/tika/form
>
> Sergey
> On 29/02/16 13:09, raghu vittal wrote:
>> Hi Ken,
>>
>> These are my observations.
>>
>> Scenario 1
>>
>> Tika URL: http://localhost:9998/tika
>>
>> I have tried the multipart/form-data approach suggested by Sergey.
>> I am getting the below error (we are using the Tika 1.11 server):
>>
>> var data = File.ReadAllBytes(filename);
>> using (var client = new HttpClient())
>> {
>>     using (var content = new MultipartFormDataContent())
>>     {
>>         ByteArrayContent byteArrayContent = new ByteArrayContent(data);
>>         byteArrayContent.Headers.Add("Content-Type", "application/octet-stream");
>>         content.Add(byteArrayContent);
>>         var str = client.PutAsync(tikaServerUrl, content).Result.Content.ReadAsStringAsync().Result;
>>     }
>> }
>>
>> ERROR:
>>
>> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource logRequest
>> INFO: tika (multipart/form-data;boundary="03cc158f-3213-439f-a0be-3aba14c7036b")
>> Feb 29, 2016 5:26:01 PM org.apache.tika.server.resource.TikaResource parse
>> WARNING: tika: Text extraction failed
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@36b1a1ec
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>> ................................................
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
>> at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:116)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>> ... 32 more
>>
>> Feb 29, 2016 5:26:01 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>>
>> I think Tika does not support POST requests.
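As Sergey noted, multipart bodies belong on /tika/form; the plain /tika endpoint expects the raw document bytes, which is consistent with it answering a multipart request with HTTP 415. For reference, here is a sketch (Python, purely illustrative of the wire format; the part name "file" is an assumption, not something verified against the Tika 1.11 source) of what a well-formed multipart/form-data body looks like, including the Content-Disposition header that names the part:

```python
# Build a multipart/form-data body by hand to show the headers each part
# carries. The part name "file" is an assumed example, not taken from Tika.
import uuid

def build_multipart(field_name, filename, payload):
    boundary = uuid.uuid4().hex
    body = (
        b"--" + boundary.encode() + b"\r\n"
        + b'Content-Disposition: form-data; name="' + field_name.encode()
        + b'"; filename="' + filename.encode() + b'"\r\n'
        + b"Content-Type: application/octet-stream\r\n\r\n"
        + payload + b"\r\n"
        + b"--" + boundary.encode() + b"--\r\n"
    )
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, body

# The Content-Disposition line is how a server locates the part; a bare
# ByteArrayContent added without a name may not produce one.
content_type, body = build_multipart("file", "report.xlsx", b"PK\x03\x04")
```

A C# `MultipartFormDataContent.Add(content, "file", filename)` overload would produce the equivalent headers on the .NET side.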
>>
>> Passing a 240 MB file to Tika for content extraction gives me the errors below.
>>
>> Scenario 2
>>
>> Tika URL: http://localhost:9998/unpack/all
>>
>> Rather than ReadAsStringAsync() I have used ReadAsStreamAsync() and captured the output stream into a ZipArchive.
>>
>> ERROR:
>>
>> Feb 29, 2016 6:03:26 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class java.util.HashMap, ContentType: application/zip
>> Feb 29, 2016 6:03:26 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault
>> at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleWriteException(JAXRSOutInterceptor.java:363)
>> ... 41 more
>>
>> Feb 29, 2016 6:03:28 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>> at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>> Caused by: com.ctc.wstx.exc.WstxIOException: null
>> at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
>> at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
>> ... 26 more
>>
>> Scenario 3
>>
>> Tika URL: http://localhost:9998/tika
>>
>> ERROR:
>>
>> Feb 29, 2016 6:05:55 PM org.apache.tika.server.resource.TikaResource logRequest
>> INFO: tika (autodetecting type)
>> Feb 29, 2016 6:07:35 PM org.apache.tika.server.resource.TikaResource parse
>> WARNING: tika: Text extraction failed
>> org.apache.tika.exception.TikaException: Zip bomb detected!
>> at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
>> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>> ... 31 more
>>
>> Feb 29, 2016 6:07:35 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
>> SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
>> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault: Could not send Message.
>> at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
>> ... 31 more
>>
>> Feb 29, 2016 6:07:35 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
>> WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
>> org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
>> at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
>> at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>> ....
>>
>> I was able to extract the content from an 80 MB document.
>>
>> If I split the large file into chunks and pass them to Tika, it gives me exceptions.
>>
>> I am building the solution in .NET.
>>
>> Regards,
>> Raghu.
>>
>> ------------------------------------------------------------------------
>> *From:* Ken Krugler <[email protected]>
>> *Sent:* Saturday, February 27, 2016 6:22 AM
>> *To:* [email protected]
>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>
>> Hi Raghu,
>>
>> Previously you'd said
>>
>> "sending very large files to Tika will cause out of memory exception"
>>
>> and
>>
>> "sending that large file to Tika will causing timeout issues"
>>
>> I assume these are two different issues, as the second one seems related to how you're connecting to the Tika server via HTTP, correct?
>>
>> For out of memory issues, I'd suggested creating an input stream that can read from a chunked file *stored on disk*, thus alleviating at least part of the memory usage constraint. If the problem is that the resulting extracted text is also too big for memory, and you need to send it as a single document to Elasticsearch, then that's a separate (non-Tika) issue.
>>
>> For the timeout when sending the file to the Tika server, Sergey has already mentioned that you should be able to send it as multipart/form-data. And that will construct a temp file on disk from the chunks, and (I assume) stream it to Tika, so that also would take care of the same memory issue on the input side.
>>
>> Given the above, it seems like you've got enough ideas to try to solve this issue, yes?
>>
>> Regards,
>>
>> -- Ken
>>
>>> ------------------------------------------------------------------------
>>> *From:* raghu vittal
>>> *Sent:* February 24, 2016 10:50:29pm PST
>>> *To:* [email protected]
>>> *Subject:* Re: Unable to extract content from chunked portion of large file
>>>
>>> Hi Ken,
>>>
>>> Thanks for the reply.
>>> I understood your point.
>>>
>>> What I have tried:
>>>
>>>> byte[] srcBytes = File.ReadAllBytes(filePath);
>>>> Get a chunk of 1 MB out of srcBytes.
>>>> When I pass this 1 MB chunk to Tika, it gives me the error.
>>>> As per the wiki, Tika needs the entire file to extract content.
>>>
>>> This is where I am stuck. I don't want to pass the entire file to Tika.
>>>
>>> Correct me if I am wrong.
>>>
>>> --Raghu.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Ken Krugler <[email protected]>
>>> *Sent:* Wednesday, February 24, 2016 9:07 PM
>>> *To:* [email protected]
>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>>
>>> Hi Raghu,
>>>
>>> I don't think you understood what I was proposing.
>>>
>>> I suggested creating a service that could receive chunks of the file (persisted to local disk). Then this service could implement an input stream class that would read sequentially from these pieces. This input stream would be passed to Tika, thus giving Tika a single continuous stream of data to the entire file content.
>>>
>>> -- Ken
>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* raghu vittal
>>>> *Sent:* February 24, 2016 4:32:01am PST
>>>> *To:* [email protected]
>>>> *Subject:* Re: Unable to extract content from chunked portion of large file
>>>>
>>>> Thanks for your reply.
>>>>
>>>> In our application users can upload large files. Our intention is to extract the content out of large files and dump it in Elasticsearch for content-based search.
>>>> We have > 300 MB .xlsx and .doc files. Sending such a large file to Tika causes timeout issues.
>>>>
>>>> I tried getting a chunk of the file and passing it to Tika. Tika gave me an invalid data exception.
>>>>
>>>> I think for Tika we need to pass the entire file at once to extract content.
>>>>
>>>> Raghu.
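Ken's proposal above — persist the uploaded chunks to disk, then expose them to the parser as one continuous stream — can be sketched as follows. This is an illustrative Python sketch (the thread's code is C#); the chunk file names and layout are assumptions, not from the thread:

```python
# Sketch of a stream that reads sequentially across an ordered list of
# chunk files, presenting them as one continuous byte stream.
import io
import os
import tempfile

class ChunkedFileStream(io.RawIOBase):
    """Read sequentially across an ordered list of chunk files."""

    def __init__(self, chunk_paths):
        super().__init__()
        self._paths = list(chunk_paths)
        self._index = 0
        self._current = open(self._paths[0], "rb") if self._paths else None

    def readable(self):
        return True

    def readinto(self, b):
        while self._current is not None:
            n = self._current.readinto(b)
            if n:
                return n
            # Current chunk exhausted: close it and move to the next one.
            self._current.close()
            self._index += 1
            if self._index < len(self._paths):
                self._current = open(self._paths[self._index], "rb")
            else:
                self._current = None
        return 0  # EOF after the last chunk

# Usage: write three chunk files, then read them back as one stream.
tmp = tempfile.mkdtemp()
paths = []
for i, part in enumerate([b"PK\x03\x04", b"-middle-", b"-end"]):
    p = os.path.join(tmp, "chunk-%03d" % i)
    with open(p, "wb") as f:
        f.write(part)
    paths.append(p)

data = io.BufferedReader(ChunkedFileStream(paths)).read()
```

The same idea in .NET would be a custom `Stream` whose `Read` advances through the chunk files; the parser then sees the complete document without the whole file ever being held in memory.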
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Ken Krugler <[email protected]>
>>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>>> *To:* [email protected]
>>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>>>
>>>> One option is to create your own RESTful API that lets you send chunks of the file, and then you can provide an input stream that provides the seamless data view of the chunks to Tika (which is what it needs).
>>>>
>>>> -- Ken
>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* raghu vittal
>>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>>> *To:* [email protected]
>>>>> *Subject:* Unable to extract content from chunked portion of large file
>>>>>
>>>>> Hi All,
>>>>>
>>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to extract content and dump the data in Elasticsearch for full-text search. Sending very large files to Tika causes an out of memory exception.
>>>>>
>>>>> We want to chunk the file and send it to Tika for content extraction. When we passed a chunked portion of a file to Tika, it gave empty text.
>>>>> I assume Tika relies on the file structure; that is why it is not giving any content.
>>>>>
>>>>> We are using the Tika Server (REST API) in our .NET application.
>>>>>
>>>>> Please suggest a better approach for this scenario.
>>>>>
>>>>> Regards,
>>>>> Raghu.
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>
> --
> Sergey Beryozkin
>
> Talend Community Coders
> http://coders.talend.com/

--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
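To make the "Tika needs the whole file" point concrete: .xlsx and .docx are ZIP containers, and a ZIP cannot be opened from a leading chunk because its central directory lives at the end of the file. A quick illustration (Python stdlib here, but the same applies on the .NET side):

```python
# Why a raw 1 MB slice of a .xlsx fails: OOXML files are ZIP containers,
# and ZIP readers need the central directory stored at the END of the
# file, so a leading chunk is structurally incomplete.
import io
import zipfile

# Build a tiny ZIP in memory as a stand-in for a .xlsx.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("sheet1.xml", "<data>hello</data>" * 100)
whole = buf.getvalue()

# The complete file parses fine...
assert zipfile.ZipFile(io.BytesIO(whole)).namelist() == ["sheet1.xml"]

# ...but the first "chunk" alone does not: the end-of-central-directory
# record was cut off, so the reader rejects it outright.
chunk = whole[: len(whole) // 2]
try:
    zipfile.ZipFile(io.BytesIO(chunk))
    broken = False
except zipfile.BadZipFile:
    broken = True
assert broken
```

This is why any chunking scheme has to reassemble the complete byte sequence (on disk or as a concatenating stream) before handing it to the parser.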
