+1, please just remove it from the wiki, since it clearly supports that per your research. Thanks, Sergey!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 7:44 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file

>Hi All
>
>If a large file is passed to a Tika server as a multipart/form payload,
>then CXF will be creating a temp file on the disk itself.
>
>Hmm... I was looking for a reference to it and I found the advice not to
>use multipart/form-data:
>https://wiki.apache.org/tika/TikaJAXRS (in Services)
>
>I believe it should be removed, see:
>http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>
>Example:
>
>@POST
>@Consumes("multipart/form-data")
>@Produces("text/plain")
>@Path("form")
>public StreamingOutput getTextFromMultipart(Attachment att, @Context final UriInfo info) {
>    return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
>}
>
>Cheers, Sergey
>
>On 24/02/16 15:37, Ken Krugler wrote:
>> Hi Raghu,
>>
>> I don't think you understood what I was proposing.
>>
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). Then this service could implement an input
>> stream class that would read sequentially from these pieces.
>> This input stream would be passed to Tika, thus giving Tika a single
>> continuous stream of data covering the entire file content.
>>
>> -- Ken
>>
>>> ------------------------------------------------------------------------
>>>
>>> *From:* raghu vittal
>>>
>>> *Sent:* February 24, 2016 4:32:01am PST
>>>
>>> *To:* [email protected] <mailto:[email protected]>
>>>
>>> *Subject:* Re: Unable to extract content from chunked portion of large file
>>>
>>> Thanks for your reply.
>>>
>>> In our application, users can upload large files. Our intention is to
>>> extract the content out of each large file and index it in Elasticsearch
>>> for content-based search. We have .xlsx and .doc files larger than 300 MB,
>>> and sending files that large to Tika causes timeout issues.
>>>
>>> I tried getting a chunk of a file and passing it to Tika, but Tika gave
>>> me an invalid-data exception.
>>>
>>> I think we need to pass Tika the entire file at once to extract content.
>>>
>>> Raghu.
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Ken Krugler <[email protected] <mailto:[email protected]>>
>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>> *To:* [email protected] <mailto:[email protected]>
>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>>
>>> One option is to create your own RESTful API that lets you send chunks
>>> of the file, and then you can provide an input stream that presents a
>>> seamless view of the chunks' data to Tika (which is what it needs).
>>>
>>> -- Ken
>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* raghu vittal
>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>> *To:* [email protected] <mailto:[email protected]>
>>>> *Subject:* Unable to extract content from chunked portion of large file
>>>>
>>>> Hi All,
>>>>
>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>> extract content and index the data in Elasticsearch for full-text search.
>>>> Sending very large files to Tika causes out-of-memory exceptions,
>>>> so we want to chunk each file and send the chunks to Tika for content
>>>> extraction. However, when we passed a chunked portion of a file to Tika,
>>>> it returned empty text. I assume Tika relies on the file structure, and
>>>> that is why it does not return any content.
>>>>
>>>> We are using the Tika server (REST API) from our .NET application.
>>>>
>>>> Please suggest a better approach for this scenario.
>>>>
>>>> Regards,
>>>> Raghu.
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>
>--
>Sergey Beryozkin
>
>Talend Community Coders
>http://coders.talend.com/
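[Editor's note: Ken's suggestion in the thread above (persist uploaded chunks to local disk, then stitch them back into one continuous stream before handing it to Tika) can be sketched with the JDK's `SequenceInputStream`. The class name, the `chunk-N` file naming, and the helper method below are illustrative assumptions, not part of Tika's or CXF's API; the resulting stream is what you would pass to a Tika parser or POST to tika-server.]

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: reassemble chunk files (uploaded in order) into one logical stream.
public class ChunkStitcher {

    // Returns a single InputStream that reads each chunk file sequentially,
    // giving the consumer (e.g. Tika) a continuous view of the whole file.
    public static InputStream streamOf(List<Path> chunks) throws IOException {
        List<InputStream> parts = new ArrayList<>();
        for (Path p : chunks) {
            parts.add(new BufferedInputStream(Files.newInputStream(p)));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws IOException {
        // Simulate three chunks of one logical file persisted to local disk.
        Path dir = Files.createTempDirectory("chunks");
        String[] pieces = {"Hello, ", "chunked ", "world!"};
        List<Path> chunks = new ArrayList<>();
        for (int i = 0; i < pieces.length; i++) {
            Path p = dir.resolve("chunk-" + i);
            Files.write(p, pieces[i].getBytes("UTF-8"));
            chunks.add(p);
        }

        // Read the stitched stream back; this is where Tika would consume it.
        try (InputStream in = streamOf(chunks)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            System.out.println(out.toString("UTF-8")); // prints "Hello, chunked world!"
        }
    }
}
```

Because formats like .docx and .xlsx are ZIP containers whose central directory sits at the end of the file, a parser genuinely needs the full byte sequence; stitching the chunks back together server-side, as above, sidesteps the empty-text and invalid-data errors reported in the thread.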
