thanks mucho my friend

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 8:58 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file

>Hi Chris
>
>Sure, I've opened
>https://issues.apache.org/jira/browse/TIKA-1871
>
>and assigned it to myself; I will add some info about multipart/form-data
>asap.
>
>Cheers, Sergey
>
>
>On 24/02/16 16:40, Mattmann, Chris A (3980) wrote:
>> +1, please just remove it from the wiki, since your research clearly
>> supports that. Thanks Sergey!
>>
>>
>> -----Original Message-----
>> From: Sergey Beryozkin <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, February 24, 2016 at 7:44 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Unable to extract content from chunked portion of large
>> file
>>
>>> Hi All
>>>
>>> If a large file is passed to a Tika server as a multipart/form payload,
>>> then CXF will create a temp file on the disk itself.
>>>
>>> Hmm...
>>> I was looking for a reference to it and I found the advice not to
>>> use multipart/form-data:
>>> https://wiki.apache.org/tika/TikaJAXRS (in Services)
>>>
>>> I believe it should be removed, see:
>>>
>>> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java,
>>> example:
>>>
>>> @POST
>>> @Consumes("multipart/form-data")
>>> @Produces("text/plain")
>>> @Path("form")
>>> public StreamingOutput getTextFromMultipart(Attachment att,
>>>         @Context final UriInfo info) {
>>>     return produceText(att.getObject(InputStream.class),
>>>             att.getHeaders(), info);
>>> }
>>>
>>>
>>> Cheers, Sergey
>>>
>>>
>>>
>>> On 24/02/16 15:37, Ken Krugler wrote:
>>>> Hi Raghu,
>>>>
>>>> I don't think you understood what I was proposing.
>>>>
>>>> I suggested creating a service that could receive chunks of the file
>>>> (persisted to local disk). Then this service could implement an input
>>>> stream class that would read sequentially from these pieces. This input
>>>> stream would be passed to Tika, thus giving Tika a single continuous
>>>> stream of data to the entire file content.
>>>>
>>>> -- Ken
>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:* raghu vittal
>>>>>
>>>>> *Sent:* February 24, 2016 4:32:01am PST
>>>>>
>>>>> *To:* [email protected] <mailto:[email protected]>
>>>>>
>>>>> *Subject:* Re: Unable to extract content from chunked portion of large
>>>>> file
>>>>>
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> In our application, users can upload large files. Our intention is to
>>>>> extract the content out of a large file and dump it in Elasticsearch
>>>>> for content-based search. We have > 300 MB .xlsx and .doc files, and
>>>>> sending such a large file to Tika causes timeout issues.
>>>>>
>>>>> I tried getting a chunk of the file and passing it to Tika. Tika gave
>>>>> me an invalid data exception.
>>>>>
>>>>> I think for Tika we need to pass the entire file at once to extract
>>>>> the content.
>>>>>
>>>>> Raghu.
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Ken Krugler <[email protected]
>>>>> <mailto:[email protected]>>
>>>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>>>> *To:* [email protected] <mailto:[email protected]>
>>>>> *Subject:* RE: Unable to extract content from chunked portion of large
>>>>> file
>>>>>
>>>>> One option is to create your own RESTful API that lets you send chunks
>>>>> of the file, and then you can provide an input stream that presents
>>>>> the seamless data view of the chunks to Tika (which is what it needs).
>>>>>
>>>>> -- Ken
>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* raghu vittal
>>>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>>>> *To:* [email protected] <mailto:[email protected]>
>>>>>> *Subject:* Unable to extract content from chunked portion of large
>>>>>> file
>>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>>>> extract content and dump the data in Elasticsearch for full-text
>>>>>> search. Sending very large files to Tika causes an out-of-memory
>>>>>> exception.
>>>>>>
>>>>>> We want to chunk the file and send it to Tika for content extraction.
>>>>>> When we passed a chunked portion of a file to Tika, it returned empty
>>>>>> text. I assume Tika relies on the file structure; that is why it is
>>>>>> not returning any content.
>>>>>>
>>>>>> We are using the Tika server (REST API) from our .NET application.
>>>>>>
>>>>>> Please suggest a better approach for this scenario.
>>>>>>
>>>>>> Regards,
>>>>>> Raghu.
>>>>
>>>> --------------------------
>>>> Ken Krugler
>>>> +1 530-210-6378
>>>> http://www.scaleunlimited.com
>>>> custom big data solutions & training
>>>> Hadoop, Cascading, Cassandra & Solr
>>>
>>>
>>> --
>>> Sergey Beryozkin
>>>
>>> Talend Community Coders
>>> http://coders.talend.com/
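[Editor's note: Ken's proposal in the thread above — persist each uploaded chunk to local disk, then hand Tika one continuous stream over the stored pieces — maps naturally onto `java.io.SequenceInputStream` from the JDK. Below is a minimal sketch under that assumption; the class and method names (`ChunkedUploadStore`, `saveChunk`, `openAsSingleStream`) are illustrative only and are not part of any Tika or CXF API.]

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Vector;

/**
 * Sketch of the chunk-receiving service Ken describes: each uploaded
 * chunk is persisted as its own file, and the stored pieces are later
 * exposed to Tika as one continuous InputStream.
 */
public class ChunkedUploadStore {

    /** Persists one uploaded chunk as its own file under the upload's directory. */
    public static Path saveChunk(Path uploadDir, int index, byte[] chunk) throws IOException {
        Path part = uploadDir.resolve(String.format("part-%05d", index));
        return Files.write(part, chunk,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }

    /**
     * Concatenates the chunk files, in order, into a single stream.
     * SequenceInputStream reads each piece to EOF and then moves on to
     * the next, so the consumer sees one uninterrupted byte stream of
     * the entire original file.
     */
    public static InputStream openAsSingleStream(List<Path> chunks) throws IOException {
        Vector<InputStream> streams = new Vector<>();
        for (Path chunk : chunks) {
            streams.add(Files.newInputStream(chunk));
        }
        return new SequenceInputStream(streams.elements());
    }
}
```

The stream returned by `openAsSingleStream` can then be handed to Tika exactly as a whole-file stream would be; the chunking stays invisible to the parser, which is why this avoids the empty-text and invalid-data results Raghu saw when parsing a chunk in isolation.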
