yayyy!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 9:04 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file

>Time to start contributing to Tika again :-)
>
>Cheers, Sergey
>
>On 24/02/16 17:01, Mattmann, Chris A (3980) wrote:
>> thanks mucho my friend
>>
>> -----Original Message-----
>> From: Sergey Beryozkin <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, February 24, 2016 at 8:58 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Unable to extract content from chunked portion of large file
>>
>>> Hi Chris
>>>
>>> Sure, I've opened
>>> https://issues.apache.org/jira/browse/TIKA-1871
>>> and assigned it to myself; I will add some info about multipart/form-data
>>> ASAP.
>>>
>>> Cheers, Sergey
>>>
>>> On 24/02/16 16:40, Mattmann, Chris A (3980) wrote:
>>>> +1, please just remove it from the wiki, since per your research the
>>>> server clearly supports it. Thanks Sergey!
>>>>
>>>> -----Original Message-----
>>>> From: Sergey Beryozkin <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Wednesday, February 24, 2016 at 7:44 AM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: Unable to extract content from chunked portion of large
>>>> file
>>>>
>>>>> Hi All
>>>>>
>>>>> If a large file is passed to a Tika server as a multipart/form-data
>>>>> payload, then CXF will itself create a temp file on disk.
>>>>>
>>>>> Hmm... I was looking for a reference to it and instead found the
>>>>> advice not to use multipart/form-data:
>>>>> https://wiki.apache.org/tika/TikaJAXRS (in Services)
>>>>>
>>>>> I believe that advice should be removed; the server does support it, see:
>>>>> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>>>>> for example:
>>>>>
>>>>>     @POST
>>>>>     @Consumes("multipart/form-data")
>>>>>     @Produces("text/plain")
>>>>>     @Path("form")
>>>>>     public StreamingOutput getTextFromMultipart(Attachment att,
>>>>>             @Context final UriInfo info) {
>>>>>         return produceText(att.getObject(InputStream.class),
>>>>>                 att.getHeaders(), info);
>>>>>     }
>>>>>
>>>>> Cheers, Sergey
>>>>>
>>>>> On 24/02/16 15:37, Ken Krugler wrote:
>>>>>> Hi Raghu,
>>>>>>
>>>>>> I don't think you understood what I was proposing.
>>>>>>
>>>>>> I suggested creating a service that could receive chunks of the file
>>>>>> (persisted to local disk). This service could then implement an
>>>>>> input stream class that reads sequentially from these pieces. That
>>>>>> input stream would be passed to Tika, thus giving Tika a single
>>>>>> continuous stream of data over the entire file content.
>>>>>>
>>>>>> -- Ken
>>>>>>
>>>>>>> ----------------------------------------------------------------------
>>>>>>> *From:* raghu vittal
>>>>>>> *Sent:* February 24, 2016 4:32:01am PST
>>>>>>> *To:* [email protected]
>>>>>>> *Subject:* Re: Unable to extract content from chunked portion of
>>>>>>> large file
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> In our application users can upload large files. Our intention is to
>>>>>>> extract the content of each large file and dump it into Elasticsearch
>>>>>>> for content-based search. We have .xlsx and .doc files larger than
>>>>>>> 300 MB, and sending files that large to Tika causes timeout issues.
>>>>>>>
>>>>>>> I tried taking a chunk of a file and passing it to Tika; Tika gave
>>>>>>> me an invalid data exception.
>>>>>>>
>>>>>>> I think for Tika we need to pass the entire file at once to extract
>>>>>>> content.
>>>>>>>
>>>>>>> Raghu.
>>>>>>>
>>>>>>> ----------------------------------------------------------------------
>>>>>>> *From:* Ken Krugler <[email protected]>
>>>>>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>>>>>> *To:* [email protected]
>>>>>>> *Subject:* RE: Unable to extract content from chunked portion of
>>>>>>> large file
>>>>>>>
>>>>>>> One option is to create your own RESTful API that lets you send
>>>>>>> chunks of the file; you can then provide an input stream that gives
>>>>>>> Tika a seamless view of the data across the chunks (which is what it
>>>>>>> needs).
>>>>>>>
>>>>>>> -- Ken
>>>>>>>
>>>>>>>> ----------------------------------------------------------------------
>>>>>>>> *From:* raghu vittal
>>>>>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>>>>>> *To:* [email protected]
>>>>>>>> *Subject:* Unable to extract content from chunked portion of large
>>>>>>>> file
>>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika
>>>>>>>> to extract content and dump the data into Elasticsearch for
>>>>>>>> full-text search. Sending very large files to Tika causes an
>>>>>>>> out-of-memory exception.
>>>>>>>>
>>>>>>>> We want to chunk each file and send it to Tika for content
>>>>>>>> extraction, but when we passed a chunked portion of a file to Tika
>>>>>>>> it returned empty text. I assume Tika relies on the file structure,
>>>>>>>> which is why it is not returning any content.
>>>>>>>>
>>>>>>>> We are using the Tika Server (REST API) from our .NET application.
>>>>>>>>
>>>>>>>> Please suggest a better approach for this scenario.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Raghu.
>>>>>>
>>>>>> --------------------------
>>>>>> Ken Krugler
>>>>>> +1 530-210-6378
>>>>>> http://www.scaleunlimited.com
>>>>>> custom big data solutions & training
>>>>>> Hadoop, Cascading, Cassandra & Solr
>>>>>
>>>>> --
>>>>> Sergey Beryozkin
>>>>>
>>>>> Talend Community Coders
>>>>> http://coders.talend.com/
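Ken's suggestion above (a service that persists incoming chunks to local disk and then exposes them to Tika as one continuous stream) can be sketched with `java.io.SequenceInputStream`. This is a minimal sketch, not code from tika-server; the class name `ChunkedFileStream` and the assumption that the chunk files already sit on disk in upload order are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Vector;

/**
 * Sketch of the chunk-stitching idea from the thread: each uploaded chunk
 * is persisted to local disk, and a single InputStream reads the chunk
 * files back-to-back so the consumer (e.g. Tika) sees one continuous
 * stream of the original file's bytes.
 */
public class ChunkedFileStream {

    /** Open the chunk files in upload order and expose them as one stream. */
    public static InputStream open(List<Path> chunksInOrder) throws IOException {
        Vector<InputStream> parts = new Vector<>();
        for (Path chunk : chunksInOrder) {
            parts.add(Files.newInputStream(chunk));
        }
        // SequenceInputStream reads each part to EOF, then moves to the next,
        // closing each underlying stream as it is exhausted.
        return new SequenceInputStream(parts.elements());
    }
}
```

The stream returned by `open(...)` could then be handed to whatever parses the file, e.g. a Tika `Parser#parse` call or the body of a request to the Tika server, without ever loading the whole file into memory.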
