Re: Unable to extract content from chunked portion of large file

Mattmann, Chris A (3980) Wed, 24 Feb 2016 09:06:19 -0800

yayyy!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 9:04 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file

>Time to start contributing to Tika again :-)
>
>Cheers, Sergey
>On 24/02/16 17:01, Mattmann, Chris A (3980) wrote:
>> thanks mucho my friend
>>
>> -----Original Message-----
>> From: Sergey Beryozkin <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, February 24, 2016 at 8:58 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Unable to extract content from chunked portion of large file
>>
>>> Hi Chris
>>>
>>> Sure, I've opened
>>> https://issues.apache.org/jira/browse/TIKA-1871
>>>
>>> and assigned to myself, will add some info about multipart/form-data
>>> asap
>>>
>>> Cheers, Sergey
>>>
>>>
>>> On 24/02/16 16:40, Mattmann, Chris A (3980) wrote:
>>>> +1, please just remove it from the wiki, since it clearly supports
>>>> that per your research. Thanks, Sergey!
>>>>
>>>> -----Original Message-----
>>>> From: Sergey Beryozkin <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Wednesday, February 24, 2016 at 7:44 AM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: Unable to extract content from chunked portion of large file
>>>>
>>>>> Hi All
>>>>>
>>>>> If a large file is passed to a Tika server as a multipart/form-data
>>>>> payload, then CXF will itself create a temp file on disk.
>>>>>
>>>>> Hmm... I was looking for a reference to it and I found the advice not
>>>>> to use multipart/form-data:
>>>>> https://wiki.apache.org/tika/TikaJAXRS (in Services)
>>>>>
>>>>> I believe it should be removed, see:
>>>>>
>>>>> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java,
>>>>> example:
>>>>>
>>>>> @POST
>>>>> @Consumes("multipart/form-data")
>>>>> @Produces("text/plain")
>>>>> @Path("form")
>>>>> public StreamingOutput getTextFromMultipart(Attachment att,
>>>>>         @Context final UriInfo info) {
>>>>>     return produceText(att.getObject(InputStream.class),
>>>>>             att.getHeaders(), info);
>>>>> }
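
For readers who want to call an endpoint like this one, here is a minimal client-side sketch in Java of the multipart/form-data framing (RFC 7578) that such a request body needs. The class name MultipartFormWriter, its method names, and the fixed application/octet-stream part type are hypothetical choices for illustration and are not part of Tika or CXF; only JDK classes are used. The sketch does not escape quotes in the field or file name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Tika or CXF class): writes a single-file
// multipart/form-data body to any OutputStream without buffering the file.
final class MultipartFormWriter {
    private MultipartFormWriter() {}

    /** Value for the request's Content-Type header for the given boundary. */
    static String contentType(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    /**
     * Streams one file part followed by the closing boundary.
     * The caller must pick a boundary that does not occur in the data.
     */
    static void writeFilePart(OutputStream out, String boundary, String fieldName,
                              String fileName, InputStream data) throws IOException {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n"
                + "\r\n";
        out.write(head.getBytes(StandardCharsets.UTF_8));
        // Copies the file content through in small buffers (Java 9+).
        data.transferTo(out);
        // Closing delimiter: CRLF, two dashes, boundary, two dashes, CRLF.
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        out.flush();
    }
}
```

One possible way to use it, assuming a Tika server on port 9998 and assuming the resource is mounted under /tika (the class-level path is not shown in the excerpt): open an HttpURLConnection to http://localhost:9998/tika/form, call setDoOutput(true), setRequestMethod("POST") and setChunkedStreamingMode(0) so the client does not buffer the body in memory, set the Content-Type header to contentType(boundary), and pass getOutputStream() to writeFilePart.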
>>>>>
>>>>>
>>>>> Cheers, Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 24/02/16 15:37, Ken Krugler wrote:
>>>>>> Hi Raghu,
>>>>>>
>>>>>> I don't think you understood what I was proposing.
>>>>>>
>>>>>> I suggested creating a service that could receive chunks of the file
>>>>>> (persisted to local disk). Then this service could implement an
>>>>>> input stream class that would read sequentially from these pieces.
>>>>>> This input stream would be passed to Tika, thus giving Tika a single
>>>>>> continuous stream of data covering the entire file content.
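
The input stream described here can be sketched with the JDK's SequenceInputStream, which reads a series of streams back to back. This is only an illustration under stated assumptions: the class name ChunkStreams and the method concatenate are hypothetical, the chunks are assumed to already be on local disk as separate files, and the caller is assumed to supply them in the correct order.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper (not part of Tika): presents chunk files on disk
// as one continuous stream.
final class ChunkStreams {
    private ChunkStreams() {}

    /**
     * Returns a stream that yields the bytes of each chunk file in list order.
     * A chunk file is opened only when the stream reaches it; the first one
     * is opened as soon as this method is called.
     */
    static InputStream concatenate(List<Path> chunksInOrder) {
        // Copy the list so later changes by the caller cannot affect the stream.
        Iterator<Path> remaining = List.copyOf(chunksInOrder).iterator();
        return new SequenceInputStream(new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return remaining.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    return Files.newInputStream(remaining.next());
                } catch (IOException e) {
                    // Enumeration cannot throw checked exceptions.
                    throw new UncheckedIOException(e);
                }
            }
        });
    }
}
```

The returned stream could then be handed to a Tika parser in place of a single file stream, for instance through Parser.parse(InputStream, ContentHandler, Metadata, ParseContext). The sketch only addresses reassembling the bytes; it does not change how much memory or time the parser itself needs for a given document.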
>>>>>>
>>>>>> -- Ken
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>> ----------------------------------------------------------------------
>>>>>>>
>>>>>>> *From:* raghu vittal
>>>>>>>
>>>>>>> *Sent:* February 24, 2016 4:32:01am PST
>>>>>>>
>>>>>>> *To:* [email protected] <mailto:[email protected]>
>>>>>>>
>>>>>>> *Subject:* Re: Unable to extract content from chunked portion of large file
>>>>>>>
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> In our application, users can upload large files. Our intention is
>>>>>>> to extract the content from a large file and load it into Elastic
>>>>>>> for content-based search.
>>>>>>> We have .xlsx and .doc files larger than 300 MB. Sending a file
>>>>>>> that large to Tika causes timeouts.
>>>>>>>
>>>>>>> I tried taking a chunk of the file and passing it to Tika. Tika gave
>>>>>>> me an invalid data exception.
>>>>>>>
>>>>>>> I think we need to pass the entire file to Tika at once to extract
>>>>>>> the content.
>>>>>>>
>>>>>>> Raghu.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>> ----------------------------------------------------------------------
>>>>>>> *From:*Ken Krugler <[email protected]
>>>>>>> <mailto:[email protected]>>
>>>>>>> *Sent:*Friday, February 19, 2016 8:22 PM
>>>>>>> *To:*[email protected] <mailto:[email protected]>
>>>>>>> *Subject:*RE: Unable to extract content from chunked portion of large file
>>>>>>>
>>>>>>> One option is to create your own RESTful API that lets you send
>>>>>>> chunks of the file, and then you can provide an input stream that
>>>>>>> provides the seamless data view of the chunks to Tika (which is
>>>>>>> what it needs).
>>>>>>>
>>>>>>> -- Ken
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 
>>>>>>>> ----------------------------------------------------------------------
>>>>>>>> *From:*raghu vittal
>>>>>>>> *Sent:*February 19, 2016 1:37:49am PST
>>>>>>>> *To:*[email protected] <mailto:[email protected]>
>>>>>>>> *Subject:*Unable to extract content from chunked portion of large file
>>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika
>>>>>>>> to extract the content and load it into Elasticsearch for
>>>>>>>> full-text search. Sending very large files to Tika causes an
>>>>>>>> out-of-memory exception.
>>>>>>>>
>>>>>>>> We want to chunk the file and send the chunks to Tika for content
>>>>>>>> extraction. When we passed a chunk of the file to Tika, it
>>>>>>>> returned empty text. I assume Tika relies on the file structure,
>>>>>>>> which is why it does not return any content.
>>>>>>>>
>>>>>>>> We are using Tika Server (REST API) in our .NET application.
>>>>>>>>
>>>>>>>> Please suggest a better approach for this scenario.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Raghu.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --------------------------
>>>>>> Ken Krugler
>>>>>> +1 530-210-6378
>>>>>> http://www.scaleunlimited.com
>>>>>> custom big data solutions & training
>>>>>> Hadoop, Cascading, Cassandra & Solr
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sergey Beryozkin
>>>>>
>>>>> Talend Community Coders
>>>>> http://coders.talend.com/
>>>>
>>>
>>
>
