thanks mucho my friend

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 8:58 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file

>Hi Chris
>
>Sure, I've opened
>https://issues.apache.org/jira/browse/TIKA-1871
>
>and assigned it to myself; I will add some info about multipart/form-data
>asap.
>
>Cheers, Sergey
>
>
>On 24/02/16 16:40, Mattmann, Chris A (3980) wrote:
>> +1, please just remove it from the wiki, since your research clearly
>> supports that. Thanks Sergey!
>>
>>
>> -----Original Message-----
>> From: Sergey Beryozkin <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, February 24, 2016 at 7:44 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Unable to extract content from chunked portion of large
>> file
>>
>>> Hi All
>>>
>>> If a large file is passed to a Tika server as a multipart/form payload,
>>> then CXF will create a temp file on the disk itself.
>>>
>>> Hmm...
>>> I was looking for a reference to it and I found the advice not to
>>> use multipart/form-data:
>>> https://wiki.apache.org/tika/TikaJAXRS (in Services)
>>>
>>> I believe it should be removed, see:
>>>
>>> http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java,
>>> example:
>>>
>>> @POST
>>> @Consumes("multipart/form-data")
>>> @Produces("text/plain")
>>> @Path("form")
>>> public StreamingOutput getTextFromMultipart(Attachment att,
>>>         @Context final UriInfo info) {
>>>     return produceText(att.getObject(InputStream.class),
>>>             att.getHeaders(), info);
>>> }
>>>
>>>
>>> Cheers, Sergey
>>>
>>>
>>>
>>> On 24/02/16 15:37, Ken Krugler wrote:
>>>> Hi Raghu,
>>>>
>>>> I don't think you understood what I was proposing.
>>>>
>>>> I suggested creating a service that could receive chunks of the file
>>>> (persisted to local disk). Then this service could implement an input
>>>> stream class that would read sequentially from these pieces. This input
>>>> stream would be passed to Tika, thus giving Tika a single continuous
>>>> stream of data to the entire file content.
>>>>
>>>> -- Ken
>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> *From:* raghu vittal
>>>>>
>>>>> *Sent:* February 24, 2016 4:32:01am PST
>>>>>
>>>>> *To:* [email protected] <mailto:[email protected]>
>>>>>
>>>>> *Subject:* Re: Unable to extract content from chunked portion of large
>>>>> file
>>>>>
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> In our application, users can upload large files. Our intention is to
>>>>> extract the content out of a large file and dump it in Elasticsearch
>>>>> for content-based search. We have > 300 MB .xlsx and .doc files, and
>>>>> sending such a large file to Tika causes timeout issues.
>>>>>
>>>>> I tried getting a chunk of the file and passing it to Tika. Tika gave
>>>>> me an invalid data exception.
>>>>>
>>>>> I think for Tika we need to pass the entire file at once to extract
>>>>> the content.
>>>>>
>>>>> Raghu.
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Ken Krugler <[email protected]
>>>>> <mailto:[email protected]>>
>>>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>>>> *To:* [email protected] <mailto:[email protected]>
>>>>> *Subject:* RE: Unable to extract content from chunked portion of large
>>>>> file
>>>>>
>>>>> One option is to create your own RESTful API that lets you send chunks
>>>>> of the file, and then you can provide an input stream that presents
>>>>> the seamless data view of the chunks to Tika (which is what it needs).
>>>>>
>>>>> -- Ken
>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* raghu vittal
>>>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>>>> *To:* [email protected] <mailto:[email protected]>
>>>>>> *Subject:* Unable to extract content from chunked portion of large
>>>>>> file
>>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>>>> extract content and dump the data in Elasticsearch for full-text
>>>>>> search. Sending very large files to Tika causes an out-of-memory
>>>>>> exception.
>>>>>>
>>>>>> We want to chunk the file and send it to Tika for content extraction.
>>>>>> When we passed a chunked portion of a file to Tika, it returned empty
>>>>>> text. I assume Tika relies on the file structure; that is why it is
>>>>>> not returning any content.
>>>>>>
>>>>>> We are using the Tika server (REST API) from our .NET application.
>>>>>>
>>>>>> Please suggest a better approach for this scenario.
>>>>>>
>>>>>> Regards,
>>>>>> Raghu.
>>>>
>>>> --------------------------
>>>> Ken Krugler
>>>> +1 530-210-6378
>>>> http://www.scaleunlimited.com
>>>> custom big data solutions & training
>>>> Hadoop, Cascading, Cassandra & Solr
>>>
>>>
>>> --
>>> Sergey Beryozkin
>>>
>>> Talend Community Coders
>>> http://coders.talend.com/
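[Editor's note: Ken's proposal in the thread above — persist each uploaded chunk to local disk, then hand Tika one continuous stream over the stored pieces — maps naturally onto `java.io.SequenceInputStream` from the JDK. Below is a minimal sketch under that assumption; the class and method names (`ChunkedUploadStore`, `saveChunk`, `openAsSingleStream`) are illustrative only and are not part of any Tika or CXF API.]

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Vector;

/**
 * Sketch of the chunk-receiving service Ken describes: each uploaded
 * chunk is persisted as its own file, and the stored pieces are later
 * exposed to Tika as one continuous InputStream.
 */
public class ChunkedUploadStore {

    /** Persists one uploaded chunk as its own file under the upload's directory. */
    public static Path saveChunk(Path uploadDir, int index, byte[] chunk) throws IOException {
        Path part = uploadDir.resolve(String.format("part-%05d", index));
        return Files.write(part, chunk,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }

    /**
     * Concatenates the chunk files, in order, into a single stream.
     * SequenceInputStream reads each piece to EOF and then moves on to
     * the next, so the consumer sees one uninterrupted byte stream of
     * the entire original file.
     */
    public static InputStream openAsSingleStream(List<Path> chunks) throws IOException {
        Vector<InputStream> streams = new Vector<>();
        for (Path chunk : chunks) {
            streams.add(Files.newInputStream(chunk));
        }
        return new SequenceInputStream(streams.elements());
    }
}
```

The stream returned by `openAsSingleStream` can then be handed to Tika exactly as a whole-file stream would be; the chunking stays invisible to the parser, which is why this avoids the empty-text and invalid-data results Raghu saw when parsing a chunk in isolation.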
