+1. Please just remove that advice from the wiki, since per your research
the server clearly supports multipart. Thanks, Sergey!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 7:44 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file

>Hi All
>
>If a large file is passed to a Tika server as a multipart/form-data
>payload, then CXF itself will create a temp file on disk.
>
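>A minimal sketch of the CXF knobs behind that behaviour, for reference:
>the two property names are real CXF contextual properties, but the
>standalone server bootstrap below is an illustrative assumption, not
>tika-server's actual startup code.
>
>import java.util.HashMap;
>import java.util.Map;
>
>import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
>import org.apache.tika.server.resource.TikaResource;
>
>public class AttachmentConfigSketch {
>    public static void main(String[] args) {
>        JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
>        sf.setAddress("http://localhost:9998/");
>        sf.setResourceClasses(TikaResource.class);
>
>        Map<String, Object> props = new HashMap<>();
>        // Multipart attachments above this many bytes are spooled to
>        // disk instead of being held in memory...
>        props.put("attachment-memory-threshold", "1048576");
>        // ...into temp files created under this directory.
>        props.put("attachment-directory", "/tmp/cxf-attachments");
>        sf.setProperties(props);
>
>        sf.create();
>    }
>}
>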
>Hmm... I was looking for a reference to that behaviour, and instead I
>found the advice not to use multipart/form-data at
>https://wiki.apache.org/tika/TikaJAXRS (in the Services section).
>
>I believe it should be removed, see:
>
>http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
>
>for example:
>
>@POST
>@Consumes("multipart/form-data")
>@Produces("text/plain")
>@Path("form")
>public StreamingOutput getTextFromMultipart(Attachment att,
>        @Context final UriInfo info) {
>    return produceText(att.getObject(InputStream.class),
>            att.getHeaders(), info);
>}
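>
>A minimal client-side sketch of calling that endpoint; the
>http://localhost:9998/tika/form URL and the "file" part name are
>assumptions about a default tika-server setup, not taken from this
>thread.
>
>import java.io.ByteArrayInputStream;
>import java.io.InputStream;
>import java.io.SequenceInputStream;
>import java.net.URI;
>import java.net.http.HttpClient;
>import java.net.http.HttpRequest;
>import java.net.http.HttpResponse;
>import java.nio.charset.StandardCharsets;
>import java.nio.file.Files;
>import java.nio.file.Path;
>import java.util.Collections;
>import java.util.List;
>
>public class TikaFormClient {
>    public static void main(String[] args) throws Exception {
>        Path file = Path.of(args[0]);
>        String boundary = "----tika" + System.nanoTime();
>
>        // Multipart framing around the single "file" part.
>        byte[] head = ("--" + boundary + "\r\n"
>                + "Content-Disposition: form-data; name=\"file\"; filename=\""
>                + file.getFileName() + "\"\r\n"
>                + "Content-Type: application/octet-stream\r\n\r\n")
>                .getBytes(StandardCharsets.UTF_8);
>        byte[] tail = ("\r\n--" + boundary + "--\r\n")
>                .getBytes(StandardCharsets.UTF_8);
>
>        HttpRequest req = HttpRequest.newBuilder(
>                        URI.create("http://localhost:9998/tika/form"))
>                .header("Content-Type",
>                        "multipart/form-data; boundary=" + boundary)
>                // Stream head + file + tail so the client never buffers
>                // the whole file in memory.
>                .POST(HttpRequest.BodyPublishers.ofInputStream(() -> {
>                    try {
>                        List<InputStream> parts = List.of(
>                                new ByteArrayInputStream(head),
>                                Files.newInputStream(file),
>                                new ByteArrayInputStream(tail));
>                        return new SequenceInputStream(
>                                Collections.enumeration(parts));
>                    } catch (Exception e) {
>                        throw new RuntimeException(e);
>                    }
>                }))
>                .build();
>
>        HttpResponse<String> resp = HttpClient.newHttpClient()
>                .send(req, HttpResponse.BodyHandlers.ofString());
>        System.out.println(resp.body());
>    }
>}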
>
>
>Cheers, Sergey
>
>
>
>On 24/02/16 15:37, Ken Krugler wrote:
>> Hi Raghu,
>>
>> I don't think you understood what I was proposing.
>>
>> I suggested creating a service that could receive chunks of the file
>> (persisted to local disk). Then this service could implement an input
>> stream class that would read sequentially from these pieces. This input
>> stream would be passed to Tika, thus giving Tika a single continuous
>> stream of data to the entire file content.
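>>
>> A minimal sketch of that input stream, assuming the chunks have already
>> been persisted as files chunk-0 ... chunk-N in one directory (the names
>> and the ChunkedFileStream class are illustrative, not existing Tika or
>> CXF API):
>>
>> import java.io.IOException;
>> import java.io.InputStream;
>> import java.io.SequenceInputStream;
>> import java.io.UncheckedIOException;
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.util.Collections;
>> import java.util.List;
>> import java.util.stream.Collectors;
>> import java.util.stream.IntStream;
>>
>> public class ChunkedFileStream {
>>
>>     // Concatenate the persisted chunk files into one continuous
>>     // InputStream, so Tika sees the whole document as a single stream.
>>     public static InputStream open(Path dir, int chunkCount) {
>>         List<InputStream> parts = IntStream.range(0, chunkCount)
>>                 .mapToObj(i -> openChunk(dir.resolve("chunk-" + i)))
>>                 .collect(Collectors.toList());
>>         return new SequenceInputStream(Collections.enumeration(parts));
>>     }
>>
>>     private static InputStream openChunk(Path p) {
>>         try {
>>             return Files.newInputStream(p);
>>         } catch (IOException e) {
>>             throw new UncheckedIOException("missing chunk: " + p, e);
>>         }
>>     }
>> }
>>
>> The stream returned by open() can then be handed to Tika exactly as if
>> the whole file had arrived in one piece.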
>>
>> -- Ken
>>
>>> 
>>>------------------------------------------------------------------------
>>>
>>> *From:* raghu vittal
>>>
>>> *Sent:* February 24, 2016 4:32:01am PST
>>>
>>> *To:* [email protected] <mailto:[email protected]>
>>>
>>> *Subject:* Re: Unable to extract content from chunked portion of large file
>>>
>>>
>>> Thanks for your reply.
>>>
>>> In our application, users can upload large files. Our intention is to
>>> extract the content from each large file and dump it into Elasticsearch
>>> for content-based search.
>>> We have .xlsx and .doc files larger than 300 MB, and sending such large
>>> files to Tika causes timeout issues.
>>>
>>> I tried taking a chunk of a file and passing it to Tika, but Tika gave
>>> me an invalid-data exception.
>>>
>>> I think we need to pass the entire file to Tika at once to extract the
>>> content.
>>>
>>> Raghu.
>>>
>>> 
>>>------------------------------------------------------------------------
>>> *From:* Ken Krugler <[email protected] <mailto:[email protected]>>
>>> *Sent:* Friday, February 19, 2016 8:22 PM
>>> *To:* [email protected] <mailto:[email protected]>
>>> *Subject:* RE: Unable to extract content from chunked portion of large file
>>> One option is to create your own RESTful API that lets you send chunks
>>> of the file, and then provide an input stream that gives Tika a seamless
>>> view of the chunked data (which is what it needs).
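>>>
>>> A rough sketch of such a chunk-receiving resource; the /upload path,
>>> the chunk naming, and the temp directory are made up for illustration
>>> and are not part of tika-server:
>>>
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>> import java.nio.file.Files;
>>> import java.nio.file.Paths;
>>> import java.nio.file.StandardCopyOption;
>>> import javax.ws.rs.PUT;
>>> import javax.ws.rs.Path;
>>> import javax.ws.rs.PathParam;
>>> import javax.ws.rs.core.Response;
>>>
>>> @Path("/upload")
>>> public class ChunkUploadResource {
>>>
>>>     // Each PUT stores one numbered chunk of a document on local disk.
>>>     // Once all chunks are present, they can be stitched together with
>>>     // a SequenceInputStream (as in the sketch above) and fed to Tika
>>>     // as one continuous stream.
>>>     @PUT
>>>     @Path("/{docId}/chunks/{n}")
>>>     public Response putChunk(@PathParam("docId") String docId,
>>>                              @PathParam("n") int n,
>>>                              InputStream body) throws IOException {
>>>         java.nio.file.Path dir = Paths.get("/tmp/uploads", docId);
>>>         Files.createDirectories(dir);
>>>         Files.copy(body, dir.resolve("chunk-" + n),
>>>                 StandardCopyOption.REPLACE_EXISTING);
>>>         return Response.ok().build();
>>>     }
>>> }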
>>>
>>> -- Ken
>>>
>>>> 
>>>>------------------------------------------------------------------------
>>>> *From:* raghu vittal
>>>> *Sent:* February 19, 2016 1:37:49am PST
>>>> *To:* [email protected] <mailto:[email protected]>
>>>> *Subject:* Unable to extract content from chunked portion of large file
>>>>
>>>> Hi All
>>>>
>>>> We have very large PDF, .docx, and .xlsx files. We are using Tika to
>>>> extract their content and dump the data into Elasticsearch for
>>>> full-text search, but sending very large files to Tika causes an
>>>> out-of-memory exception.
>>>>
>>>> We want to chunk each file and send the chunks to Tika for content
>>>> extraction, but when we passed a chunked portion of a file to Tika it
>>>> returned empty text. I assume Tika relies on the file structure, which
>>>> is why it does not return any content.
>>>>
>>>> We are using the Tika Server (REST API) from our .NET application.
>>>>
>>>> Please suggest a better approach for this scenario.
>>>>
>>>> Regards,
>>>> Raghu.
>>
>>
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>>
>>
>>
>>
>>
>
>
>-- 
>Sergey Beryozkin
>
>Talend Community Coders
>http://coders.talend.com/
