Time to start contributing to Tika again :-)
Cheers, Sergey
On 24/02/16 17:01, Mattmann, Chris A (3980) wrote:
thanks mucho my friend
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 8:58 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file
Hi Chris
Sure, I've opened
https://issues.apache.org/jira/browse/TIKA-1871
and assigned it to myself; I'll add some info about multipart/form-data ASAP
Cheers, Sergey
On 24/02/16 16:40, Mattmann, Chris A (3980) wrote:
+1, please just remove it from the wiki, since your research clearly shows
that it is supported. Thanks Sergey!
-----Original Message-----
From: Sergey Beryozkin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 24, 2016 at 7:44 AM
To: "[email protected]" <[email protected]>
Subject: Re: Unable to extract content from chunked portion of large file
Hi All
If a large file is passed to a Tika server as a multipart/form-data payload,
then CXF itself will create a temp file on disk.
Hmm... I was looking for a reference to that and found the advice not to use
multipart/form-data:
https://wiki.apache.org/tika/TikaJAXRS (in the Services section)
I believe it should be removed, see:
http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
for example:
@POST
@Consumes("multipart/form-data")
@Produces("text/plain")
@Path("form")
public StreamingOutput getTextFromMultipart(Attachment att,
        @Context final UriInfo info) {
    return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
}
Cheers, Sergey
On 24/02/16 15:37, Ken Krugler wrote:
Hi Raghu,
I don't think you understood what I was proposing.
I suggested creating a service that could receive chunks of the file
(persisted to local disk). This service could then implement an input stream
class that reads sequentially from these pieces. That input stream would be
passed to Tika, giving Tika a single continuous stream of the entire file
content.
-- Ken
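A minimal sketch of the input-stream idea above, assuming the chunks have already been persisted as local files (the chunk file names and sizes here are made up for illustration): java.io.SequenceInputStream chains the per-chunk streams into one continuous stream that could be handed to Tika in place of a single file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class ChunkStream {

    // Present chunk files persisted on local disk as one continuous stream.
    static InputStream concat(List<Path> chunks) throws IOException {
        Vector<InputStream> parts = new Vector<>();
        for (Path p : chunks) {
            parts.add(Files.newInputStream(p));
        }
        // SequenceInputStream reads each part to EOF, then moves on to the next.
        return new SequenceInputStream(parts.elements());
    }

    public static void main(String[] args) throws IOException {
        byte[] full = "a single continuous stream of the entire file content".getBytes();

        // Persist the upload in three pieces, as the chunk-receiving service would.
        Path dir = Files.createTempDirectory("chunks");
        List<Path> chunks = new ArrayList<>();
        int size = (full.length + 2) / 3;
        for (int i = 0; i * size < full.length; i++) {
            Path part = dir.resolve("chunk-" + i);
            Files.write(part,
                    Arrays.copyOfRange(full, i * size, Math.min(full.length, (i + 1) * size)));
            chunks.add(part);
        }

        // This is the stream a parser would consume in place of the whole file.
        byte[] joined;
        try (InputStream in = concat(chunks)) {
            joined = in.readAllBytes();
        }
        System.out.println(Arrays.equals(joined, full)); // prints "true"
    }
}
```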
------------------------------------------------------------------------
*From:* raghu vittal
*Sent:* February 24, 2016 4:32:01am PST
*To:* [email protected]
*Subject:* Re: Unable to extract content from chunked portion of large file
Thanks for your reply.
In our application, users can upload large files. Our intention is to
extract the content from these large files and dump it into Elasticsearch
for content-based search.
We have .xlsx and .doc files over 300 MB in size; sending files that large
to Tika causes timeout issues.
I tried getting a chunk of the file and passing it to Tika, but Tika gave me
an invalid-data exception.
I think Tika needs the entire file to be passed at once to extract content.
Raghu.
------------------------------------------------------------------------
*From:* Ken Krugler <[email protected]>
*Sent:* Friday, February 19, 2016 8:22 PM
*To:* [email protected]
*Subject:* RE: Unable to extract content from chunked portion of large file
One option is to create your own RESTful API that lets you send chunks of
the file, and then provide an input stream that presents a seamless view of
the chunks' data to Tika (which is what it needs).
-- Ken
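As a sketch of what the chunk-producing side of such an API might look like (the chunk size, file names, and the actual upload call are all illustrative assumptions), a large file can be cut into fixed-size pieces before being sent to the chunk-receiving service:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSplitter {

    // Cut a large file into fixed-size chunk files written next to the original.
    static List<Path> split(Path file, int chunkSize) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[chunkSize];
            int read;
            int i = 0;
            // readNBytes keeps reading until the buffer is full or EOF is hit,
            // so every chunk except possibly the last has exactly chunkSize bytes.
            while ((read = in.readNBytes(buf, 0, chunkSize)) > 0) {
                Path part = file.resolveSibling(file.getFileName() + ".part" + i++);
                Files.write(part, Arrays.copyOf(buf, read));
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("upload", ".bin");
        Files.write(file, new byte[25]); // stand-in for a very large document
        List<Path> parts = split(file, 10);
        System.out.println(parts.size()); // prints "3" (10 + 10 + 5 bytes)
    }
}
```

Each chunk file would then be POSTed to the service, which stores them until the whole upload is complete and streams them back to Tika as one continuous stream.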
------------------------------------------------------------------------
*From:* raghu vittal
*Sent:* February 19, 2016 1:37:49am PST
*To:* [email protected]
*Subject:* Unable to extract content from chunked portion of large file
Hi All
We have very large PDF, .docx, and .xlsx files. We are using Tika to extract
their content and dump the data into Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception, so we
want to chunk the file and send it to Tika for content extraction. However,
when we passed a chunked portion of a file to Tika, it returned empty text.
I assume Tika relies on the file structure, which is why it is not returning
any content.
We are using Tika Server (REST API) from our .NET application.
Please suggest a better approach for this scenario.
Regards,
Raghu.
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/