Thanks for your reply.

In our application, users can upload large files. Our intention is to extract the
content from these large files and index it in Elasticsearch for content-based search.

We have .xlsx and .doc files larger than 300 MB. Sending files that large to Tika
causes timeout issues.


I tried taking a chunk of the file and passing it to Tika, but Tika gave me an
invalid-data exception.


I think Tika needs the entire file at once to extract the content.


Raghu.

________________________________
From: Ken Krugler <[email protected]>
Sent: Friday, February 19, 2016 8:22 PM
To: [email protected]
Subject: RE: Unable to extract content from chunked portion of large file

One option is to create your own RESTful API that lets you send chunks of the
file, and then provide Tika with an input stream that presents a seamless view
of the chunks' data (which is what it needs).
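
As a rough sketch of what the Tika side could look like, assuming your upload
endpoint has already written the chunks to disk in order (the class name and
chunk paths below are just placeholders):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ChunkedUploadExtractor {

    // Present the uploaded chunk files, in order, as one continuous stream.
    // Tika's parsers need a seamless view of the whole file, not the chunks.
    static InputStream concatenated(List<Path> chunks) throws IOException {
        List<InputStream> parts = new ArrayList<>();
        for (Path chunk : chunks) {
            parts.add(new BufferedInputStream(Files.newInputStream(chunk)));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws Exception {
        // Chunk files written by the (hypothetical) upload endpoint, in order.
        List<Path> chunks = Arrays.asList(
                Paths.get("uploads/bigfile.part0"),
                Paths.get("uploads/bigfile.part1"),
                Paths.get("uploads/bigfile.part2"));

        AutoDetectParser parser = new AutoDetectParser();
        // Stream the extracted text to a file instead of buffering it in
        // memory, so a huge document doesn't blow the heap on the way out.
        try (Writer out = Files.newBufferedWriter(Paths.get("extracted.txt"));
             InputStream in = concatenated(chunks)) {
            parser.parse(in, new BodyContentHandler(out), new Metadata(),
                    new ParseContext());
        }
    }
}

Note that for container formats like .docx and .xlsx (which are ZIP archives)
Tika may still spool the stream to a temporary file internally, since those
parsers need random access; at that point disk space, not heap, becomes the
limiting factor. Writing the handler output to a Writer also keeps the
extracted text itself from piling up in memory.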

-- Ken

________________________________

From: raghu vittal
Sent: February 19, 2016 1:37:49am PST
To: [email protected]
Subject: Unable to extract content from chunked portion of large file


Hi all,

We have very large PDF, .docx, and .xlsx files. We are using Tika to extract
their content and index it in Elasticsearch for full-text search.
Sending very large files to Tika causes an out-of-memory exception.

We want to chunk the file and send it to Tika for content extraction, but when
we passed a chunked portion of a file to Tika it returned empty text.
I assume Tika relies on the file structure, which is why it does not return any content.

We are using Tika Server (REST API) from our .NET application.

Please suggest a better approach for this scenario.

Regards,
Raghu.



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr