One option is to create your own RESTful API that accepts the file in chunks, and then wrap those chunks in a single input stream that presents a seamless view of the full file to Tika (which is what it needs).
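A minimal sketch of the server-side reassembly, using only `SequenceInputStream` from the Java standard library (the chunk-upload endpoint itself is omitted, and the `combine` helper is illustrative — in practice the chunks would likely be temp files on disk rather than in-memory byte arrays):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Collections;
import java.util.List;

public class ChunkStream {

    // Stitch uploaded chunks (received in order) into one InputStream that
    // reads as the complete original file. Tika can then parse this stream
    // as a whole document, so its structure-dependent parsers (PDF, OOXML)
    // see the file intact instead of a fragment.
    static InputStream combine(List<byte[]> chunks) {
        List<InputStream> streams = chunks.stream()
                .map(c -> (InputStream) new ByteArrayInputStream(c))
                .toList();
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    public static void main(String[] args) throws IOException {
        // Two chunks that together form one document.
        List<byte[]> chunks = List.of("Hello, ".getBytes(), "world".getBytes());
        try (InputStream in = combine(chunks)) {
            System.out.println(new String(in.readAllBytes()));
        }
    }
}
```

The resulting stream could be handed to Tika's `AutoDetectParser.parse(...)` (or streamed to Tika Server) exactly as if the whole file had been uploaded at once — the key point being that Tika always sees the complete, structurally valid file.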
-- Ken

> From: raghu vittal
> Sent: February 19, 2016 1:37:49am PST
> To: [email protected]
> Subject: Unable to extract content from chunked portion of large file
>
> Hi All,
>
> We have very large PDF, .docx, and .xlsx files. We are using Tika to extract
> content and dump the data into Elasticsearch for full-text search.
> Sending very large files to Tika causes an out-of-memory exception.
>
> We want to chunk the file and send it to Tika for content extraction. When we
> passed a chunked portion of a file to Tika, it returned empty text.
> I assume Tika relies on the file structure, which is why it is not returning
> any content.
>
> We are using Tika Server (REST API) in our .NET application.
>
> Please suggest a better approach for this scenario.
>
> Regards,
> Raghu.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
