Hi,

We use Apache Tika in our application to extract content before sending it to Solr for indexing. Some of our documents are quite large (over 150 MB, with more than 30 MB of text-only content). Processing such documents often results in out-of-memory errors at runtime. Of course, increasing the maximum heap size resolves the issue; another option we use is to index the content in chunks of 5 MB.

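To illustrate the chunking idea, here is a minimal sketch using only the Java standard library. It is not our production code: the class and method names (`ChunkedText`, `forEachChunk`) are hypothetical, the chunk size is counted in characters rather than bytes as a simplification, and it assumes the extracted text is available as a `java.io.Reader`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper: streams text from a Reader and hands it to a
// consumer in fixed-size pieces, so only one chunk is held in memory.
final class ChunkedText {

    private ChunkedText() {
    }

    // maxChars is the chunk size in characters (not bytes).
    // A chunk boundary may fall between the two halves of a surrogate pair;
    // this sketch does not try to avoid that.
    static void forEachChunk(Reader reader, int maxChars, Consumer<String> sink) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        char[] buffer = new char[maxChars];
        int filled = 0;
        try {
            int n;
            // read() may return fewer characters than requested,
            // so keep filling until the buffer is full or the input ends.
            while ((n = reader.read(buffer, filled, maxChars - filled)) != -1) {
                filled += n;
                if (filled == maxChars) {
                    sink.accept(new String(buffer, 0, filled));
                    filled = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit the final, possibly shorter, chunk.
        if (filled > 0) {
            sink.accept(new String(buffer, 0, filled));
        }
    }
}
```

In a setup like ours, the consumer would be whatever submits one chunk to Solr; stopping after the first chunk would give the "first N characters only" behaviour asked about below.
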
On closer analysis, we realized that most of our keywords lie in the first 1-2 MB of such documents, and indexing just that portion is sufficient for our requirements. Is there any provision in the Tika APIs to extract only the first 1 or 2 MB (configurable) of the content instead of parsing the entire document? If not, can someone point me to the part of the code I could modify to implement this?

Thanks,
Kumar
