1. I am using Java version 8.

2. The following are the sizes of the documents our system processed over a period of 2 hours:

   greater than 5MB = 26
   greater than 1MB, less than 5MB = 138
   greater than 500KB, less than 1MB = 134
   greater than 100KB, less than 500KB = 598
   less than 100KB = 5000

3. Unfortunately we don't have access to production data, as this is part of our agreement with the customer.

4. The product is an email archival system, which archives user data in near real time. While archiving, it also extracts the data and stores it in Solr/Elasticsearch so that users can search it. Therefore we run this extraction across all of the data. We process around 7 to 8 million emails a day.

Regards,
Gaurav

On Wed, May 16, 2018 at 8:09 PM, John Patrick <[email protected]> wrote:

> What java version are you using?
> What size documents are you using?
> Do you have sample files?
> How frequently are you doing the conversion, as sometimes performance
> improves after the 1st document but is always slow for the 1st
> document.
>
> I had issues myself previously and either upgraded the java version to
> the latest or tika and sometimes the performance improved.
>
> Compare the same version with and without, as if you compare one
> version with and another version without you are not comparing like for
> like, so other factors might come into play.
>
> On 16 May 2018 at 13:59, Gaurav Sehgal <[email protected]> wrote:
> > Hello,
> >
> > I am using Tika 1.9, and want to improve the performance of the
> > following document types:
> >
> > 1. PDF
> > 2. Microsoft Word / Excel
> > 3. ZIP
> >
> > For PDF I tried to fine tune the PDFParserConfig, by using
> > setUseNonSequentialParser to true, which according to the document should
> > improve the performance, but unfortunately I did not see any improvement.
> >
> > Are there any other tunables I can use to improve the performance for the
> > above document types?
> >
> > Any guidance will be greatly appreciated.
> >
> > Regards,
> > Gaurav
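
For the like-for-like comparison and the slow first document mentioned in the quoted reply, here is a minimal timing sketch that runs on Java 8. It is only an illustration: `WarmupTiming`, `timeEach`, and `meanAfterWarmup` are hypothetical names, and the workload in `main` is a placeholder standing in for a real parse call, not a Tika API. The idea is to time every document separately, report the first run on its own, and average the remaining runs, so that the same harness can be run once with a setting enabled and once without.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Tika: times a task per input so the
// first (cold) run can be reported separately from the warmed-up runs.
final class WarmupTiming {

    /** Runs the task once per input, in order, and returns each run's elapsed nanoseconds. */
    static <T> List<Long> timeEach(List<T> inputs, Consumer<T> task) {
        List<Long> nanos = new ArrayList<>();
        for (T input : inputs) {
            long start = System.nanoTime();
            task.accept(input);
            nanos.add(System.nanoTime() - start);
        }
        return nanos;
    }

    /** Mean of the timings after skipping the first {@code skip} runs (the warm-up). */
    static double meanAfterWarmup(List<Long> nanos, int skip) {
        if (skip < 0 || skip >= nanos.size()) {
            throw new IllegalArgumentException("skip must be in [0, " + nanos.size() + ")");
        }
        long sum = 0;
        for (int i = skip; i < nanos.size(); i++) {
            sum += nanos.get(i);
        }
        return (double) sum / (nanos.size() - skip);
    }

    public static void main(String[] args) {
        // Placeholder inputs and workload (assumption): replace with real
        // documents and the actual extraction call being measured.
        List<String> docs = Arrays.asList("first", "second", "third", "fourth");
        Consumer<String> placeholderParse = s -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(s.charAt(i % s.length()));
            }
        };

        List<Long> timings = timeEach(docs, placeholderParse);
        System.out.println("first run (ns):       " + timings.get(0));
        System.out.println("mean after warm-up (ns): " + meanAfterWarmup(timings, 1));
    }
}
```

Running the same input set twice, changing only one setting between the runs, keeps the comparison like for like; the absolute numbers depend on the machine and JVM.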
