Performance problems with Tika 1.5 and Microsoft Office docx files

Mirko Sertic Tue, 11 Mar 2014 05:40:01 -0700

Hi there

I am encountering some performance problems while extracting content from 
Microsoft Office docx files with Tika 1.5.


It seems that Tika needs about 1.3 seconds per file to extract metadata and
content. I am using the Tika.parseToString() method. After some digging around
with JProfiler, I discovered that Tika uses the org.openxmlformats.schemas
XMLBeans classes a lot. The DocumentDocument class consumes a lot of CPU time
while parsing content.
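
For reference, a per-file figure like the one above could be measured with a
small harness along these lines. This is only a sketch: ExtractionTimer,
Extractor and averageMillis are hypothetical names, not part of Tika, and a
stand-in extractor is used so that it runs without Tika on the classpath. The
comment in main shows how Tika.parseToString() might be plugged in; whether the
Tika instance is created once or once per file is left open.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical timing harness; not part of Tika. */
final class ExtractionTimer {

    /** Hypothetical callback that turns one file into text. */
    interface Extractor {
        String extract(Path file) throws Exception;
    }

    /** Returns the mean wall-clock time per file in milliseconds. */
    static double averageMillis(Extractor extractor, List<Path> files) throws Exception {
        if (files.isEmpty()) {
            throw new IllegalArgumentException("no files to measure");
        }
        long start = System.nanoTime();
        for (Path file : files) {
            extractor.extract(file);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / files.size();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in extractor so the sketch runs without Tika.
        // With Tika on the classpath, something like this could be used instead
        // (instance created once or per file, depending on what is measured):
        //   Tika tika = new Tika();
        //   Extractor extractor = f -> tika.parseToString(f.toFile());
        Extractor extractor = f -> "extracted text of " + f.getFileName();
        List<Path> files = Arrays.asList(Paths.get("a.docx"), Paths.get("b.docx"));
        System.out.printf("%.3f ms per file%n", averageMillis(extractor, files));
    }
}
```

The first call through Tika is likely to include one-time class loading, so
measuring over many files, or discarding the first few, gives a steadier
number.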

Now, how can I speed up metadata and content extraction?

a) Is the Tika class stateful? Do I have to create a new instance for every
document, or can I reuse it?
b) Are the parsers stateful? Do I have to create a new parser for every
document, or can I reuse it?
c) How can I tune the org.openxmlformats.schemas classes?
d) What are the best practices for running Tika in a multithreaded environment?

Thanks in advance
Mirko
