Hi there

I am encountering some performance problems while extracting content from 
Microsoft Office docx files with Tika 1.5.

It seems that Tika needs about 1.3 seconds to extract metadata and content per 
file. I am using the Tika.parseToString() method. After some digging around 
with JProfiler, I discovered that Tika uses the org.openxmlformats.schemas 
XMLBeans classes a lot. The DocumentDocument class consumes a lot of CPU time 
while parsing content.
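
One way to pin down a per-file figure like the one above is a small timing loop around the extraction call. The sketch below is only an illustration: the Extractor interface, the averageMillis helper and the file names are hypothetical stand-ins, chosen so that it runs without Tika on the classpath. With Tika, the extractor would presumably wrap a call such as tika.parseToString(new File(path)).

```java
import java.util.Arrays;
import java.util.List;

public class ExtractionTiming {

    /** Hypothetical stand-in for a call such as Tika.parseToString(File). */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Returns the average wall-clock time per document in milliseconds. */
    static double averageMillis(Extractor extractor, List<String> paths) throws Exception {
        if (paths.isEmpty()) {
            return 0.0; // nothing to measure
        }
        long start = System.nanoTime();
        for (String path : paths) {
            extractor.extract(path);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / paths.size();
    }

    public static void main(String[] args) throws Exception {
        // Dummy extractor so the sketch runs on its own; it does no real parsing.
        // With Tika (assumption): path -> tika.parseToString(new File(path))
        Extractor dummy = path -> "content of " + path;
        List<String> paths = Arrays.asList("a.docx", "b.docx", "c.docx");
        System.out.printf("average: %.3f ms per file%n", averageMillis(dummy, paths));
    }
}
```

On a JVM the first call usually includes one-time class-loading cost, so it may be worth timing the first document separately from the rest.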

Now, how can I speed up metadata and content extraction?

a) Is the Tika class stateful? Do I have to create a new instance for every 
document, or can I reuse it?
b) Are the parsers stateful? Do I have to create a new parser for every 
document, or can I reuse it?
c) How can I tune the org.openxmlformats.schemas classes?
d) What are the best practices to run Tika in a multithreaded environment?

Thanks in advance
Mirko
