Hi there, I am encountering some performance problems while extracting content from Microsoft Office .docx files with Tika 1.5.
It seems that Tika needs about 1.3 seconds per file to extract metadata and content. I am using the Tika.parseToString() method. After some digging around with JProfiler, I discovered that Tika makes heavy use of the org.openxmlformats.schemas XMLBeans classes. The DocumentDocument class in particular consumes a lot of CPU time while parsing content.

Now, how can I speed up metadata and content extraction?

a) Is the Tika class stateful? Do I have to create a new instance for every document, or can I reuse it?
b) Are the parsers stateful? Do I have to create a new parser for every document, or can I reuse it?
c) How can I tune the org.openxmlformats.schemas classes?
d) What are the best practices for running Tika in a multithreaded environment?

Thanks in advance,
Mirko
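For reference, here is a minimal sketch of how the per-file timing could be reproduced with one extractor shared across a thread pool. It is not the actual code from the application. The ExtractionTimer class and the Extractor interface are hypothetical names made up for this illustration, and the stand-in extractor in main() does no real parsing. With Tika on the classpath, the lambda could presumably be replaced by something like f -> tika.parseToString(f.toFile()), but whether a single shared Tika instance is safe to use this way is exactly what questions a) and d) ask.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ExtractionTimer {

    /** Hypothetical stand-in for the extraction call that is being measured. */
    interface Extractor {
        String extract(Path file) throws Exception;
    }

    /**
     * Runs the same extractor instance on every file using a fixed-size
     * thread pool and returns the elapsed nanoseconds per file.
     */
    static Map<Path, Long> timeAll(List<Path> files, Extractor extractor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (Path file : files) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    extractor.extract(file); // result discarded, only the time matters here
                    return System.nanoTime() - start;
                });
            }
            // invokeAll blocks until all tasks finish and keeps the task order.
            List<Future<Long>> results = pool.invokeAll(tasks);
            Map<Path, Long> nanosPerFile = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                nanosPerFile.put(files.get(i), results.get(i).get());
            }
            return nanosPerFile;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder file names; the stand-in extractor never opens them.
        List<Path> files = Arrays.asList(Paths.get("a.docx"), Paths.get("b.docx"));
        Extractor standIn = file -> file.getFileName().toString();
        timeAll(files, standIn, 2).forEach((file, nanos) ->
                System.out.println(file + ": " + (nanos / 1_000_000.0) + " ms"));
    }
}
```

Timing each file inside its own task keeps thread-pool queueing out of the measurement, so the numbers should be comparable to the single-threaded 1.3 seconds per file reported above.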
