On Fri, 20 Jan 2012, Allison, Timothy B. wrote:
> I'm just getting started with Tika, and I tried the basic AutoDetectParser and the basic ParsingReader on a batch of a few thousand docx files (tika-app v1.0). On my laptop, I was able to extract text at a rate of 200 docs per minute. When I ran XWPFWordExtractor (poi 3.8) on the same docs, the rate was 1000 docs per minute.

I'd expect some difference, but not one that big. The POI extractor just does plain text from Paragraphs, with no formatting and almost nothing else, so it should be quicker. The Tika extractor does Paragraphs and Tables, with some style information, hyperlinks, bookmarks, comments, notes, pictures, headers and footers. Because it extracts more information, and in a richer manner, it will take a little longer.

My guess would have been 2x rather than 5x, though. Are you able to do any profiling to see where the slowdown is?
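
Before reaching for a full profiler, a rough timing harness may help confirm the ratio. The following is only a sketch using the plain JDK: the class name ExtractorTiming and its methods are made up for illustration, and the extractor is passed in as a function, so the actual Tika or POI call is left for you to plug in. A single wall-clock pass like this ignores JIT warm-up, so treat the numbers as approximate.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Tika or POI.
public class ExtractorTiming {

    /** Runs the extractor over every input and returns elapsed wall-clock nanoseconds. */
    static <T> long timeNanos(List<T> inputs, Function<T, String> extractor) {
        long chars = 0;
        long start = System.nanoTime();
        for (T input : inputs) {
            // Use the result so the extraction work cannot be skipped.
            chars += extractor.apply(input).length();
        }
        long elapsed = System.nanoTime() - start;
        if (chars < 0) {
            throw new IllegalStateException("character count overflowed");
        }
        return elapsed;
    }

    /** Converts a document count and an elapsed time into docs per minute. */
    static double docsPerMinute(int docCount, long elapsedNanos) {
        return docCount * 60.0e9 / elapsedNanos;
    }

    /** How many times slower the slower extractor is than the faster one. */
    static double slowdown(double fasterRate, double slowerRate) {
        return fasterRate / slowerRate;
    }

    public static void main(String[] args) {
        // Stand-in inputs and extractor; replace with real files and a real
        // Tika or POI call wrapped in a Function.
        List<String> docs = List.of("first doc", "second doc", "third doc");
        long elapsed = timeNanos(docs, String::toUpperCase);
        System.out.println("stand-in rate: " + docsPerMinute(docs.size(), elapsed) + " docs/min");

        // The rates quoted above: 1000 docs/min (POI) against 200 docs/min (Tika).
        System.out.println("reported slowdown: " + slowdown(1000.0, 200.0));
    }
}
```

Running both extractors through timeNanos over the same list of files, ideally after a warm-up pass, gives two rates to compare. If the gap is still around 5x, a sampling profiler attached to the Tika run should show where the extra time goes.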

Nick
