Re: FW: Default Tika extraction of docx 5X slower than XWPFWordExtractor?

Nick Burch Fri, 20 Jan 2012 07:04:24 -0800

On Fri, 20 Jan 2012, Allison, Timothy B. wrote:

I'm just getting started with Tika, and I tried the basic AutoDetectParser and the basic ParsingReader on a batch of a few thousand docx files (tika-app v1.0). On my laptop, I was able to extract text at a rate of 200 docs per minute. When I ran XWPFWordExtractor (poi 3.8) on the same docs, the rate was 1000 docs per minute.

I'd expect a slight difference, but not that big. The POI extractor just does plain text from Paragraphs, with no formatting and almost nothing else, so it should be quicker. The Tika extractor does Paragraphs and Tables, with some style information, hyperlinks, bookmarks, comments, notes, pictures, headers and footers. Because it extracts more information, and in a richer manner, it will take a little longer.

2x would be my guess though, rather than 5x. Are you able to do any profiling to see where the slowdown is?
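As a rough starting point before reaching for a real profiler, a small timing harness along these lines could confirm the ratio on a fixed batch. This is only a sketch: the class and method names (ExtractionTimer, timeExtraction, docsPerMinute) are made up for illustration, and the two lambdas in main are stand-ins, not the Tika or POI extractors. In practice you would plug in functions that call the two extractors on each file, and run a warm-up pass first so JIT compilation does not skew the numbers.

```java
import java.util.List;
import java.util.function.Function;

public class ExtractionTimer {

    /** Throughput in documents per minute for a run that took elapsedNanos. */
    public static double docsPerMinute(int docCount, long elapsedNanos) {
        return docCount * 60.0e9 / elapsedNanos;
    }

    /** Runs the extractor over every input and returns the elapsed wall-clock nanoseconds. */
    public static <T> long timeExtraction(List<T> inputs, Function<T, String> extractor) {
        long charsSeen = 0;
        long start = System.nanoTime();
        for (T input : inputs) {
            // Consume the result so the extraction work cannot be optimised away.
            charsSeen += extractor.apply(input).length();
        }
        long elapsed = System.nanoTime() - start;
        if (charsSeen < 0) {
            throw new IllegalStateException("character count overflowed");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Stand-in inputs and extractors (assumptions for illustration only).
        List<String> docs = List.of("alpha", "beta", "gamma");
        Function<String, String> plainText = s -> s;
        Function<String, String> richText = s -> "<p>" + s + "</p>";

        // Guard against a zero reading on very fast runs.
        long plainNanos = Math.max(1, timeExtraction(docs, plainText));
        long richNanos = Math.max(1, timeExtraction(docs, richText));

        System.out.printf("plain: %.0f docs/min%n", docsPerMinute(docs.size(), plainNanos));
        System.out.printf("rich:  %.0f docs/min%n", docsPerMinute(docs.size(), richNanos));
    }
}
```

With the figures from the original message, 1000 docs per minute against 200 docs per minute is the 5x gap in question. A per-method breakdown from a profiler would still be needed to see where the extra time goes.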


Nick
