RE: Performance problems with Tika 1.5 and Microsoft Office docx files

Nick Burch Tue, 11 Mar 2014 22:19:27 -0700

On Tue, 11 Mar 2014, Uwe Schindler wrote:

In my opinion, the MS Office parsers should work similarly to the OpenDocument parsers (which use more or less just some mapping of element names to HTML element names in a completely SAX-based way). But I expect the Microweich schemas not to be as simple as OpenDocument's, so simple mappings between element names and namespaces would not be so easy. :-)


If only that were true, we could give up on the whole POI project... ;-)

But we should give it a try! In that case we don't need to transform the DOM tree into a big object structure (which is useless overhead because we just want to extract text and some formatting + metadata).

You'd probably want to do something similar to the way we do .xlsx files. Load the smaller parts which need random access via DOM (e.g. styles, headers, footers), then process the main (large) body via SAX. Call out to the smaller DOM-based bits when you need to fill those bits in.
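
As a rough illustration of that split, here is a minimal sketch using only the JDK's XML APIs. It is not how POI or Tika actually do it: the class name HybridExtractSketch and the simplified `styles`/`style`/`p`/`t` markup are made up for the example and are not the real OOXML schema. The small styles part is loaded via DOM into a lookup map, and the body is streamed via SAX, consulting the map for each paragraph.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: element and attribute names are invented, not OOXML.
class HybridExtractSketch {

    // Small part: parse fully with DOM so it can be looked up at random later.
    static Map<String, String> loadStyles(String stylesXml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(stylesXml)))
                    .getElementsByTagName("style");
            Map<String, String> styles = new HashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                styles.put(e.getAttribute("id"), e.getAttribute("name"));
            }
            return styles;
        } catch (Exception e) {
            throw new IllegalStateException("could not read styles part", e);
        }
    }

    // Large part: stream with SAX and call out to the DOM-derived map as needed.
    static String extractBody(String bodyXml, Map<String, String> styles) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("p")) {
                    // Resolve the paragraph's style id against the small DOM-loaded part.
                    String style = styles.getOrDefault(atts.getValue("style"), "Normal");
                    out.append('[').append(style).append("] ");
                } else if (qName.equals("t")) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("t")) {
                    inText = false;
                } else if (qName.equals("p")) {
                    out.append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only text inside <t> is emitted; nothing is kept as an object tree.
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(bodyXml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not stream body part", e);
        }
        return out.toString();
    }
}
```

Calling extractBody(body, loadStyles(styles)) would emit one line per paragraph, prefixed with the resolved style name. The point of the design is that memory use for the body stays roughly constant regardless of document size, while the parts that need random access remain cheap to query because they are small.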

I probably don't have time right now to help code, but I can offer advice on it. [email protected] would probably be the best place to discuss the approach.


Nick
