In my opinion, the MS Office parsers should work similarly to the OpenDocument parser (which more or less just maps element names to HTML element names in a completely SAX-based way). But I expect the Microweich schemas are not as simple as OpenDocument, so simple mappings between element names and namespaces would not be so easy. :-)
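
A minimal sketch of such an element-name mapping, for illustration only. This is not the actual Tika OpenDocument parser code: the class name `ElementMappingSketch`, the `convert` helper, and the small mapping table are all hypothetical, and the choice of elements is an assumption. It only shows the idea of renaming known elements and streaming the text through SAX callbacks without building a DOM.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: rename a few known elements to HTML elements and
// pass character data through, all inside SAX callbacks (no DOM tree).
final class ElementMappingSketch extends DefaultHandler {
    private static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    private static final String TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";

    // Illustrative table: "{namespace}localName" -> HTML element name.
    private static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put(key(TEXT_NS, "p"), "p");
        MAPPING.put(key(TEXT_NS, "h"), "h1"); // simplification: outline level ignored
        MAPPING.put(key(TABLE_NS, "table"), "table");
        MAPPING.put(key(TABLE_NS, "table-row"), "tr");
        MAPPING.put(key(TABLE_NS, "table-cell"), "td");
    }

    private final StringBuilder out = new StringBuilder();

    private static String key(String uri, String localName) {
        return "{" + uri + "}" + localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String html = MAPPING.get(key(uri, localName));
        if (html != null) {
            out.append('<').append(html).append('>');
        }
        // Unmapped elements produce no tag; their text still passes through.
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String html = MAPPING.get(key(uri, localName));
        if (html != null) {
            out.append("</").append(html).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Escape the characters that are special in HTML text content.
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '>') {
                out.append("&gt;");
            } else {
                out.append(c);
            }
        }
    }

    // Parses the given XML string and returns the mapped HTML fragment.
    static String convert(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to get namespace URI + local name
        ElementMappingSketch handler = new ElementMappingSketch();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }
}
```

Calling `ElementMappingSketch.convert(...)` on a fragment containing a `text:p` element would yield a `<p>` element with the same text. The same handler serves text documents and spreadsheets alike, because it only looks at element names; the OOXML formats would presumably need more state than a flat lookup table.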
But we should give it a try! In that case we don’t need to transform the DOM tree into a big object structure (which is useless overhead, because we just want to extract text plus some formatting and metadata). The OpenDocument parser is so elegant, and it works with any OpenDocument type (it does not even differentiate between word processor documents, spreadsheets, or presentations; all are handled by the same class).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Nick Burch [mailto:[email protected]]
> Sent: Tuesday, March 11, 2014 2:46 PM
> To: [email protected]
> Subject: Re: Performance problems with Tika 1.5 and Microsoft Office docx files
>
> On Tue, 11 Mar 2014, Mirko Sertic wrote:
> > Thanks for the reply. Is it possible to fine tune the office parser in
> > some way?
>
> Only with a re-write...
>
> The Excel XLSX parser was some time ago re-written to largely be SAX based.
> PPTX and DOCX remain DOM based parsing.
>
> Nick
