RE: Performance problems with Tika 1.5 and Microsoft Office docx files

Uwe Schindler Tue, 11 Mar 2014 12:47:50 -0700

In my opinion, the MS Office parsers should work similarly to the OpenDocument
parser (which more or less just maps element names to HTML element names in a
completely SAX-based way). But I expect the Microweich schemas to be less
simple than OpenDocument, so simple mappings between element names and
namespaces would not be so easy. :-)

But we should give it a try! In that case we would not need to transform the
DOM tree into a big object structure (which is useless overhead, because we
just want to extract text plus some formatting and metadata). The OpenDocument
parser is so elegant, and it works with any OpenDocument type (it does not even
differentiate between word processing documents, spreadsheets, and
presentations; all are handled by the same class).
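
To make the idea concrete, here is a rough sketch of what such a SAX-based
element name mapping could look like. It is not the actual Tika code: the class
name SaxMappingSketch, the namespace URI "urn:example:text" and the mapping
table are all made up for illustration, and text is not HTML-escaped.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxMappingSketch {
    // Hypothetical namespace URI and mapping table (assumptions, not Tika's).
    static final String TEXT_NS = "urn:example:text";
    static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put("p", "p");
        MAPPING.put("h", "h1");
        MAPPING.put("list-item", "li");
    }

    // Returns the HTML element name for a source element, or null if unmapped.
    static String htmlNameFor(String uri, String localName) {
        return TEXT_NS.equals(uri) ? MAPPING.get(localName) : null;
    }

    // Streams the XML once and emits HTML tags for mapped elements only;
    // no DOM tree or intermediate object model is built.
    static String toHtml(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        final StringBuilder out = new StringBuilder();

        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private int depth = 0; // greater than 0 while inside a mapped element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                String html = htmlNameFor(uri, local);
                if (html != null) {
                    out.append('<').append(html).append('>');
                    depth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String html = htmlNameFor(uri, local);
                if (html != null) {
                    out.append("</").append(html).append('>');
                    depth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Keep text only inside mapped elements (unescaped in this sketch).
                if (depth > 0) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc xmlns:text=\"urn:example:text\">"
                + "<text:h>Title</text:h><text:p>Hello</text:p></doc>";
        System.out.println(toHtml(xml)); // prints <h1>Title</h1><p>Hello</p>
    }
}
```

The point is that the handler only needs a lookup table and a counter, so
memory use stays flat no matter how large the document is. For the MS Office
schemas the table would presumably be larger and need more context than a plain
name lookup.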

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Nick Burch [mailto:[email protected]]
> Sent: Tuesday, March 11, 2014 2:46 PM
> To: [email protected]
> Subject: Re: Performance problems with Tika 1.5 and Microsoft Office docx
> files
> 
> On Tue, 11 Mar 2014, Mirko Sertic wrote:
> > Thanks for the reply. Is it possible to fine tune the office parser in
> > some way?
> 
> Only with a re-write...
> 
> The Excel XLSX parser was re-written some time ago to be largely SAX based.
> The PPTX and DOCX parsers remain DOM based.
> 
> Nick
