On Tue, 11 Mar 2014, Uwe Schindler wrote:
In my opinion, the MS Office parsers should work similarly to the OpenDocument parsers (which more or less just map element names to HTML element names in a completely SAX-based way). But I expect the Microweich schemas not to be as simple as OpenDocument's, so simple mappings between element names and namespaces would not be so easy. :-)

If only that were true, we could give up on the whole POI project... ;-)

But we should give it a try! In that case we don’t need to transform the DOM tree into a big object structure (which is useless overhead because we just want to extract text and some formatting + metadata).

You'd probably want to do something similar to the way we do .xlsx files. Load the smaller parts which need random access via DOM (e.g. styles, headers, footers), then process the main (large) body via SAX. Call out to the smaller DOM-based parts when you need to fill those bits in.
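
As a rough illustration of that split, here is a minimal sketch. It is not POI code: it is written in Python using only the standard library, and the element names (`styles`, `style`, `body`, `p`), the sample XML, and the helper names (`load_styles`, `BodyHandler`, `extract`) are all made up for the example and do not follow the real OOXML schemas. The small part is loaded with DOM and indexed for random access; the large part is streamed with SAX and looks things up in that index as it goes.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical small part (think styles): loaded whole, since we need
# random access to it by id while reading the body.
STYLES_XML = """<styles>
  <style id="s1" name="Heading"/>
  <style id="s2" name="Body"/>
</styles>"""

# Hypothetical large part (think main document body): streamed, never
# held in memory as a tree.
BODY_XML = """<body>
  <p style="s1">Title</p>
  <p style="s2">First paragraph.</p>
  <p>No style here.</p>
</body>"""


def load_styles(xml_text):
    """Parse the small part with DOM and index it by id."""
    dom = minidom.parseString(xml_text)
    return {
        el.getAttribute("id"): el.getAttribute("name")
        for el in dom.getElementsByTagName("style")
    }


class BodyHandler(xml.sax.ContentHandler):
    """Stream the body, calling out to the DOM-derived index per paragraph."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles
        self.paragraphs = []  # list of (style name or None, text)
        self._style = None
        self._text = None     # None while we are outside a <p>

    def startElement(self, name, attrs):
        if name == "p":
            # Resolve the style id against the small, preloaded part.
            self._style = self.styles.get(attrs.get("style"))
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "p":
            self.paragraphs.append((self._style, "".join(self._text)))
            self._text = None


def extract(styles_xml, body_xml):
    """Load styles via DOM, then stream the body via SAX."""
    handler = BodyHandler(load_styles(styles_xml))
    xml.sax.parseString(body_xml.encode("utf-8"), handler)
    return handler.paragraphs
```

Calling `extract(STYLES_XML, BODY_XML)` yields one `(style name, text)` pair per paragraph. Only the style index stays in memory; the body is never built into a tree, which is the point of the approach.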

I probably don't have time right now to help code, but I can offer advice on it. [email protected] would probably be the best place to discuss the approach.

Nick
