So here's the scenario...

We convert Microsoft Word documents to DOCX using LibreOffice (clunky), but
some of the resulting files are unreadable because the XML contains invalid
UTF-8 characters that neither XML 1.0 nor XML 1.1 allows.

LibreOffice does not care, but we need to read these documents into POI.
Short of unpacking the archive and editing the affected XML files in the
container by hand, is there a way to edit the PackagePart data for the
relevant parts? (The problem occurs most frequently in word/document.xml.)
It is not clear to me from the PackagePart API how to read a part's XML
into memory, edit it, and then write it back to the part.
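
To make the "read, edit, re-write" step concrete, here is a minimal sketch of
the clean-up itself, using only the JDK. The class and method names
(XmlSanitizer, stripInvalidXmlChars, sanitize) are made up for illustration.
It assumes the part is UTF-8 encoded and that the offending characters appear
literally in the text; it does not touch numeric character references such as
&#x1;, and it only checks against the XML 1.0 character range.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI.
final class XmlSanitizer {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every code point that XML 1.0 does not allow.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Reads a whole stream as UTF-8 (assumption) and returns cleaned bytes.
    // Malformed byte sequences are decoded as U+FFFD, which XML allows.
    static byte[] sanitize(InputStream in) throws IOException {
        byte[] raw = in.readAllBytes(); // requires Java 9 or later
        String text = new String(raw, StandardCharsets.UTF_8);
        return stripInvalidXmlChars(text).getBytes(StandardCharsets.UTF_8);
    }
}
```

The idea would be to feed this the stream from PackagePart.getInputStream()
and write the returned bytes to the stream from PackagePart.getOutputStream()
before saving the package. Whether POI will open the package at all before
word/document.xml has been cleaned is something I have not verified.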

Any recommendations are welcome on how to approach this.

Thanks
