Well aware that I need to fix it at the byte level, but thanks. XD My problem was not wholly understanding how the OPCPackage API and its associated parts worked.
Thanks again.

On Mon, Apr 27, 2015 at 11:49 AM, Nick Burch <[email protected]> wrote:

> On Mon, 27 Apr 2015, Michael Nguyen wrote:
>
>> We convert MS-WORD documents to DOCX using LibreOffice (clunky), but some
>> files are unreadable because they contain invalid UTF-8 characters in the
>> XML that version 1.0 and 1.1 of XML do not like.
>
> Your best long term fix is to report the bug to Apache OpenOffice, get it
> fixed there, then wait for LibreOffice to accept the fix.
>
>> LibreOffice does not care, but we need to read these documents into POI.
>> Short of disassembling the archive file and editing the appropriate XML
>> files in the container, I was wondering if there was a way to edit the
>> PackagePart data for the relevant bits (it's the word/document.xml this is
>> occurring in most frequently). The PackagePart API makes it unclear how to
>> read the XML into memory and edit, then re-write to the part.
>
> Once you have a PackagePart, call getInputStream() to read the contents.
> Work through that, updating / fixing things as you want. Possibly use
> IOUtils to get the stream as a byte array. When done, call
> getOutputStream() and write the new contents into it, then save the
> overall package.
>
> If you have invalid XML, you can't fix it at the XML level, you'll need to
> fix it at the byte level.
>
> Nick
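A rough sketch of the byte-level cleanup described above, using only the JDK so it does not depend on any particular POI version. The class name `XmlByteSanitizer` and its methods are hypothetical, and the policy is an assumption: the bytes are decoded as UTF-8 (malformed sequences become U+FFFD) and any code point outside the XML 1.0 `Char` range is dropped. It does not touch numeric character references such as `&#x1;`, which would need separate handling.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI.
final class XmlByteSanitizer {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Decodes as UTF-8 (assumption), drops illegal code points, re-encodes.
    static byte[] sanitize(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        StringBuilder clean = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXml10(cp)) {
                clean.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return clean.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Reads the whole stream into memory before anything is written back.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Convenience wrapper: read everything, clean it, write the result.
    static void copySanitized(InputStream in, OutputStream out) throws IOException {
        out.write(sanitize(readAll(in)));
    }
}
```

With POI, the idea would be to read the part via `getInputStream()` into a byte array with `readAll`, close that stream, then write `sanitize(bytes)` to the stream from `getOutputStream()` and save the package, as in Nick's reply. I have not run this against a real DOCX.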
