Thank you Maruan,

Apologies for the noise. I have now resolved this. I had simplified my code for the examples I gave in the email. The issue is not with PDFBox; rather, a 3rd party library that processed the string "Çâmára Münícìpål de Matelâñdia" before it reached PDFBox was mangling it.
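For anyone who hits the same symptom: below is a minimal sketch of one common way this kind of mangling can arise, namely text being pushed through a charset that cannot represent the accented characters. This is an assumption for illustration only; I am not claiming it is what the 3rd party library does internally. The class and method names (EncodingRoundTrip, roundTrip) are made up for the example, and only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Encode the text to bytes with the given charset, then decode those
    // bytes again with the same charset.
    public static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte ('?' for US-ASCII) for every character it cannot map.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The publisher string from this thread, written with Unicode escapes
        // so the example does not depend on the source-file encoding.
        String publisher =
            "\u00C7\u00E2m\u00E1ra M\u00FCn\u00EDc\u00ECp\u00E5l de Matel\u00E2\u00F1dia";

        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(roundTrip(publisher, StandardCharsets.UTF_8).equals(publisher));

        // US-ASCII cannot represent the accented characters, so each one
        // is replaced by '?'.
        System.out.println(roundTrip(publisher, StandardCharsets.US_ASCII));
    }
}
```

The second line printed has the same shape as the value I saw in the dumped XML, which is why a lossy charset conversion upstream of PDFBox is worth checking for first.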
Thanks again.
Adam.

On 19 July 2016 at 08:13, Maruan Sahyoun <[email protected]> wrote:
> Hi,
>
>> On 18.07.2016 at 14:15, Adam Retter <[email protected]> wrote:
>>
>> Using pdf-box-2.0.2:
>>
>> I am trying to set dc:publisher to "Çâmára Münícìpål de Matelâñdia" in
>> the metadata of my PDF however my diacritical characters seem to get
>> mangled when I try and read the PDF back.
>>
>> My writing code looks like:
>>
>> PDDocument doc = ...
>> PDDocumentCatalog catalog = ...
>>
>> PDMetadata metadataStream = Optional.ofNullable(catalog.getMetadata())
>>         .orElseGet(() -> new PDMetadata(doc));
>> XMPMetadata xmpMetadata = null;
>> try (COSInputStream is = metadataStream.createInputStream()) {
>>     xmpMetadata = new DomXmpParser().parse(is);
>> } catch (XmpParsingException e) {
>>     LOG.warn(e);
>>     xmpMetadata = XMPMetadata.createXMPMetadata();
>> }
>> DublinCoreSchema dcMetadata = xmpMetadata.createAndAddDublinCoreSchema();
>> dcMetadata.addPublisher("Çâmára Münícìpål de Matelâñdia");
>> catalog.setMetadata(metadataStream);
>> ByteArrayOutputStream baos = new ByteArrayOutputStream();
>> XmpSerializer serializer = new XmpSerializer();
>> serializer.serialize(xmpMetadata, baos, false);
>> metadataStream.importXMPMetadata(baos.toByteArray());
>>
>>
>> My reading code looks like:
>>
>> PDDocument doc = PDDocument.load(is);
>> PDDocumentCatalog catalog = doc.getDocumentCatalog();
>> PDMetadata metadata = catalog.getMetadata();
>> try (InputStream xmpIs = metadata.createInputStream()) {
>>     Files.copy(xmpIs, Paths.get("/tmp/metadata.xml"));
>> }
>>
>>
>> However in the output XML I am seeing this:
>>
>> <dc:publisher>
>>   <rdf:Bag>
>>     <rdf:li>??m?ra M?n?c?p?l de Matel??dia</rdf:li>
>>   </rdf:Bag>
>> </dc:publisher>
>>
>
> I've tested various ways of saving the file (yours, serializing to a
> FileOutputStream, …) and all work when viewing the content in a browser or
> a text editor.
>
> <dc:publisher>
>   <rdf:Bag>
>     <rdf:li>Çâmára Münícìpål de Matelâñdia</rdf:li>
>   </rdf:Bag>
> </dc:publisher>
>
> Where do you see that string?
>
> BR
> Maruan
>
>> So I guess something is up with the character encoding somewhere? Is
>> this something I am doing incorrectly, perhaps I need to specify UTF-8
>> somewhere (my character set)? Or is this a bug in pdf-box?
>>
>> Cheers Adam.
>>
>> --
>> Adam Retter
>>
>> skype: adam.retter
>> tweet: adamretter
>> http://www.adamretter.org.uk
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]

