Merging multiple different subsets of the same font, or re-embedding font

Craig Ringer Sun, 11 Dec 2011 06:05:16 -0800

Hi folks

I'm new here and to PDFBox - I've been looking at it mainly because it's used by Apache FOP to embed PDFs in other PDFs during XSL-FO typesetting. I'm using it to produce classified advertising pages, and I've run into a bit of a roadblock that Google and searching the FOP and PDFBox mailing lists haven't helped with.

My documents contain 500-1000 small PDF files embedded as form XObjects into the master PDF file. Each of the original files has its fonts included as an embedded subset. Since many of the documents use different sets of glyphs from the same fonts, and the whole PDF is copied into the new document, I land up with hundreds of copies of common fonts like "Helvetica Bold (subset)" in the final document. A check with Acrobat Pro suggests that over 90% of the document's size is embedded fonts.

What I'm looking for is a way to more intelligently merge the documents to reduce or eliminate this font duplication. I'd like to:

- Embed a whole, non-subset copy of a font if it's available locally, and then change all references in documents I'm including as XObjects so they refer to the new copy I've embedded (so long as the encodings match); or even better

- As each document is embedded as an XObject into the main document, build a list of which glyphs its embedded fonts define. Don't import the font embeds; instead leave a dangling indirect reference to a font we're yet to define. When all documents are embedded, produce and embed a new subset using a local copy of the complete font, including only the glyphs that are actually used.

Better again would be to extract all the embedded subsets and *combine* them, so I wouldn't need a local copy of the font. That's probably way too hard, though.

I realise that I can never de-duplicate embedded subsets with different encodings. If there's "Helvetica Black" embedded 3 times, once each in WinAnsi, MacRoman and a custom encoding, there's no possible reduction without re-encoding the content streams, which is WAY beyond what I want to tackle. All I'm interested in is improving the case of 100 copies of "Helvetica Black (subset)" in WinAnsi, which I want to reduce to one slightly bigger embedded subset covering all the same glyphs or, failing that, a complete copy of the font.
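
To make the bookkeeping concrete, here is a rough sketch in plain Java of the grouping step described above: collect the glyphs each embedded subset defines, then union them per (font, encoding) pair so that subsets with different encodings are never merged. It makes no PDFBox calls at all; `SubsetFont`, `SubsetMerger` and their methods are hypothetical names invented for illustration, and it assumes the glyph names and encoding name have already been read out of each source PDF somehow. The one PDF-level detail it relies on is that subset fonts are named with a six-uppercase-letter tag and a plus sign in front of the real name (e.g. "ABCDEF+Helvetica-Black").

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** One embedded subset found in a source PDF (hypothetical holder, not a PDFBox type). */
final class SubsetFont {
    final String baseFontName;    // e.g. "ABCDEF+Helvetica-Black"
    final String encoding;        // e.g. "WinAnsiEncoding"
    final Set<String> glyphNames; // glyphs this subset defines

    SubsetFont(String baseFontName, String encoding, Set<String> glyphNames) {
        this.baseFontName = baseFontName;
        this.encoding = encoding;
        this.glyphNames = glyphNames;
    }
}

final class SubsetMerger {
    /** Strips a leading six-uppercase-letter subset tag ("ABCDEF+") if present. */
    static String stripSubsetTag(String name) {
        if (name.length() > 7 && name.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = name.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return name; // not a subset tag, leave the name alone
                }
            }
            return name.substring(7);
        }
        return name;
    }

    /**
     * Groups subsets by "fontName/encoding" and unions their glyph names.
     * Subsets of the same font with different encodings stay in separate groups.
     */
    static Map<String, Set<String>> mergeGlyphSets(List<SubsetFont> fonts) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (SubsetFont f : fonts) {
            String key = stripSubsetTag(f.baseFontName) + "/" + f.encoding;
            merged.computeIfAbsent(key, k -> new TreeSet<>()).addAll(f.glyphNames);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<SubsetFont> fonts = List.of(
            new SubsetFont("ABCDEF+Helvetica-Black", "WinAnsiEncoding", Set.of("A", "B")),
            new SubsetFont("GHIJKL+Helvetica-Black", "WinAnsiEncoding", Set.of("B", "C")),
            new SubsetFont("MNOPQR+Helvetica-Black", "MacRomanEncoding", Set.of("A")));
        // Each entry is one subset that would need to be generated and embedded.
        mergeGlyphSets(fonts).forEach((key, glyphs) -> System.out.println(key + " -> " + glyphs));
    }
}
```

Each key in the resulting map corresponds to one font that would have to be embedded in the master document, either as a new subset covering the unioned glyphs or as a complete copy. Actually producing that subset and repointing the XObjects' font references is the part this sketch does not attempt.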


Ideas? Is this completely insane, or possibly practical?

The docs for PDFBox offer nearly zero information on its font APIs, so I presume I need to go delving directly into the PDF font data structures to do any of this. I know the PDF format's low level structure quite well, but know nearly nothing about the embedded font formats or their encodings, so I'm *really* hoping PDFBox offers some helpers for fonts that just aren't referenced in the docs. Any tips?

Is there anything built-in for creating custom font subsets given a glyph list? For unembedding fonts?


Anybody tried anything like this already?

Tips/suggestions?

--
Craig Ringer

POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088     Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/
