Hi folks

I'm new here and to PDFBox - I've been looking at it mainly because it's used by Apache FOP to embed PDFs in other PDFs during XSL-FO typesetting. I'm using it to produce classified advertising pages, and I've run into a bit of a roadblock that Google and searching the FOP and PDFBox mailing lists haven't helped with.

My documents contain 500-1000 small PDF files embedded as form XObjects into the master PDF file. Each of the original files has its fonts included as an embedded subset. Since many of the documents use different sets of glyphs from the same fonts, and the whole PDF is copied into the new document, I land up with hundreds of copies of common fonts like "Helvetica Bold (subset)" in the final document. A check with Acrobat Pro suggests that over 90% of the document's size is embedded fonts.

What I'm looking for is a way to more intelligently merge the documents to reduce or eliminate this font duplication. I'd like to:

- Embed a whole, non-subset copy of a font if it's available locally, and then change all references in documents I'm including as XObjects so they refer to the new copy I've embedded (so long as the encodings match); or even better

- As each document is embedded as an XObject into the main document, build a list of which glyphs its embedded fonts define. Don't import the font embeds; instead, leave a dangling indirect reference to a font we're yet to define. When all documents are embedded, produce and embed a new subset using a local copy of the complete font, including only the glyphs that're actually used.

Better again would be to extract all the embedded subsets and *combine* them, so I wouldn't need a local copy of the font. That's probably way too hard, though.

I realise that I can never de-duplicate embedded subsets with different encodings. If there's "Helvetica Black" embedded 3 times, once each in WinAnsi, MacRoman and a custom encoding, there's no possible reduction without re-encoding the content streams, which is WAY beyond what I want to tackle. All I'm interested in is improving the case of 100 copies of "Helvetica Black (subset)" in WinAnsi, which I want to reduce to one slightly bigger embedded subset covering all the same glyphs or, failing that, a complete copy of the font.
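
To make the bookkeeping half of this concrete, here's a rough sketch in plain Java of what I have in mind. It makes no PDFBox calls at all; the class SubsetUsage and its methods are names I've made up for illustration, and "Helvetica-Black" is just an example font name. It assumes simple fonts with a named encoding (e.g. WinAnsiEncoding) and no Differences array, so that equal character codes mean equal glyphs. It relies on the PDF convention that a subset's font name carries a tag of six uppercase letters and a plus sign in front of the real name.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical bookkeeping helper; not part of PDFBox or FOP. */
class SubsetUsage {
    // Key: "<base font name>/<encoding name>"; value: union of character codes seen.
    private final Map<String, Set<Integer>> used = new TreeMap<>();

    /** Strips a subset tag such as "ABCDEF+" from "ABCDEF+Helvetica-Black". */
    static String baseName(String fontName) {
        if (fontName.length() > 7 && fontName.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = fontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return fontName; // not a subset tag, leave the name alone
                }
            }
            return fontName.substring(7);
        }
        return fontName;
    }

    /** Records the character codes one embedded subset covers. */
    void record(String fontName, String encoding, int... codes) {
        Set<Integer> set = used.computeIfAbsent(
                baseName(fontName) + "/" + encoding, k -> new TreeSet<>());
        for (int code : codes) {
            set.add(code);
        }
    }

    /** Codes the merged subset for this font and encoding would need to cover. */
    Set<Integer> codesFor(String baseFontName, String encoding) {
        return used.getOrDefault(baseFontName + "/" + encoding, new TreeSet<>());
    }

    /** Number of distinct (font, encoding) groups, i.e. fonts left after merging. */
    int groupCount() {
        return used.size();
    }
}
```

So recording two WinAnsi subsets of the same font would leave one group holding the union of their codes, while a MacRoman subset of that font would stay in a group of its own. The part I don't know how to do is everything after that: actually building the new subset and fixing up the references.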

Ideas? Is this completely insane, or possibly practical?

The docs for PDFBox offer nearly zero information on its font APIs, so I presume I need to go delving directly into the PDF font data structures to do any of this. I know the PDF format's low level structure quite well, but know nearly nothing about the embedded font formats or their encodings, so I'm *really* hoping PDFBox offers some helpers for fonts that just aren't referenced in the docs. Any tips?

Is there anything built-in for creating custom font subsets given a glyph list? For unembedding fonts?

Anybody tried anything like this already?

Tips/suggestions?

--
Craig Ringer

POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088     Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/
