Hi folks
I'm new here and to PDFBox - I've been looking at it mainly because it's
used by Apache FOP to embed PDFs in other PDFs during XSL-FO
typesetting. I'm using it to produce classified advertising pages, and
I've run into a bit of a roadblock that neither Google nor searching the
FOP and PDFBox mailing lists has helped with.
My documents contain 500-1000 small PDF files embedded as form XObjects
into the master PDF file. Each of the original files has its fonts
included as an embedded subset. Since many of the documents use
different sets of glyphs from the same fonts, and the whole PDF is
copied into the new document, I land up with hundreds of copies of
common fonts like "Helvetica Bold (subset)" in the final document. A
check with Acrobat Pro suggests that over 90% of the document's size is
embedded fonts.
What I'm looking for is a way to more intelligently merge the documents
to reduce or eliminate this font duplication. I'd like to:
- Embed a whole, non-subset copy of a font if it's available locally, and
then change all references in documents I'm including as XObjects so
they refer to the new copy I've embedded (so long as the encodings
match); or even better
- As each document is embedded as an XObject into the main document,
build a list of which glyphs its embedded fonts define. Don't import the
font embeds, instead leave a dangling indirect reference to a font we're
yet to define. When all documents are embedded, produce and embed a new
subset using a local copy of the complete font, including only the
glyphs that are actually used.
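To make the second option concrete, here is a rough sketch of just the bookkeeping side: collecting, per font, the union of glyph names seen across all the embedded documents. It is plain Java with no PDFBox calls; the class and method names (GlyphUsageCollector, record, glyphsFor) are made up for illustration, and how the glyph names get pulled out of each embedded subset is exactly the part left open. The one thing taken from the PDF format itself is that subset font names carry a tag of six uppercase letters plus "+", as in "ABCDEF+Helvetica-Bold".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: tracks which glyphs each font needs across all documents. */
final class GlyphUsageCollector {
    // Base font name (subset tag removed) -> union of glyph names seen so far.
    private final Map<String, Set<String>> usage = new LinkedHashMap<>();

    /** Removes a PDF subset tag (six uppercase letters followed by '+'), if present. */
    static String stripSubsetTag(String fontName) {
        if (fontName.matches("[A-Z]{6}\\+.+")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    /** Records the glyphs that one embedded subset defines. */
    void record(String fontName, Iterable<String> glyphNames) {
        Set<String> glyphs =
                usage.computeIfAbsent(stripSubsetTag(fontName), k -> new TreeSet<>());
        for (String glyph : glyphNames) {
            glyphs.add(glyph);
        }
    }

    /** Returns every glyph recorded for the font, or an empty set if none. */
    Set<String> glyphsFor(String baseFontName) {
        return usage.getOrDefault(baseFontName, Set.of());
    }

    public static void main(String[] args) {
        GlyphUsageCollector collector = new GlyphUsageCollector();
        // Two ads using different glyphs from the same font.
        collector.record("ABCDEF+Helvetica-Bold", List.of("A", "d", "space"));
        collector.record("GHIJKL+Helvetica-Bold", List.of("A", "s", "space"));
        System.out.println(collector.glyphsFor("Helvetica-Bold"));
    }
}
```

Once every document has been imported, the set returned by glyphsFor would be the glyph list to hand to whatever produces the single combined subset.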
Better again would be to extract all the embedded subsets and *combine*
them, so I wouldn't need a local copy of the font. That's probably way
too hard, though.
I realise that I can never de-duplicate embedded subsets with different
encodings. If there's "Helvetica Black" embedded 3 times, once each in
WinAnsi, MacRoman and a custom encoding, there's no possible reduction
without re-encoding the content streams, which is WAY beyond what I want
to tackle. All I'm interested in is improving the case of 100 copies of
"Helvetica Black (subset)" in WinAnsi, which I want to reduce to one
slightly bigger embedded subset covering all the same glyphs or, failing
that, a complete copy of the font.
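The encoding constraint above amounts to grouping embedded fonts by the pair (base font name, encoding) and only ever merging within a group. A minimal sketch of that grouping, again in plain Java with invented names (SubsetGrouper, EmbeddedFont) and with the font name and encoding assumed to have been read from each embedded document already:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds embedded fonts that could share one copy. */
final class SubsetGrouper {
    /** Illustrative stand-in for one embedded font; not a PDFBox type. */
    static final class EmbeddedFont {
        final String baseName;  // name with any subset tag already removed
        final String encoding;  // e.g. "WinAnsiEncoding"

        EmbeddedFont(String baseName, String encoding) {
            this.baseName = baseName;
            this.encoding = encoding;
        }
    }

    /** Groups fonts so that only same-name, same-encoding copies end up together. */
    static Map<String, List<EmbeddedFont>> group(List<EmbeddedFont> fonts) {
        Map<String, List<EmbeddedFont>> groups = new LinkedHashMap<>();
        for (EmbeddedFont font : fonts) {
            String key = font.baseName + "|" + font.encoding;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(font);
        }
        return groups;
    }

    /** Counts the copies that would disappear if each group kept a single font. */
    static int redundantCopies(Map<String, List<EmbeddedFont>> groups) {
        int redundant = 0;
        for (List<EmbeddedFont> members : groups.values()) {
            redundant += members.size() - 1;
        }
        return redundant;
    }

    public static void main(String[] args) {
        List<EmbeddedFont> fonts = List.of(
                new EmbeddedFont("Helvetica-Black", "WinAnsiEncoding"),
                new EmbeddedFont("Helvetica-Black", "WinAnsiEncoding"),
                new EmbeddedFont("Helvetica-Black", "MacRomanEncoding"));
        Map<String, List<EmbeddedFont>> groups = group(fonts);
        System.out.println(groups.size() + " groups, "
                + redundantCopies(groups) + " redundant copies");
    }
}
```

The MacRoman copy stays in its own group, so it is left alone; only the two WinAnsi copies are candidates for merging.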
Ideas? Is this completely insane, or possibly practical?
The docs for PDFBox offer nearly zero information on its font APIs, so I
presume I need to go delving directly into the PDF font data structures
to do any of this. I know the PDF format's low level structure quite
well, but know nearly nothing about the embedded font formats or their
encodings, so I'm *really* hoping PDFBox offers some helpers for fonts
that just aren't referenced in the docs. Any tips?
Is there anything built-in for creating custom font subsets given a
glyph list? For unembedding fonts?
Anybody tried anything like this already?
Tips/suggestions?
--
Craig Ringer
POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088 Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/