Hi Craig,

We're looking into this exact same problem. I'll let you know if anything comes of it.
Mehdi

On 11 December 2011 13:31, Craig Ringer <[email protected]> wrote:
> Hi folks
>
> I'm new here and to pdfbox - I've been looking at it mainly because it's
> used by Apache FOP to embed PDFs in other PDFs during XSL-FO typesetting.
> I'm using it to produce classified advertising pages, and I've run into a
> bit of a roadblock that Google and searching the fop and pdfbox mailing
> lists haven't helped with.
>
> My documents contain 500-1000 small PDF files embedded as form XObjects into
> the master PDF file. Each of the original files has its fonts included as an
> embedded subset. Since many of the documents use different sets of glyphs
> from the same fonts, and the whole PDF is copied into the new document, I
> land up with hundreds of copies of common fonts like "Helvetica Bold
> (subset)" in the final document. A check with Acrobat Pro suggests that over
> 90% of the document's size is embedded fonts.
>
> What I'm looking for is a way to more intelligently merge the documents to
> reduce or eliminate this font duplication. I'd like to:
>
> - Embed a whole, non-subset copy of a font if it's available locally, and
> then change all references in documents I'm including as XObjects so they
> refer to the new copy I've embedded (so long as the encodings match); or
> even better
>
> - As each document is embedded as an XObject into the main document, build a
> list of which glyphs its embedded fonts define. Don't import the font
> embeds, instead leave a dangling indirect reference to a font we're yet to
> define. When all documents are embedded, produce and embed a new subset
> using a local copy of the complete font, including only the glyphs that're
> actually used.
>
> Better again would be to extract all the embedded subsets and *combine*
> them, so I wouldn't need a local copy of the font. That's probably way too
> hard, though.
>
> I realise that I can never de-duplicate embedded subsets with different
> encodings. If there's "Helvetica Black" embedded 3 times, once each in
> WinAnsi, MacRoman and a custom encoding, there's no possible reduction
> without re-encoding the content streams, which is WAY beyond what I want to
> tackle. All I'm interested in is improving the case of 100 copies of
> "Helvetica Black (subset)" in WinAnsi, which I want to reduce to one
> slightly bigger embedded subset covering all the same glyphs or failing that
> a complete copy of the font.
>
> Ideas? Is this completely insane, or possibly practical?
>
> The docs for PDFBox offer nearly zero information on its font APIs, so I
> presume I need to go delving directly into the PDF font data structures to
> do any of this. I know the PDF format's low level structure quite well, but
> know nearly nothing about the embedded font formats or their encodings, so
> I'm *really* hoping PDFBox offers some helpers for fonts that just aren't
> referenced in the docs. Any tips?
>
> Is there anything built-in for creating custom font subsets given a glyph
> list? For unembedding fonts?
>
> Anybody tried anything like this already?
>
> Tips/suggestions?
>
> --
> Craig Ringer
>
> POST Newspapers
> 276 Onslow Rd, Shenton Park
> Ph: 08 9381 3088 Fax: 08 9388 2258
> ABN: 50 008 917 717
> http://www.postnewspapers.com.au/
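
For what it's worth, the bookkeeping half of the second idea above (collect the glyphs each embedded subset defines, then group them per font and encoding) can be sketched without touching any PDF library. The following is only a rough illustration in Python, not PDFBox code: the function names, the tuple layout and the sample data are all made up for the example, and it assumes the glyph names have already been pulled out of each embedded font somehow. The one thing it relies on from the PDF format is that subset fonts are named with a six-uppercase-letter tag and a "+" in front of the real font name.

```python
import re
from collections import defaultdict

# Subset fonts in a PDF are named "<six uppercase letters>+<real name>",
# e.g. "ABCDEF+Helvetica-Black". Strip that tag to recover the real name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_font_name(name):
    """Return the font name without its subset tag, if it has one."""
    return SUBSET_TAG.sub("", name)


def merge_subsets(embedded_fonts):
    """Group glyphs by (base font name, encoding).

    embedded_fonts is an iterable of (font_name, encoding, glyph_names)
    tuples. This layout is hypothetical; a real tool would fill it in
    from the font dictionaries of each document being embedded.
    Subsets with different encodings are deliberately kept apart.
    """
    merged = defaultdict(set)
    for name, encoding, glyphs in embedded_fonts:
        merged[(base_font_name(name), encoding)].update(glyphs)
    return dict(merged)


# Made-up sample data: three subsets of the same font, two sharing an encoding.
fonts = [
    ("ABCDEF+Helvetica-Black", "WinAnsiEncoding", {"A", "B", "space"}),
    ("GHIJKL+Helvetica-Black", "WinAnsiEncoding", {"B", "C"}),
    ("MNOPQR+Helvetica-Black", "MacRomanEncoding", {"A"}),
]
result = merge_subsets(fonts)
for (font, encoding), glyphs in sorted(result.items()):
    print(font, encoding, sorted(glyphs))
```

Each key in the result is one font that would need to be embedded once, with the union of glyphs it has to cover. Actually producing that combined subset and rewriting the font references is the hard part and is not addressed here.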

