Thanks for the information. I continued my search for libraries and stumbled on ICEpdf from ICEsoft. The same PDF renders correctly there, so you could check their source code for hints while improving PDFBox ;-)
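
For reference, the three steps Hamed lists in the quoted message (use the character codes, map them through the font's CMAP, draw with the resulting glyph codes) could be sketched roughly as below. This is only an illustration under stated assumptions, not PDFBox code: the class name, the method name and the cmap values are hypothetical, and a plain Map stands in for a parsed TrueType cmap table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of PDFBox: maps the character codes found
// in a PDF content stream to glyph codes through a cmap lookup.
class GlyphMappingSketch {

    // Glyph index 0 is the "missing glyph" (.notdef) in TrueType fonts.
    static final int NOTDEF_GLYPH = 0;

    // Steps 1 and 2: take the raw codes (not the extracted text) and look
    // each one up in the cmap; unmapped codes fall back to .notdef.
    static int[] toGlyphCodes(int[] charCodes, Map<Integer, Integer> cmap) {
        int[] glyphCodes = new int[charCodes.length];
        for (int i = 0; i < charCodes.length; i++) {
            Integer glyph = cmap.get(charCodes[i]);
            glyphCodes[i] = (glyph != null) ? glyph : NOTDEF_GLYPH;
        }
        return glyphCodes;
    }

    public static void main(String[] args) {
        // Made-up cmap entries for the letters t, e and s.
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x74, 87);
        cmap.put(0x65, 72);
        cmap.put(0x73, 86);

        int[] codes = {0x74, 0x65, 0x73, 0x74}; // the string "test"
        // Step 3 would hand these glyph codes to the renderer.
        System.out.println(Arrays.toString(toGlyphCodes(codes, cmap)));
    }
}
```

For the drawing step, Java2D can take glyph codes directly through Font.createGlyphVector(FontRenderContext, int[]) and Graphics2D.drawGlyphVector. Whether that fits into PDFBox's rendering path is an assumption on my part.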

On Wed, Apr 4, 2012 at 9:57 AM, Hamed Iravanchi <[email protected]> wrote:

> Hi Nicklas,
>
> I've been working on this issue for a while.
> Right now, PDFBox cannot correctly convert PDF files created by OpenOffice
> or LibreOffice to images.
> In my tests, PDF files created by Microsoft Word do not have this problem
> in the latest trunk code.
>
> This is due to using extracted text to render the image, rather than using
> code points.
> Andreas used to reply to my emails so we could collaborate and resolve such
> issues faster, but I haven't received any reply lately.
> I don't know if I'm posting in the right place or not, though...
>
> Anyway, to fix this issue for TrueType fonts (which are typically used in
> your case), the following things should be done by PDFBox:
> - It should use code points for all TrueType fonts, instead of extracted
>   text.
> - The code points should be mapped to glyph codes using the font's CMAP.
> - Glyph codes should be used to draw text on the image.
>
> I just managed to fix this yesterday in my code for my sample PDF files, by
> modifying the trunk code.
> But I'm waiting for the developer team to collaborate so that I can make
> sure what I'm doing is right and doesn't break other parts of PDFBox.
>
> -Hamed
>
>
> On Wed, Mar 28, 2012 at 11:15 AM, Nicklas Karlsson <[email protected]> wrote:
>
> > Hi,
> >
> > I'm using the latest LibreOffice to produce a PDF and the latest PDFBox
> > to extract the pages as images, but I'm having some problems with the
> > fonts.
> >
> > If I use Times New Roman I get a
> >
> > org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
> > Changing font on <test> from <Times New Roman> to the default font
> >
> > If I embed some more exotic fonts in the PDF, I get a
> >
> > org.apache.pdfbox.util.PDFStreamEngine processOperator
> > unsupported/disabled operation: BMC
> > org.apache.pdfbox.util.PDFStreamEngine processOperator
> > unsupported/disabled operation: EMC
> > org.apache.pdfbox.util.PDFStreamEngine processOperator
> > unsupported/disabled operation: BDC
> > org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
> > Changing font on <test> from <Algerian> to the default font
> >
> > This is all on the same machine. Is there a special trick to getting the
> > fonts working?
> >
> > The extraction is done with something like
> >
> > PDDocument doc = PDDocument.load(pdf);
> > List pages = doc.getDocumentCatalog().getAllPages();
> > for (int i = 0; i < pages.size(); i++)
> > {
> >     PDPage page = (PDPage) pages.get(i);
> >     pics.add(page.convertToImage());
> > }
> >
> >
> > Thanks in advance,
> > Nik
> >
> > --
> > ---
> > Nik
> >
>

--
---
Nik

