Hi Andreas, that was a great help. I'm going to check the version on trunk!
Regards
Hannes

On Sun, Jan 30, 2011 at 6:31 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 30.01.2011 17:20, Hannes Carl Meyer wrote:
>
>> Hi Andreas,
>>
>> thank you very much for your reply!
>>
>> The problem occurs, for example, on this document:
>>
>> https://www.sparkasse-hildesheim.de/pdf/vertragsbedingungen/057_produktbedingungen_spk_cards.pdf
>>
>> I'm using the latest version of PDFBox, 1.4.0!
>
> Hmm, I can confirm your issue and it seems to be case 7., the second case
> 6. ;-) It works fine with the current trunk (we recently made some
> improvements).
>
>> Do you know a tool to debug a given PDF? Maybe you could have a look at
>> the PDF shown above.
>
> To determine which fonts are used, just have a look at the PDF properties.
> The Acrobat Reader and other tools provide those properties.
> Use the PDFDebugger [1], which comes with PDFBox, to walk through a PDF on a
> logical level.
>
> [1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html
>
>> On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> On 29.01.2011 22:24, Hannes Carl Meyer wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using PDFBox to extract text from various PDFs.
>>>> Since these PDFs are from good ol' Germany and written in German, they
>>>> contain lots of nice umlauts (ä, ö, ü etc.).
>>>>
>>>> On some PDFs the extraction of umlauts fails.
>>>>
>>>> From my first analysis I could imagine it is somehow because I don't
>>>> have the particular PDF's font.
>>>>
>>>> Is it necessary to have a font installed and loaded into PDFBox to
>>>> perform a proper extraction?
>>>>
>>>> Another interesting point: if I open the PDF documents I can't extract
>>>> umlauts from in my Adobe Reader and search for an umlaut that is
>>>> displayed properly, the search fails. Manually extracting the text via
>>>> copy & paste from the PDF fails as well.
>>>
>>> Without having the PDF at hand, it's hard to say what may be the reason
>>> for the described issue. There are different possibilities:
>>>
>>> 1.) the font isn't embedded and the substitution made by PDFBox doesn't
>>>     fit 100%
>>> 2.) the font is an embedded subset of a TrueType font, which will be
>>>     substituted with another font due to an issue concerning font subsets
>>>     (see [1] for further info), and that may lead to the same effect as 1.
>>> 3.) the PDF uses so-called CIDs (character IDs) without a suitable
>>>     mapping to Unicode
>>> 4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
>>> 5.) you're using wrong parameters for the extraction
>>> 6.) you're using an editor with limited capabilities concerning text
>>>     encoding
>>> 6.) there is still an issue with PDFBox
>>>
>>> Following your last comment, cases 3. or 4. are most likely.
>>>
>>> BTW, what version of PDFBox are you using?
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> [1] https://issues.apache.org/jira/browse/PDFBOX-490
>
> BR
> Andreas Lehmkühler
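
Cases 5.) and the first 6.) from the list of possibilities (wrong extraction parameters, or a text encoding that the output file or editor cannot handle) can be checked without looking at the PDF at all. The following is only a minimal sketch using the plain Java standard library, not PDFBox itself: the class name, the method names and the sample string are made up for illustration, and the string merely stands in for whatever text an extractor returns. It shows that umlauts survive an encode/decode round trip in UTF-8 and ISO-8859-1, get replaced by `?` in US-ASCII, and turn into two-character garbage when UTF-8 bytes are read as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: checks whether text containing
// umlauts survives being written and read back with a given charset.
class UmlautEncodingCheck {

    // Encode the text to bytes and decode it again with the same charset,
    // as writing a file and reading it back would do. Characters the charset
    // cannot represent are replaced by the charset's replacement byte.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // True if the text is unchanged after the round trip.
    static boolean survives(String text, Charset charset) {
        return text.equals(roundTrip(text, charset));
    }

    // Simulates a viewer or editor that reads UTF-8 output as ISO-8859-1.
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for extracted text; Unicode escapes keep the source file
        // itself independent of the compiler's source encoding.
        String extracted = "Geb\u00fchren f\u00fcr \u00dcberweisungen";

        System.out.println("UTF-8 ok:      " + survives(extracted, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 ok: " + survives(extracted, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII ok:   " + survives(extracted, StandardCharsets.US_ASCII));
        System.out.println("Misread text differs: " + !extracted.equals(misread(extracted)));
    }
}
```

If the umlauts are already wrong in the string before it is written anywhere, the encoding of the output is not the culprit and the remaining cases from the list (fonts, missing Unicode mappings, or a PDFBox issue) are the ones to look at.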

