Re: Text Extraction and Fonts

Hannes Carl Meyer Sun, 30 Jan 2011 09:57:09 -0800

Hi Andreas,

that's a great help, I'm going to check the version on trunk!


Regards

Hannes

On Sun, Jan 30, 2011 at 6:31 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
>
> On 30.01.2011 17:20, Hannes Carl Meyer wrote:
>
>  Hi Andreas,
>>
>> thank you very much for your reply!
>>
>> The problem occurs for example on this document
>>
>> https://www.sparkasse-hildesheim.de/pdf/vertragsbedingungen/057_produktbedingungen_spk_cards.pdf
>>
>> I'm using the latest version of PDFBox, 1.4.0!
>>
> Hmm, I can confirm your issue, and it seems to be case 7, i.e. the second
> case 6 ;-) It works fine with the current trunk (we recently made some
> improvements).
>
>
>> Do you know a tool to debug a given PDF? Maybe you could have a look at the
>> PDF shown above.
>>
> To determine which fonts are used, just have a look at the PDF's document
> properties. Adobe Reader and other tools show them.
> Use the PDFDebugger [1], which comes with PDFBox, to walk through a PDF on a
> logical level.
>
>
> [1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html
>
>
>> On Sun, Jan 30, 2011 at 4:18 PM, Andreas Lehmkuehler <[email protected]> wrote:
>>
>>  Hi,
>>>
>>> On 29.01.2011 22:24, Hannes Carl Meyer wrote:
>>>
>>>  Hi,
>>>
>>>>
>>>> I'm using PDFBox to extract text from various PDFs.
>>>> Since these PDFs come from good ol' Germany and are written in German,
>>>> they contain lots of nice umlauts (ä, ö, ü etc.).
>>>>
>>>> On some PDFs the extraction of umlauts fails.
>>>>
>>>> From my first analysis, I suspect this happens because I don't have the
>>>> particular PDF's font installed.
>>>>
>>>> Is it necessary to have a font installed and loaded into PDFBox to perform
>>>> a proper extraction?
>>>>
>>>> Another interesting point: if I open the PDF documents I can't extract
>>>> umlauts from in Adobe Reader and search for an umlaut that is displayed
>>>> properly, the search fails. Manually extracting the text via copy & paste
>>>> from the PDF fails as well.
>>>>
>>> Without having the PDF at hand, it's hard to say what the reason for the
>>> described issue may be. There are several possibilities:
>>>
>>> 1.) the font isn't embedded and the substitution made by PDFBox doesn't
>>> fit 100%
>>> 2.) the font is an embedded subset of a TrueType font, which will be
>>> substituted with another font due to an issue concerning font subsets
>>> (see [1] for further info), and that may lead to the same effect as 1.
>>> 3.) the PDF uses so-called CIDs (character IDs) without a suitable
>>> mapping to Unicode
>>> 4.) the PDF uses a Type 3 font without a suitable mapping to Unicode
>>> 5.) you're using the wrong parameters for the extraction
>>> 6.) you're using an editor with limited capabilities concerning text
>>> encoding
>>> 6.) there is still an issue with PDFBox
>>>
>>> Judging by your last comment (Adobe Reader can't search for or copy the
>>> umlauts either), cases 3 or 4 are the most likely.
>>>
>>> BTW, what version of PDFBox are you using?
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>> [1] https://issues.apache.org/jira/browse/PDFBOX-490
>>>
>>
> BR
> Andreas Lehmkühler
>
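
Cases 3, 4 and 6 from the list above can be narrowed down on the extracted
text alone, without touching the PDF. The following is a minimal sketch in
plain Java (no PDFBox calls), assuming the text has already been extracted
into a String. The class and method names (UmlautCheck, suspiciousCodePoints,
misreadAsLatin1) are made up for illustration, and treating private-use or
replacement characters as a hint for a missing Unicode mapping is only a
heuristic, not something PDFBox guarantees.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: inspects already extracted text.
class UmlautCheck {

    /**
     * Collects code points that often show up where a font had no usable
     * Unicode mapping (cases 3 and 4): the replacement character, private-use
     * characters, and control characters other than tab and line breaks.
     * Assumption: this is a heuristic, real documents may behave differently.
     */
    public static List<Integer> suspiciousCodePoints(String text) {
        List<Integer> found = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            boolean replacement = cp == 0xFFFD;
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            boolean control = Character.isISOControl(cp)
                    && cp != '\n' && cp != '\r' && cp != '\t';
            if (replacement || privateUse || control) {
                found.add(cp);
            }
        });
        return found;
    }

    /**
     * Simulates case 6: text written as UTF-8 but read back as ISO-8859-1,
     * which is what an editor with the wrong encoding setting would display.
     */
    public static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for extracted text; the private-use character replaces an umlaut.
        String extracted = "Produktbedingungen f\uf0fcr Karten";
        for (int cp : suspiciousCodePoints(extracted)) {
            System.out.printf("suspicious code point U+%04X%n", cp);
        }
        // Shows how a correctly extracted umlaut looks in a wrongly configured editor.
        System.out.println(misreadAsLatin1("f\u00fcr"));
    }
}
```

If the first check reports code points where umlauts should be, the font
mapping (cases 3 or 4) is the more likely culprit. If the text instead shows
two-character sequences like the one produced by misreadAsLatin1, the editor
or viewer encoding (case 6) is worth checking first.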
