Re: PDFbox appears to struggle with text extraction for some fonts.

John Logan Wed, 16 Nov 2016 10:42:26 -0800

On Wed, 2016-11-16 at 18:06 +0000, Tilman Hausherr wrote:
> Am 16.11.2016 um 18:47 schrieb John Logan:
> > Hi,
> >
> > I've been using PDFbox to extract text features for layout analysis,
> > and I'm running into a file that seems to render properly, but the extracted
> > text looks totally botched.  If I copy/paste from Acrobat Reader or Mac
> > Preview, the same glyphs are broken.
>
> Yes.
>
> Have a look here:
> Root/Pages/Kids/[0]/Resources/Font/Ty7
>
> then scroll down and look at the "unicode" column. It is empty.
>
> You have to understand the difference between "glyph" and "character". A
> glyph is just a painting of a character. If you see a "9", that doesn't
> mean you get a "9" in text extraction too; the mapping must be defined
> somewhere. And if it is missing or incorrect, then you won't
> get a good extraction.
>
> Tilman
>


[snip]

Thanks for the quick response, Tilman (and John).  Sorry for
the imprecision in terms; I understand your explanation of the
difference.

The part I didn't grok until I saw your explanations was the
missing Unicode mapping information.  I appreciate your help
in clarifying how that works.
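
To make that concrete, here is a toy Python sketch of the idea. It is
not PDFBox code: the ToyFont class and its fields are invented for
illustration. It assumes a font carries two separate tables, one from
character codes to glyphs (used for drawing) and an optional one from
codes to Unicode (used for extraction). With the second table empty,
the page draws the same, but extraction has nothing sensible to return.

```python
# Toy model only (hypothetical names, not the PDFBox API).
REPLACEMENT = "\ufffd"  # stand-in for "no Unicode value known"


class ToyFont:
    def __init__(self, glyph_names, to_unicode=None):
        # code -> glyph name: what gets painted on the page
        self.glyph_names = glyph_names
        # code -> Unicode string: what text extraction relies on
        self.to_unicode = to_unicode or {}

    def render(self, codes):
        """Return the glyphs that would be drawn for these codes."""
        return [self.glyph_names[c] for c in codes]

    def extract(self, codes):
        """Return extracted text; unmapped codes become U+FFFD."""
        return "".join(self.to_unicode.get(c, REPLACEMENT) for c in codes)


glyphs = {1: "nine", 2: "A"}
good = ToyFont(glyphs, {1: "9", 2: "A"})
broken = ToyFont(glyphs)  # same glyphs, but the "unicode" column is empty

codes = [1, 2]
print(good.render(codes) == broken.render(codes))  # both draw the same glyphs
print(repr(good.extract(codes)))
print(repr(broken.extract(codes)))
```

Both fonts render identically, which matches what I saw: the file
looks fine on screen, and only the extracted text is broken.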

John
