On Wed, 2016-11-16 at 18:06 +0000, Tilman Hausherr wrote:
> On 16.11.2016 at 18:47, John Logan wrote:
> > Hi,
> >
> > I've been using PDFBox to extract text features for layout analysis,
> > and I'm running into a file that seems to render properly, but the
> > extracted text looks totally botched. If I copy/paste from Acrobat
> > Reader or Mac Preview, the same glyphs are broken.
>
> Yes.
>
> Have a look here:
> Root/Pages/Kids/[0]/Resources/Font/Ty7
>
> then scroll down and look at the "unicode" column. It is empty.
>
> You have to understand the difference between "glyph" and "character". A
> glyph is just a painting of a character. If you see a "9", that doesn't
> mean you get a "9" in text extraction too; this must be defined
> somewhere. And if it isn't, or is incorrect, then you won't get a good
> extraction.
>
> Tilman
>
[snip]

Thanks for the quick response, Tilman (and John).

Sorry for the imprecision in terms; I understand your explanation of the
difference. The part I didn't grok until I saw your explanations was the
missing Unicode mapping information. I appreciate your help in clarifying
how that works.

John
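
For anyone reading this thread later, here is a minimal sketch of the glyph/character distinction discussed above. It is a toy model in plain Python, not PDFBox code: the names `extract_text`, `glyph_shapes`, and the three example fonts are hypothetical, and the codes are made up. It only illustrates why a page can look right while extraction fails when the font's Unicode mapping is empty or wrong.

```python
from typing import Dict, Iterable, Optional

# U+FFFD, the Unicode replacement character, stands in for "no mapping found".
REPLACEMENT = "\ufffd"


def extract_text(codes: Iterable[int], to_unicode: Optional[Dict[int, str]]) -> str:
    """Turn character codes into text using ONLY the font's Unicode mapping.

    Hypothetical helper: it never looks at the glyph shapes, because a glyph
    is just a drawing and carries no text meaning by itself.
    """
    mapping = to_unicode or {}
    return "".join(mapping.get(code, REPLACEMENT) for code in codes)


# Codes as they might appear in a page's content stream (made-up values).
codes = [0x01, 0x02, 0x03]

# What the font program paints for each code. Rendering uses only this.
glyph_shapes = {0x01: "2", 0x02: "9", 0x03: "7"}

# Three variants of the same font, differing only in the Unicode mapping.
good_mapping = {0x01: "2", 0x02: "9", 0x03: "7"}   # mapping agrees with the glyphs
wrong_mapping = {0x01: "A", 0x02: "B", 0x03: "C"}  # mapping present but incorrect
empty_mapping = None                               # the "unicode" column is empty

rendered = "".join(glyph_shapes[c] for c in codes)

print("rendered on screen:", rendered)
print("good mapping:      ", extract_text(codes, good_mapping))
print("wrong mapping:     ", extract_text(codes, wrong_mapping))
print("empty mapping:     ", extract_text(codes, empty_mapping))
```

In all three cases the rendered page is identical; only the extracted text changes. Real extractors may try other fallbacks (for example the font's encoding or glyph names) before giving up, so this sketch is a simplification of the failure mode rather than a description of any particular tool.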

