On 16.11.2016 at 18:47, John Logan wrote:
Hi,

I've been using PDFBox to extract text features for layout analysis, and I'm 
running into a file that seems to render properly, but the extracted text looks 
totally botched.  If I copy/paste from Acrobat Reader or Mac Preview, the same 
glyphs are broken.

Yes.

Have a look here:
Root/Pages/Kids/[0]/Resources/Font/Ty7

then scroll down and look at the "unicode" column. It is empty.

You have to understand the difference between a "glyph" and a "character". A glyph is just the drawn shape of a character. Seeing a "9" on the page does not guarantee that text extraction returns a "9" too; the mapping from glyph to character has to be defined somewhere in the PDF. If that mapping is missing or incorrect, you won't get a good extraction.
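
To make this concrete, here is a minimal, self-contained sketch (plain Java, not the PDFBox API). It decodes the strings shown with /Ty7 in the content stream quoted below, once with an empty code-to-Unicode table and once with a table inferred from the visible text ("A350", "suffering"). The class name, the table contents and the fallback of emitting the raw code are assumptions made for illustration; a real extractor may behave differently.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphVsCharacter {

    // Decode a shown string using a code -> Unicode table.
    // If a code has no entry, fall back to the raw code itself
    // (assumed fallback, for illustration only).
    static String decode(String shown, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (char code : shown.toCharArray()) {
            String mapped = toUnicode.get((int) code);
            sb.append(mapped != null ? mapped : String.valueOf(code));
        }
        return sb.toString();
    }

    // Hypothetical table, inferred from what the page visibly shows:
    // "$%&" is drawn as "350" and "'" is drawn as the "ff" ligature.
    static Map<Integer, String> inferredTable() {
        Map<Integer, String> table = new HashMap<>();
        table.put((int) '$', "3");
        table.put((int) '%', "5");
        table.put((int) '&', "0");
        table.put((int) '\'', "ff");
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, String> empty = new HashMap<>();

        // No mapping: the codes come out as they are stored.
        System.out.println("A" + decode("$%&", empty));                    // A$%&
        System.out.println("su" + decode("'", empty) + "ering");           // su'ering

        // With a mapping: the codes become the characters the glyphs depict.
        System.out.println("A" + decode("$%&", inferredTable()));          // A350
        System.out.println("su" + decode("'", inferredTable()) + "ering"); // suffering
    }
}
```

The viewer only needs the glyph outlines to paint the page, which is why rendering looks fine even when the table is empty.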

Tilman


I've tried to make sense of the PDF using the debugger, but this is a bit beyond my 
(limited) PDF internals knowledge.  My guess is that the PDF file has some problems with 
the subsetted "BerlingskeSerifText-Extralight*2" font (this appears to be the 
font used in the example I provide below), but I can't determine why the problem glyphs 
appear fine inside a PDF viewer whereas the extracted text is incorrect.

Thanks for any guidance you can provide!  I've included a sample file and 
details below.

John

I've uploaded the PDF for a problem page here:

https://www.dropbox.com/s/05rlbmv74ya0lrg/TVL_2016_12-64.pdf?dl=0

The phrase "comfortable Airbus A XWB to Helsinki and suffering zero jet lag" on this page has 
problems with the numbers in "A350" and the ligature in "suffering".

If I use the PDFBox preflight app, I see three error classes:

1.0.14 : Syntax error, Object {67:0} has an offset of 0
3.1.4 : Invalid Font definition, UDWCAS+BerlingskeSerifCn-XBd: The Charset 
entry is missing for the Type1 Subset
1.2.7 : Body Syntax error, Filter specified in metadata dictionnary

The PDF debugger dump of this part of the content is:

q
     1 0 0 1 99.60001 123.131 cm
     BT
       8.5 0 0 8.5 0 0 Tm
       /Ty5 1 Tf
       [ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ] TJ
     ET
   Q
   q
     1 0 0 1 99.60001 123.131 cm
     BT
       8.5 0 0 8.5 81.1988 0 Tm
       /Ty7 1 Tf
       [ ($%) 10 (&) ] TJ
     ET
   Q
   q
     1 0 0 1 99.60001 123.131 cm
     BT
       8.5 0 0 8.5 94.5778 0 Tm
       /Ty5 1 Tf
       [ ( ) -24 (XWB ) -24 ( ) -24 (to ) -24 (Helsinki ) -24 (and ) -24 (su) ] TJ
     ET
   Q
   q
     1 0 0 1 99.60001 123.131 cm
     BT
       8.5 0 0 8.5 186.9813 0 Tm
       /Ty7 1 Tf
       (') Tj
     ET
   Q
   q
     1 0 0 1 99.60001 123.131 cm
     BT
       8.5 0 0 8.5 192.0218 0 Tm
       /Ty5 1 Tf
       [ (ering ) -24 (z) 5 (er) 10 (o ) -24 (jet ) -24 (lag, ) -24 (t) -5 (ra) 10 (v) 10 (el ) -24 (is ) -24 (g) 5 (ett) -5 (ing ) -24 (undeniably ) -24 (better) 20 (. ) ] TJ
     ET
   Q

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


