On 16.11.2016 at 18:47, John Logan wrote:
Hi,
I've been using PDFBox to extract text features for layout analysis, and I've
run into a file that seems to render properly, but the extracted text looks
totally botched. If I copy/paste from Acrobat Reader or Mac Preview, the same
glyphs are broken.
Yes.
Have a look here:
Root/Pages/Kids/[0]/Resources/Font/Ty7
then scroll down and look at the "unicode" column. It is empty.
You have to understand the difference between "glyph" and "character". A
glyph is just a painting of a character. Seeing a "9" on the page does not
guarantee that text extraction returns a "9" too; that mapping must be
defined somewhere in the PDF. If it is missing or incorrect, you won't
get a good extraction.
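
To make the distinction concrete, here is a minimal sketch in plain Java. It does not use the PDFBox API. The class name, the '?' placeholder and the "complete" table are assumptions for illustration only; the codes 0x24, 0x25, 0x26 are the bytes "$%&" drawn with Ty7 in the content stream quoted further down, and the guess that they stand for "350" comes from John's description, not from the file.

```java
import java.util.HashMap;
import java.util.Map;

class GlyphVsCharacter {
    // Translate character codes to text using a code-to-Unicode table.
    // Codes without an entry become "?" (a placeholder chosen for this sketch).
    static String extract(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "?"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x24, 0x25, 0x26}; // the bytes "$%&" shown with font Ty7

        // Hypothetical table that a well-formed font could carry.
        Map<Integer, String> complete = new HashMap<>();
        complete.put(0x24, "3");
        complete.put(0x25, "5");
        complete.put(0x26, "0");

        // Table with no entries, like an empty "unicode" column.
        Map<Integer, String> empty = new HashMap<>();

        System.out.println(extract(codes, complete)); // prints 350
        System.out.println(extract(codes, empty));    // prints ???
    }
}
```

The renderer only needs the code to pick the glyph outline, so the page looks fine either way. Extraction depends on the second table, and that is the part missing here.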
Tilman
I've tried to make sense of the PDF using the debugger, but this is a bit beyond my
(limited) PDF internals knowledge. My guess is that the PDF file has some problems with
the subsetted "BerlingskeSerifText-Extralight*2" font (this appears to be the
font used in the example I provide below), but I can't determine why the problem glyphs
appear fine inside a PDF viewer whereas the extracted text is incorrect.
Thanks for any guidance you can provide! I've included a sample file and
details below.
John
I've uploaded the PDF for a problem page here:
https://www.dropbox.com/s/05rlbmv74ya0lrg/TVL_2016_12-64.pdf?dl=0
The phrase "comfortable Airbus A XWB to Helsinki and suffering zero jet lag" on this page has
problems with the numbers in "A350" and the ligature in "suffering".
If I use the PDFBox Preflight app, I see three error classes:
1.0.14 : Syntax error, Object {67:0} has an offset of 0
3.1.4 : Invalid Font definition, UDWCAS+BerlingskeSerifCn-XBd: The Charset
entry is missing for the Type1 Subset
1.2.7 : Body Syntax error, Filter specified in metadata dictionnary
The PDF debugger dump of this part of the content is:
q
1 0 0 1 99.60001 123.131 cm
BT
8.5 0 0 8.5 0 0 Tm
/Ty5 1 Tf
[ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ] TJ
ET
Q
q
1 0 0 1 99.60001 123.131 cm
BT
8.5 0 0 8.5 81.1988 0 Tm
/Ty7 1 Tf
[ ($%) 10 (&) ] TJ
ET
Q
q
1 0 0 1 99.60001 123.131 cm
BT
8.5 0 0 8.5 94.5778 0 Tm
/Ty5 1 Tf
[ ( ) -24 (XWB ) -24 ( ) -24 (to ) -24 (Helsinki ) -24 (and ) -24 (su) ] TJ
ET
Q
q
1 0 0 1 99.60001 123.131 cm
BT
8.5 0 0 8.5 186.9813 0 Tm
/Ty7 1 Tf
(') Tj
ET
Q
q
1 0 0 1 99.60001 123.131 cm
BT
8.5 0 0 8.5 192.0218 0 Tm
/Ty5 1 Tf
[ (ering ) -24 (z) 5 (er) 10 (o ) -24 (jet ) -24 (lag, ) -24 (t) -5 (ra)
10 (v) 10 (el ) -24 (is ) -24 (g) 5 (ett) -5 (ing ) -24 (undeniably ) -24
(better) 20 (. ) ] TJ
ET
Q
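
The dump above shows that the problem spots are exactly the strings drawn with /Ty7. As a rough sketch (plain Java, not the PDFBox parser), the string operands of a TJ or Tj line can be pulled out and printed as character codes, which are the values to look up in the font's "unicode" column. The class and method names are hypothetical, the parsing ignores backslash escapes and nested parentheses, and it assumes Ty7 is a simple font with one byte per code.

```java
import java.util.ArrayList;
import java.util.List;

class TjOperandSketch {
    // Collect the literal strings "( ... )" from one TJ/Tj line.
    // Simplification: backslash escapes and nested parentheses are not handled.
    static List<String> stringOperands(String line) {
        List<String> out = new ArrayList<>();
        int open = line.indexOf('(');
        while (open >= 0) {
            int close = line.indexOf(')', open + 1);
            if (close < 0) {
                break; // unbalanced input, stop quietly
            }
            out.add(line.substring(open + 1, close));
            open = line.indexOf('(', close + 1);
        }
        return out;
    }

    // Render each character code of a string operand as two hex digits.
    static String hexCodes(String operand) {
        StringBuilder sb = new StringBuilder();
        for (char c : operand.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two Ty7 show-text lines from the dump.
        for (String s : stringOperands("[ ($%) 10 (&) ] TJ")) {
            System.out.println("Ty7 codes: " + hexCodes(s));
        }
        System.out.println("Ty7 codes: " + hexCodes(stringOperands("(') Tj").get(0)));
    }
}
```

For the two Ty7 lines this yields the codes 24, 25, 26 and 27. Going by the visible text, these would correspond to "3", "5", "0" and the "ff" ligature, but that reading is an inference from the rendered page.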
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]