On 25.09.2017 at 19:43, Allison, Timothy B. wrote:
Thank you, Tilman.  I haven't looked yet, but to confirm, there's no page 
parameter that specifies that the text has been rotated?

Yes and no: text can be rotated through the page-level rotation, but also inside the content stream with the "cm" or "Tm" operators, and maybe others.

In your file, there is no page level rotation. It is done in the content stream with commands like

    0 60 -60 0 192.84 160.08 Tm

And what gets really tricky is if you have diagonal rotations or mixed rotations...
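
To make the matrix above concrete, here is a minimal sketch that reads the six operands of a "Tm" line and derives the rotation angle and scale from the first two. It assumes a text matrix of the form a b c d e f in which a = s*cos(angle) and b = s*sin(angle). The class and method names (TmRotation, parseTm, rotationDegrees, scale) are made up for illustration and are not part of PDFBox. It looks at the text matrix only, so any additional "cm" transform or page rotation would still have to be combined with it.

```java
final class TmRotation {

    /** Parses a line such as "0 60 -60 0 192.84 160.08 Tm" into {a, b, c, d, e, f}. */
    static double[] parseTm(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 7 || !parts[6].equals("Tm")) {
            throw new IllegalArgumentException("Expected six numbers followed by Tm: " + line);
        }
        double[] m = new double[6];
        for (int i = 0; i < 6; i++) {
            m[i] = Double.parseDouble(parts[i]);
        }
        return m;
    }

    /** Counterclockwise angle of the text baseline in degrees, from a and b. */
    static double rotationDegrees(double[] m) {
        return Math.toDegrees(Math.atan2(m[1], m[0]));
    }

    /** Length of the (a, b) vector, i.e. the horizontal scale applied to the text. */
    static double scale(double[] m) {
        return Math.hypot(m[0], m[1]);
    }

    public static void main(String[] args) {
        double[] m = parseTm("0 60 -60 0 192.84 160.08 Tm");
        // For this matrix the angle comes out at about 90 degrees and the scale at about 60.
        System.out.println("angle = " + rotationDegrees(m) + ", scale = " + scale(m));
    }
}
```

A diagonal rotation shows up the same way: a matrix starting with "1 1 -1 1" gives an angle of about 45 degrees, which is the case that no page-level rotation of 0, 90, 180 or 270 can undo.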

Tilman


Back to language modeling... 😊  Thank you, again!

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Monday, September 25, 2017 1:39 PM
To: [email protected]
Subject: Re: Extracting rotated text

No good idea, except to call setRotation() on the page and then do text extraction.

A possible strategy might be to try all rotations and see which one yields the most 
known words.
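
A rough sketch of the scoring half of that strategy, using only the Java standard library: given the text extracted at each candidate rotation, count how many tokens appear in a word list and keep the rotation with the most hits. The class RotationPicker, its methods, and the tiny word list are hypothetical. Producing the candidate texts is left out; with PDFBox that would presumably mean setting the page rotation (PDPage.setRotation(int)) and running PDFTextStripper.getText() once per angle.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class RotationPicker {

    /** Counts tokens of at least three letters that occur in the known-word set. */
    static int countKnownWords(String text, Set<String> knownWords) {
        int hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // Very short tokens are skipped: broken extraction yields many
            // two-letter fragments ("is", "st") that happen to be real words.
            if (token.length() >= 3 && knownWords.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    /** Returns the rotation whose text has the most known words; the first one wins ties. */
    static int bestRotation(Map<Integer, String> textByRotation, Set<String> knownWords) {
        int best = 0;
        int bestHits = -1;
        for (Map.Entry<Integer, String> entry : textByRotation.entrySet()) {
            int hits = countKnownWords(entry.getValue(), knownWords);
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The word list is assumed to be lowercase. This only distinguishes whole-page rotations, so it does not help with pages that mix several text orientations.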

Tilman


On 25.09.2017 at 19:31, Allison, Timothy B. wrote:
Colleagues,
Any recommendations for extracting rotated text such as: 
https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES
 ?

Adobe DC gets reasonable text with "save as text".  PDFBox's ExtractText (and 
Tika) get something like this:

FS
IS
L
is
te
ria
Li
st
er
ia
R
is
k
R
is
k
As
se
ss
m
en

Thank you!


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

