On 25.09.2017 at 19:43, Allison, Timothy B. wrote:
Thank you, Tilman. I haven't looked yet, but to confirm, there's no page
parameter that specifies that the text has been rotated?
Yes and no: text can be rotated through the page-level rotation, but also
inside the content stream with the "cm" or "Tm" operators, and maybe others.
In your file, there is no page-level rotation. The rotation is done in the
content stream with commands like
0 60 -60 0 192.84 160.08 Tm
And what gets really tricky is if you have diagonal rotations or mixed
rotations...
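
To make the Tm example above concrete, here is a minimal Java sketch that reads the rotation out of the first two operands (a, b) of a text matrix. The class and method names are hypothetical and not part of PDFBox. It assumes the matrix is a pure rotation plus uniform scaling, and it ignores any "cm" transformation or page-level rotation, which would have to be combined with it to get the angle actually seen on the page.

```java
// Hypothetical helper, not a PDFBox API: interprets the operands of
// "a b c d e f Tm" assuming a rotation combined with uniform scaling.
final class TextMatrixAngle {

    /** Counter-clockwise rotation in degrees, normalised to [0, 360). */
    static double rotationDegrees(double a, double b) {
        // For a rotation matrix, a = s*cos(theta) and b = s*sin(theta).
        double degrees = Math.toDegrees(Math.atan2(b, a));
        return (degrees % 360.0 + 360.0) % 360.0;
    }

    /** Scale factor s carried by the matrix (often the effective font size). */
    static double scale(double a, double b) {
        return Math.hypot(a, b);
    }

    /** True if the rotation is a multiple of 90 degrees, i.e. not diagonal. */
    static boolean isAxisAligned(double a, double b) {
        double remainder = rotationDegrees(a, b) % 90.0;
        return Math.min(remainder, 90.0 - remainder) < 1e-6;
    }
}
```

For "0 60 -60 0 192.84 160.08 Tm", a = 0 and b = 60, so the sketch reports a rotation of about 90 degrees with a scale of 60. A diagonal matrix such as a = 1, b = 1 is reported as not axis-aligned, which is the case where simply rotating the page no longer helps.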
Tilman
Back to language modeling... 😊 Thank you, again!
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Monday, September 25, 2017 1:39 PM
To: [email protected]
Subject: Re: Extracting rotated text
I have no good idea except to call setRotation() on the page and then run text
extraction. A possible strategy might be to try all rotations and see which one
yields the most known words.
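
The "try all rotations and count known words" idea could be sketched roughly as below. This is only an illustration: the class, its method names, and the tiny dictionary are hypothetical. It assumes the text for each rotation has already been extracted, for example (in PDFBox 2.x, as far as I know) by calling PDPage.setRotation(int) and then PDFTextStripper.getText(PDDocument), and it only covers the scoring step.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical scoring helper, not part of PDFBox or Tika.
final class RotationGuesser {

    /** Counts tokens longer than two letters that appear in the dictionary. */
    static int countKnownWords(String text, Set<String> dictionary) {
        int hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // Skip very short tokens so fragments like "is" or "er" do not score.
            if (token.length() > 2 && dictionary.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    /** Returns the rotation whose text has the most dictionary hits. */
    static int bestRotation(Map<Integer, String> textByRotation, Set<String> dictionary) {
        int best = 0;
        int bestHits = -1;
        for (Map.Entry<Integer, String> entry : textByRotation.entrySet()) {
            int hits = countKnownWords(entry.getValue(), dictionary);
            if (hits > bestHits) { // ties keep the first rotation seen
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

With a dictionary containing "listeria", "risk" and "assessment", fragmented output such as "FS IS L is te ria" scores zero, while a rotation that yields whole words scores higher and is selected. A real dictionary or language model would be needed in practice, and this does nothing for pages with mixed or diagonal rotations.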
Tilman
On 25.09.2017 at 19:31, Allison, Timothy B. wrote:
Colleagues,
Any recommendations for extracting rotated text such as:
https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES
?
Adobe DC gets reasonable text with "save as text". PDFBox's ExtractText (and
Tika) get something like this:
FS
IS
L
is
te
ria
Li
st
er
ia
R
is
k
R
is
k
As
se
ss
m
en
Thank you!
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]