On 08.10.2019 at 19:19, Merrick, Scott wrote:
We are seeing issues when extracting text from a PDF whose text is
rotated 90 degrees counterclockwise.
The extracted text is broken into 2-3 characters per line. The characters
seem to be read in the correct order, since the text is still (sort of)
readable.
This appears to be the same as TIKA-723:
https://issues.apache.org/jira/browse/TIKA-723
That one can probably be closed.
The problem also occurs in the current Tika release, using tika-app-1.22.jar.
I did see TIKA-2779:
https://issues.apache.org/jira/browse/TIKA-2779
which mentions better handling of rotated text, but I am still not
able to properly parse the sample PDF I have.
Are there some parameters that have to be set that I am not aware of?
Did you try the "detectAngles" setting?
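
For reference, a minimal sketch of how that setting might be switched on
through a Tika config file. Only the name "detectAngles" comes from this
thread. The layout follows the usual parameter style for the PDFParser in a
Tika config file, and the file names tika-config.xml and sample.pdf are
hypothetical, so please check this against the documentation for your Tika
version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- tika-config.xml (hypothetical file name) -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but leave PDFs to the entry below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDF parser with angle detection turned on (assumed parameter form) -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="detectAngles" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

With tika-app, the file would then be passed on the command line, e.g.
java -jar tika-app-1.22.jar --config=tika-config.xml -t sample.pdf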
Tilman
Thanks,
Scott Merrick