Extracting character-level text coordinates

Luca Loiodice Tue, 18 Feb 2020 12:26:07 -0800

Hello,

I am having trouble extracting precise character-level text coordinates
from a PDF.


I have overridden PDFTextStripper's writeString(String text,
List<TextPosition> textPositions) to access the per-character information.

This is the code I use to extract the values from each TextPosition
and pass them to my own CharacterTextPosition object:

CharacterTextPosition characterTextPosition = new CharacterTextPosition();

characterTextPosition.SetCharacterText(textPosition.getUnicode());
characterTextPosition.SetLeft(textPosition.getXDirAdj());
characterTextPosition.SetBottom(
        pdPage.getMediaBox().getHeight() - textPosition.getYDirAdj());
characterTextPosition.SetWidth(textPosition.getWidthDirAdj());
characterTextPosition.SetHeight(textPosition.getHeight());
int characterDirection = (int) textPosition.getDir();
characterTextPosition.SetOrientation(characterDirection);
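To make the arithmetic explicit, the coordinate conversion in the snippet above can be sketched as a standalone helper. This is only an illustration: the class GlyphBoxSketch and the method toBottomLeftBox are hypothetical names, and it assumes that getYDirAdj() is measured downward from the top of the page, which is why it is subtracted from the media box height.

```java
// Hypothetical helper, not part of PDFBox: mirrors the arithmetic of the snippet above.
class GlyphBoxSketch {

    /**
     * Converts a top-left-origin text position into a bottom-left-origin box.
     *
     * @param xDirAdj    value of textPosition.getXDirAdj()
     * @param yDirAdj    value of textPosition.getYDirAdj() (assumed: measured from the page top)
     * @param width      value of textPosition.getWidthDirAdj()
     * @param height     value of textPosition.getHeight()
     * @param pageHeight value of pdPage.getMediaBox().getHeight()
     * @return {left, bottom, width, height} in bottom-left-origin page units
     */
    static float[] toBottomLeftBox(float xDirAdj, float yDirAdj,
                                   float width, float height, float pageHeight) {
        // Flip the vertical axis: distance from the top becomes distance from the bottom.
        float bottom = pageHeight - yDirAdj;
        return new float[] { xDirAdj, bottom, width, height };
    }
}
```

If the Y value is in fact the text baseline, a box anchored there with getHeight() as its height would not cover descenders, which may be part of what I am seeing.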

Here is a PDF of a PowerPoint slide with the extracted character
rectangles drawn on top of the text:
https://www.dropbox.com/s/hp1dape5mp2l8ti/PPT_Slide_CommonFont.pdf.text_extraction_rectangles.pdf?dl=0
As you can see, the rectangles are smaller than the characters.

I feel this might have to do with the font. I see that textPosition.getFont()
returns the font information for the TextPosition.
Is there a way to adjust my code to get more accurate coordinates?

Thanks a lot,
Luca
