Hello there,

>
> I'm using PDFTextStripper to get text from a PDF document but I need to get 
> text only from some regions in the PDF. I know these regions are being drawn 
> using the "re" operator which draws a rectangle using x,y,width,height as 
> arguments. How do I convert these four arguments to display units so I can 
> compare them with the TextPosition.getX()?
>

The PDF "re" operator is handled by the class
org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath. As
the package name indicates, this class is meant to be used from within
the PageDrawer utility, not from within the PDFTextStripper utility.
If you take a look at this class, you will see that the actual
transformation is implemented in the method
org.apache.pdfbox.pdfviewer.PageDrawer#transformedPoint(double,
double).
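
To give a rough idea of what such a transformation amounts to, here is
a minimal sketch using only java.awt.geom. It is not the PDFBox code
itself: the class ReRectangleSketch and its methods are hypothetical
names, and it assumes that you know the current transformation matrix
(as an AffineTransform) and the page height, that the crop box starts
at 0,0, and that the text coordinates you compare against have their
origin at the top-left of the page.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class ReRectangleSketch {

    /**
     * Maps a user-space point through the CTM, then flips y so that
     * 0,0 is the top-left corner of the page (assumed convention).
     */
    static Point2D.Double toDisplay(AffineTransform ctm, double pageHeight,
                                    double x, double y) {
        Point2D.Double p = new Point2D.Double();
        ctm.transform(new Point2D.Double(x, y), p);
        p.y = pageHeight - p.y;
        return p;
    }

    /**
     * Converts the four "re" operands (x, y, width, height) into a
     * rectangle in display units. Only two opposite corners are mapped,
     * so this is exact for scaling, translation and 90-degree rotations,
     * but not for arbitrary rotation or skew.
     */
    static Rectangle2D.Double reToDisplay(AffineTransform ctm, double pageHeight,
                                          double x, double y,
                                          double width, double height) {
        Point2D.Double a = toDisplay(ctm, pageHeight, x, y);
        Point2D.Double b = toDisplay(ctm, pageHeight, x + width, y + height);
        Rectangle2D.Double r = new Rectangle2D.Double();
        // setFrameFromDiagonal normalizes the corners, so the flipped y
        // still yields a rectangle with positive width and height.
        r.setFrameFromDiagonal(a, b);
        return r;
    }
}
```

For example, with an identity matrix and a page height of 792, the
operands 100 700 200 50 map to a rectangle whose top-left corner is at
100,42 and whose size is 200 by 50.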

If I were given a similar task, I would perform two runs on the PDF
document. First, I would use the PageDrawer utility to capture the
rectangular areas (simply override #fillPath(int) and/or #strokePath,
and grab #getLinePath there). Then I would use PDFTextStripper (or
better yet, PDFTextStripperByArea) to extract text from the previously
captured rectangular areas.
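
The bookkeeping between the two runs could look roughly like the
sketch below. It deliberately uses only java.awt.geom so it can be
tried without PDFBox: RegionCaptureSketch, capture, regionIndexAt and
rectanglePath are hypothetical names, and the GeneralPath built in
rectanglePath merely stands in for what #getLinePath would hand you
inside the overridden #fillPath(int) or #strokePath. It also assumes
the path is already in display units.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical collector for illustration; not part of PDFBox. */
final class RegionCaptureSketch {

    private final List<Rectangle2D> regions = new ArrayList<>();

    /** First run: call this from the fill/stroke hook with the current path. */
    void capture(GeneralPath linePath) {
        // The bounding box of a rectangular path is the rectangle itself.
        regions.add(linePath.getBounds2D());
    }

    /** The captured rectangles, in the order they were drawn. */
    List<Rectangle2D> getRegions() {
        return regions;
    }

    /**
     * Second run: returns the index of the first captured region that
     * contains the given text position, or -1 if none does.
     */
    int regionIndexAt(double x, double y) {
        for (int i = 0; i < regions.size(); i++) {
            if (regions.get(i).contains(x, y)) {
                return i;
            }
        }
        return -1;
    }

    /** Stand-in for a path produced by a "re" operator. */
    static GeneralPath rectanglePath(double x, double y, double w, double h) {
        GeneralPath path = new GeneralPath();
        path.moveTo(x, y);
        path.lineTo(x + w, y);
        path.lineTo(x + w, y + h);
        path.lineTo(x, y + h);
        path.closePath();
        return path;
    }
}
```

If you go the PDFTextStripperByArea route instead of comparing
positions yourself, the captured rectangles should be usable more or
less directly, since its regions are registered as Rectangle2D
objects.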


VR
