> Hello there,
>
> I'm using PDFTextStripper to get text from a PDF document but I need to get
> text only from some regions in the PDF. I know these regions are being drawn
> using the "re" operator which draws a rectangle using x,y,width,height as
> arguments. How do I convert these four arguments to display units so I can
> compare them with the TextPosition.getX()?

The PDF "re" operator is handled by the class org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath. As the package name indicates, this class is meant to be used from within the PageDrawer utility, not from within the PDFTextStripper utility. If you take a look at this class, you will see that the actual transformation is implemented in the method org.apache.pdfbox.pdfviewer.PageDrawer#transformedPoint(double, double).

If I were given a similar task, I would perform two runs over the PDF document. First, I would use the PageDrawer utility to capture the rectangular areas (simply override #fillPath(int) and/or #strokePath, and grab #getLinePath there). Then I would use PDFTextStripper (or better yet, PDFTextStripperByArea) to extract text from the previously captured rectangular areas.

VR
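
For the coordinate conversion asked about above, here is a minimal sketch of the idea using only java.awt.geom, not PDFBox itself. It assumes that the display space puts the origin at the top-left corner of the page (so y is flipped against the page height after the current transformation matrix is applied), which is my reading of what PageDrawer#transformedPoint does; please check that against the PDFBox version you use. The class name ReRectToDisplay and both method names are hypothetical, and the CTM and page height are values you would have to capture yourself while processing the content stream.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class ReRectToDisplay {

    // Converts the four "re" operands (x, y, width, height) from PDF user
    // space into a display-space rectangle.
    // Assumption: display space has its origin at the top-left of the page,
    // so after applying the CTM each y value is replaced by pageHeight - y.
    static Rectangle2D toDisplay(double x, double y, double w, double h,
                                 AffineTransform ctm, double pageHeight) {
        // The four corners of the rectangle as (x, y) pairs.
        double[] pts = { x, y, x + w, y, x + w, y + h, x, y + h };
        // Apply the current transformation matrix to all four corners.
        ctm.transform(pts, 0, pts, 0, 4);

        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < 4; i++) {
            double px = pts[2 * i];
            double py = pageHeight - pts[2 * i + 1]; // flip to top-left origin
            minX = Math.min(minX, px);
            maxX = Math.max(maxX, px);
            minY = Math.min(minY, py);
            maxY = Math.max(maxY, py);
        }
        // Axis-aligned bounding box of the transformed corners.
        return new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY);
    }

    // True if a text origin (for example the values read from a
    // TextPosition, assuming they are in the same display space)
    // falls inside the region.
    static boolean containsText(Rectangle2D region, double textX, double textY) {
        return region.contains(textX, textY);
    }

    public static void main(String[] args) {
        // Identity CTM on a page 792 units high, rectangle "100 700 200 50 re".
        Rectangle2D r = toDisplay(100, 700, 200, 50, new AffineTransform(), 792);
        System.out.println(r);
        System.out.println(containsText(r, 150, 60));
    }
}
```

A rectangle computed this way could also be passed as a region to PDFTextStripperByArea, provided the stripper uses the same top-left-origin coordinates; that too is an assumption worth verifying on a test page before relying on it.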

