Hello there,
>
> I thought of continuing to use 0.8 version for my purpose for now.
> Hoping I will have the easier way to achieve it in the later versions of
> PDFBox.
>
> The reason for this email is, I am having a difference in the data I receive
> if I run
> PDFTextStripper.writeText() and if I extend
> PDFTextStripper.processTextPosition( ).
> For example, I have attached a one-page pdf I used for this.
It is unclear to me why you insist on using PDFTextStripper#processTextPosition(TextPosition) for this job when better alternatives are available.

The example document you sent me is the second page of the Freddie Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which has a fixed 3-column layout. To extract field values, you first need to find out their bounding boxes. As long as there is no PDFBox GUI available, I suggest using Foxit PDF Editor for that (select an element and open "Property List" from its context menu).

Then instantiate a PDFTextStripperByArea and populate it by invoking PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field. Next, process the page by invoking PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve the field values by invoking PDFTextStripperByArea#getTextForRegion(String) for every field. Note that you do not need to override any methods in class PDFTextStripperByArea; the public API does just fine.

I have attached a sample application (FreddieMacForm70.java) that extracts the fields "Sale Price", "Date of Sale/Time", and "Gross Living Area" for all 3 comparable sales. You can add other fields as needed.

VR
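For illustration, here is a minimal sketch of how the named regions for a fixed 3-column layout could be prepared before they are handed to PDFTextStripperByArea. It is not the attached FreddieMacForm70.java: the class name Form70Regions, the "#1".."#3" naming scheme, and every coordinate are hypothetical placeholders, and the real bounding boxes have to be read from the document as described above. The sketch uses only the JDK; the PDFBox calls named in this message appear as comments.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class Form70Regions {

    // Field names taken from the message; one row per field.
    static final String[] FIELDS = {
        "Sale Price", "Date of Sale/Time", "Gross Living Area"
    };

    /**
     * Builds one named rectangle per field and per comparable-sale column.
     * All numeric arguments are assumptions to be replaced with measured values.
     */
    static Map<String, Rectangle2D> buildRegions(double firstColumnX, double columnWidth,
                                                 double[] rowY, double rowHeight, int columns) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<String, Rectangle2D>();
        for (int c = 0; c < columns; c++) {
            for (int r = 0; r < FIELDS.length; r++) {
                String name = FIELDS[r] + " #" + (c + 1);
                double x = firstColumnX + c * columnWidth;
                regions.put(name, new Rectangle2D.Double(x, rowY[r], columnWidth, rowHeight));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Placeholder coordinates only; measure the real ones in the PDF.
        double[] rowY = {100, 112, 300};
        Map<String, Rectangle2D> regions = buildRegions(200, 130, rowY, 12, 3);

        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            // With PDFBox on the classpath, each entry would be registered via
            //   stripper.addRegion(e.getKey(), e.getValue());
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // Afterwards: stripper.extractRegions(page), then
        // stripper.getTextForRegion(name) for every registered name.
    }
}
```

Keeping the rectangles in an ordered map means the same region names can be reused when reading the values back with getTextForRegion(String).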

