You are right, I am trying to parse that form. The reason I am trying to use processTextPosition is that we will be doing this programmatically; there will be no one selecting the region. We will also be extracting the data from forms generated by different providers, which do not look exactly the same. For example, the whole page can look kind of squished. I tried PDFTextStripperByArea#extractRegions(PDPage), but since the positions are not exactly the same from one form to the next, I either lose data or pick up data from the next column.
Is there a way to find the column coordinates for PDFTextStripperByArea#extractRegions(PDPage) programmatically, so that the regions are more accurate?

From: Villu Ruusmann <[email protected]>
To: [email protected]
Cc: [email protected]
Date: 02/26/2010 02:47 AM
Subject: Re: PDFTextStripper.processTextPosition

Hello there,

>
> I thought of continuing to use 0.8 version for my purpose for now.
> Hoping I will have the easier way to achieve it in the later versions of PDFBox.
>
> The reason for this email is, I am having a difference in the data I receive if I run
> PDFTextStripper.writeText() and if I extend PDFTextStripper.processTextPosition( ).
> For example, I have attached a one-page pdf I used for this.

It is unclear to me why do you insist using PDFTextStripper#processTextPosition(TextPosition) to do the job when there are better alternatives available.

The example document you sent to me is the second page of the Freddie Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which has a fixed 3-column layout. In order to extract field values, you need to find out their bounding boxes. For as long as there is no PDFBox GUI around I suggest you to use Foxit PDF Editor for that (select an element and open "Property List" from its context menu).

Then, instantiate a PDFTextStripperByArea and populate it by invoking PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field. Then, process the page by invoking PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field values by invoking PDFTextStripperByArea#getTextForRegion(String) for every field. Note that you do not need to override any methods in class PDFTextStripperByArea - the public API does just fine.

I have attached a sample application (FreddieMacForm70.java) that extracts the fields "Sale Price", "Date of Sale/Time", and "Gross Living Area" for all 3 comparable sales. You can add other fields as needed.
VR

[attachment "FreddieMacForm70.java" deleted by Rekha Hariramakrishnan/Flagstar_notes]
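
On the question of finding the column coordinates programmatically, here is a minimal sketch of one possible approach. It is not taken from PDFBox or from the FreddieMacForm70.java attachment, and it does not call PDFBox at all. It assumes the left and right x-extents of every glyph in one row band of the form have already been collected, for instance from TextPosition#getX() and TextPosition#getWidth() inside processTextPosition. The names ColumnFinder, findColumns, toRegions and the minGap threshold are hypothetical and introduced only for illustration. Glyphs separated by less than minGap are merged into one column, and each column becomes a Rectangle2D of the kind that PDFTextStripperByArea#addRegion(String, Rectangle2D) accepts.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch: derive column regions from glyph x-extents instead of hard-coding them. */
final class ColumnFinder {

    /**
     * Merges glyph extents {left, right} into columns {left, right}.
     * A glyph joins the current column when the horizontal gap between the
     * column's right edge and the glyph's left edge is smaller than minGap.
     */
    static List<double[]> findColumns(List<double[]> glyphExtents, double minGap) {
        // Work on a sorted copy so the caller's list is left untouched.
        List<double[]> sorted = new ArrayList<>(glyphExtents);
        sorted.sort(Comparator.comparingDouble((double[] e) -> e[0]));

        List<double[]> columns = new ArrayList<>();
        for (double[] glyph : sorted) {
            double[] last = columns.isEmpty() ? null : columns.get(columns.size() - 1);
            if (last != null && glyph[0] - last[1] < minGap) {
                last[1] = Math.max(last[1], glyph[1]);          // extend the current column
            } else {
                columns.add(new double[] {glyph[0], glyph[1]}); // start a new column
            }
        }
        return columns;
    }

    /** Turns each column into a rectangle covering the row band starting at top. */
    static List<Rectangle2D> toRegions(List<double[]> columns, double top, double height) {
        List<Rectangle2D> regions = new ArrayList<>();
        for (double[] c : columns) {
            regions.add(new Rectangle2D.Double(c[0], top, c[1] - c[0], height));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Made-up extents standing in for values collected from TextPosition objects.
        List<double[]> glyphs = Arrays.asList(
                new double[] {50, 55}, new double[] {56, 60}, new double[] {62, 70},
                new double[] {200, 210}, new double[] {211, 220},
                new double[] {400, 405});
        for (Rectangle2D region : toRegions(findColumns(glyphs, 20.0), 100.0, 12.0)) {
            System.out.println(region);
        }
    }
}
```

The threshold has to be larger than the spacing between words inside a column and smaller than the gutter between columns, so for a squished page it would probably need to be scaled with the page width rather than fixed. Whether the resulting rectangles line up with what extractRegions expects would still need to be checked against a real page from each provider.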

