Thanks for the tip about checking File -> Properties. Apparently the software that is generating and/or editing these PDF documents comes from "www.activepdf.com".
I can use that information to at least try to follow up with that 3rd party provider and see if there are any options for getting alternate versions of those documents that don't contain these annotations.

As a follow-up note: the PDF version is listed as (PDF Version 1.5 Acrobat 6.x).

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Friday, January 14, 2011 12:44 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Can PDFBox extract text from PDF Documents that have "text boxes" ?

I'm not very familiar with these text boxes or with text extraction, but it sounds like that might be a newer feature of the PDF specification which simply has not been implemented yet. But that's just a guess.

If you can find out what software created those PDF files, it might help give us some more information. In Adobe Acrobat: File -> Properties; check the "PDF Producer" and "PDF Version".

If you can get the software which was used and create a test PDF which fails to extract text, we could look over the technical data and better help you figure out what's going on.

----
Thanks,
Adam

From: "Lupton, Chris B." <[email protected]>
To: "[email protected]" <[email protected]>
Date: 01/14/2011 09:19
Subject: Can PDFBox extract text from PDF Documents that have "text boxes" ?

I have PDF documents that have apparently been edited by some kind of PDF writing application. When edits are made, people are adding "Text Boxes" to the documents instead of just removing or editing the existing text. Each of the edits has a colored border around it. These "Text Boxes" are always placed in between the original lines of text. If the document were not locked, I could click and drag the boxes of text around on the screen. When I mouse over them, right-click and select Properties, the window displayed is titled "Text Box Properties."

When I attempt to extract text from the PDF document, I either get runtime exceptions from within PDFBox's API, or I get text back but NONE of the text from these "Text Boxes" is captured.

Does anyone have a working sample of code that can successfully retrieve text from something like this?

I would love to provide an example; unfortunately the PDFs contain proprietary information, so I am not allowed to do that.
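For what it is worth, here is a minimal sketch of one way this might be approached. It rests on assumptions that cannot be verified without a sample file: that the "text boxes" are stored as FreeText annotations (which is how Acrobat's Text Box tool normally saves them), and that a PDFBox 2.x jar is on the classpath (PDDocument.load and getPages are 2.x calls; the 1.x and 3.x entry points differ). PDFTextStripper reads the page content streams, so annotation text would have to be collected separately, roughly like this. The class and method names are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;

// Hypothetical helper class, not part of PDFBox.
public class FreeTextAnnotationDump {

    // Collects the /Contents string of every FreeText annotation, in page order.
    static List<String> freeTextContents(PDDocument doc) throws IOException {
        List<String> result = new ArrayList<>();
        for (PDPage page : doc.getPages()) {
            for (PDAnnotation annotation : page.getAnnotations()) {
                // Assumption: the "text boxes" are annotations of subtype FreeText.
                if ("FreeText".equals(annotation.getSubtype())) {
                    String contents = annotation.getContents();
                    if (contents != null) {
                        result.add(contents);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF file to inspect.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            for (String text : freeTextContents(doc)) {
                System.out.println(text);
            }
        }
    }
}
```

If this prints nothing for the affected files, printing annotation.getSubtype() for every annotation would show how the boxes are actually stored. Merging the annotation text back in between the regular lines would additionally need each annotation's rectangle (annotation.getRectangle()), which this sketch does not attempt.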

