RE: Can PDFBox extract text from PDF Documents that have "text boxes" ?

Lupton, Chris B. Mon, 17 Jan 2011 06:35:53 -0800

Thanks for the tip about checking the File -> Properties.

Apparently the software that generates and/or edits these PDF documents 
comes from "www.activepdf.com".


I can use that information to at least try to follow up with that 3rd-party 
provider and see if there are any options for getting alternate versions
of those documents that don't contain these annotations.

As a follow-up note:
The PDF version is listed as (PDF Version 1.5, Acrobat 6.x).



-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Friday, January 14, 2011 12:44 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Can PDFBox extract text from PDF Documents that have "text boxes" ?

I'm not very familiar with these text boxes or with text extraction, but it 
sounds like that might be a newer feature of the PDF specification which 
simply has not been implemented yet.  But that's just a guess.  If you can 
find out what software created those PDF files, it might give us some 
more information.  In Adobe Acrobat: File -> Properties; check the "PDF 
Producer" and "PDF Version".  If you can get the software which was used 
and create a test PDF which fails to extract text, we could look over the 
technical data and better help you figure out what's going on.
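
If Acrobat is not at hand, the version can also be read from the file 
itself, since a PDF starts with a "%PDF-x.y" header line.  Below is a 
minimal sketch using only the Java standard library; the class and method 
names are made up for illustration and are not part of PDFBox.  Note that 
a document catalog may carry a /Version entry that overrides the header, 
so treat the result as a first hint rather than the final word.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
final class PdfHeaderVersion {
    // Matches the "%PDF-x.y" marker that opens a PDF file.
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Returns the version found in the leading bytes, or null if there is none. */
    static String parse(byte[] leadingBytes) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(leadingBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** Reads up to the first 1024 bytes of the file and parses the header from them. */
    static String read(Path pdf) throws IOException {
        try (InputStream in = Files.newInputStream(pdf)) {
            byte[] buf = new byte[1024];
            int n = in.read(buf);
            return n <= 0 ? null : parse(Arrays.copyOf(buf, n));
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Header version: " + read(Paths.get(args[0])));
    }
}
```

Run it with the path to a PDF as the only argument.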

---- 
Thanks,
Adam





From:
"Lupton, Chris B." <[email protected]>
To:
"[email protected]" <[email protected]>
Date:
01/14/2011 09:19
Subject:
Can PDFBox extract text from PDF Documents that have "text boxes" ?



I have PDF documents that have apparently been edited by some kind of PDF 
writing application.
When edits are made, people are adding "Text Boxes" to the documents 
instead of just removing/editing the existing text.
Each of the edits has a colored boundary around it.
These "Text Boxes" are always placed in between original lines of text.

If the document were not locked, I could click and drag the boxes of text 
around on the screen.
When I mouse over them, right-click and select Properties, 
the window displayed is titled "Text Box Properties."

When I attempt to extract text from the PDF document, 
I either get runtime exceptions from within PDFBox's API, 
or I get text back, but NONE of the text from these "Text Boxes" is 
captured.


Does anyone have a working sample of code that can successfully retrieve 
text from something like this?

I would love to provide an example, unfortunately the PDFs contain 
proprietary information so I am not allowed to do that.
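
For anyone hitting the same problem: Acrobat's Text Box tool stores its 
content as a "FreeText" annotation rather than in the page's content 
stream.  The sketch below is not an extractor; it is only a rough 
diagnostic, using the Java standard library, that counts how often the 
name "/FreeText" appears in the raw file.  The class and method names are 
made up for illustration.  It assumes the annotation dictionaries are not 
stored inside compressed object streams (which PDF 1.5 allows), so a count 
of zero proves nothing, and related names such as "/FreeTextCallout" are 
counted too.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic, not a PDFBox class.
final class FreeTextMarkerScan {
    private static final String MARKER = "/FreeText";

    /** Counts non-overlapping occurrences of "/FreeText" in the raw bytes. */
    static int countMarkers(byte[] pdfBytes) {
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int count = 0;
        int i = raw.indexOf(MARKER);
        while (i >= 0) {
            count++;
            i = raw.indexOf(MARKER, i + MARKER.length());
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("\"/FreeText\" occurrences: " + countMarkers(bytes));
    }
}
```

A nonzero count suggests that the missing text lives in annotations, so 
it would have to be read through the annotation objects of each page 
rather than through page text extraction.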









