Can PDFBox extract text from PDF Documents that have "text boxes"?

Lupton, Chris B. Fri, 14 Jan 2011 09:19:22 -0800

I have PDF Documents that have apparently been edited by some kind of PDF 
Writing Application.
When edits are made, people are adding "Text Boxes" to the Documents instead 
of just removing or editing the existing Text.
Each of the Edits has a colored border around it.
These "Text Boxes" are always placed in between the original lines of Text.


If the Document were not locked, I could click and drag the boxes of Text 
around on the Screen.
When I mouse over them, right-click, and select Properties,
the window displayed is titled "Text Box Properties."

When I attempt to extract text from the PDF Document,
I either get runtime exceptions from within PDFBox's API,
or I get Text back, but NONE of the text from these "Text Boxes" is captured.


Does anyone have a working code sample that can successfully retrieve Text 
from something like this?

I would love to provide an example, but unfortunately the PDFs contain 
proprietary information, so I am not allowed to do that.
