Re: PDFTextStripper.processTextPosition

Villu Ruusmann Thu, 25 Feb 2010 23:48:00 -0800

Hello there,

>
> I thought of continuing to use 0.8 version for my purpose for now.
> Hoping I will have the easier way to achieve it in the later versions of 
> PDFBox.
>
> The reason for this email is, I am having a difference in the data I receive 
> if  I run
> PDFTextStripper.writeText() and if I extend 
> PDFTextStripper.processTextPosition( ).
> For example, I have attached a one-page pdf I used for this.


It is unclear to me why you insist on using
PDFTextStripper#processTextPosition(TextPosition) to do the job when
there are better alternatives available.

The example document you sent to me is the second page of the Freddie
Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
has a fixed 3-column layout.

In order to extract field values, you need to find out their bounding
boxes. Until PDFBox has a GUI for this, I suggest using
Foxit PDF Editor (select an element and open "Property
List" from its context menu). Then, instantiate a
PDFTextStripperByArea and populate it by invoking
PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
Then, process the page by invoking
PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
values by invoking PDFTextStripperByArea#getTextForRegion(String) for
every field. Note that you do not need to override any methods in
class PDFTextStripperByArea - the public API does just fine.
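
As a rough illustration of the region setup described above, here is a
minimal sketch that uses only the JDK. The class name Form70Regions, the
region names, and all coordinates are hypothetical placeholders, not
values measured from Form 70; the real bounding boxes have to be read
from the document as described. Each map entry is what you would pass
to PDFTextStripperByArea#addRegion(String, Rectangle2D).

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: every name and number below is a hypothetical placeholder.
class Form70Regions {

    // Assumed left edges of the three comparable-sale columns.
    static final double[] COLUMN_X = {200, 330, 460};
    static final double COLUMN_WIDTH = 120;
    static final double ROW_HEIGHT = 12;

    // Assumed field keys and the top edge of each field's row.
    static final String[] FIELDS = {"salePrice", "dateOfSale", "grossLivingArea"};
    static final double[] FIELD_Y = {150, 300, 450};

    // Builds one named rectangle per (column, field) pair, in insertion order.
    static Map<String, Rectangle2D> build() {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        for (int c = 0; c < COLUMN_X.length; c++) {
            for (int f = 0; f < FIELDS.length; f++) {
                String name = "comparable" + (c + 1) + "." + FIELDS[f];
                regions.put(name, new Rectangle2D.Double(
                        COLUMN_X[c], FIELD_Y[f], COLUMN_WIDTH, ROW_HEIGHT));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // With PDFBox on the classpath, each entry would go to addRegion(name, rect),
        // followed by extractRegions(page) and getTextForRegion(name) per field.
        for (Map.Entry<String, Rectangle2D> e : build().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Because the layout is a fixed 3-column grid, generating the rectangles
from a few column and row offsets keeps the field list easy to extend.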

I have attached a sample application (FreddieMacForm70.java) that
extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
Living Area" for all 3 comparable sales. You can add other fields as
needed.


VR
