Re: PDFTextStripper.processTextPosition

Rekha . Hariramakrishnan Fri, 26 Feb 2010 06:09:34 -0800

You are right, I am trying to parse that form. The reason I am trying to 
use processTextPosition is that we will be doing this programmatically; 
there will be no one selecting the region. Also, we will be extracting the 
data from forms generated by different providers, which do not look exactly 
the same. For example, the whole page can look somewhat squished. I tried 
PDFTextStripperByArea#extractRegions(PDPage), but since the positions are 
not exactly the same, it causes me to lose data or pick up data from the 
next column.


Is there a way to find the column coordinates for 
PDFTextStripperByArea#extractRegions(PDPage) programmatically, so that the 
regions are more accurate?






From:
Villu Ruusmann <[email protected]>
To:
[email protected]
Cc:
[email protected]
Date:
02/26/2010 02:47 AM
Subject:
Re: PDFTextStripper.processTextPosition



Hello there,

>
> I thought of continuing to use 0.8 version for my purpose for now.
> Hoping I will have the easier way to achieve it in the later versions
> of PDFBox.
>
> The reason for this email is, I am having a difference in the data I
> receive if I run PDFTextStripper.writeText() and if I extend
> PDFTextStripper.processTextPosition( ).
> For example, I have attached a one-page pdf I used for this.

It is unclear to me why you insist on using
PDFTextStripper#processTextPosition(TextPosition) to do the job when
there are better alternatives available.

The example document you sent to me is the second page of the Freddie
Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
has a fixed 3-column layout.

In order to extract field values, you need to find out their bounding
boxes. As long as there is no PDFBox GUI around, I suggest you use
Foxit PDF Editor for that (select an element and open "Property
List" from its context menu). Then, instantiate a
PDFTextStripperByArea and populate it by invoking
PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
Then, process the page by invoking
PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
values by invoking PDFTextStripperByArea#getTextForRegion(String) for
every field. Note that you do not need to override any methods in
class PDFTextStripperByArea - the public API does just fine.
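
The steps above could be sketched roughly as follows. This is only an
illustration, not the attached sample application: the class name
ColumnRegions, the helper buildRegions, and every coordinate are
hypothetical placeholders (they were not measured from Form 70). The
PDFBox calls appear only in comments, so the sketch runs with the JDK
alone.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnRegions {

    /**
     * Builds one named rectangle per (field, column) pair for a fixed
     * multi-column layout. All measurements are assumed to be in the
     * same units as the bounding boxes read from the PDF editor.
     */
    public static Map<String, Rectangle2D> buildRegions(
            String[] fields, double[] fieldTops, double rowHeight,
            double firstColumnX, double columnWidth, int columns) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        for (int col = 0; col < columns; col++) {
            // Each column starts one column width to the right of the previous one.
            double x = firstColumnX + col * columnWidth;
            for (int i = 0; i < fields.length; i++) {
                String name = fields[i] + " #" + (col + 1);
                regions.put(name,
                        new Rectangle2D.Double(x, fieldTops[i], columnWidth, rowHeight));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder coordinates; replace with measured bounding boxes.
        String[] fields = {"Sale Price", "Gross Living Area"};
        double[] tops = {100.0, 300.0};
        Map<String, Rectangle2D> regions =
                buildRegions(fields, tops, 12.0, 200.0, 150.0, 3);

        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            // With PDFBox on the classpath, each entry would be registered via
            // PDFTextStripperByArea#addRegion(String, Rectangle2D), for example:
            //   stripper.addRegion(e.getKey(), e.getValue());
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // Afterwards: stripper.extractRegions(page), then
        // stripper.getTextForRegion(name) for every registered name.
    }
}
```

Keeping the region names and rectangles in one map means the same names
can be reused later when reading the values back with
PDFTextStripperByArea#getTextForRegion(String).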

I have attached a sample application (FreddieMacForm70.java) that
extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
Living Area" for all 3 comparable sales. You can add other fields as
needed.


VR
[attachment "FreddieMacForm70.java" deleted by Rekha 
Hariramakrishnan/Flagstar_notes] 



