Re: PDFTextStripper.processTextPosition

Rekha . Hariramakrishnan Tue, 02 Mar 2010 04:54:01 -0800

Yes, it looks like we are both trying to do the same thing. It would be
helpful if PDFTextStripper#processTextPosition(TextPosition) worked as it
did in 0.8, or at least if there were an easier way to make it work that
way.





From:
Daniel Wilson <[email protected]>
To:
[email protected], [email protected]
Date:
03/01/2010 04:28 PM
Subject:
Re: Fwd: PDFTextStripper.processTextPosition



Andrew, if you & Rekha have similar problems, perhaps public discussion
here would result in a good solution.  Villu is following this discussion
closely and did some of the related coding, I believe.

Daniel

On Mon, Mar 1, 2010 at 3:53 PM, <[email protected]> wrote:

>
> Thanks for the reply.  Unfortunately Rekha and I seem to have very
> similar projects.  The PDFs I am trying to parse do vary visually,
> although not by much.  Currently my code looks for keywords, then selects
> text around the keywords based on the graphical position.  I have attached
> an example below.  I have a "glue" routine that combines nearby
> TextPositions that are within a threshold to recreate the words from
> individual characters.  When I don't have to use "glue" I get better
> results...
> Andrew
>
>
>             Zone z = new HorizontalOrder(
>                 new DirectRight(new TextValue("Design (Style)"), 5));
>             z.evaluate(regs);
>             style = z.getMatching().get(1).getValue();
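
The "glue" routine described above is not shown in the thread. As a rough illustration only, the sketch below merges glyphs into words when they share a baseline and the horizontal gap to the previous glyph is within a threshold. The Glyph class is a hypothetical stand-in for PDFBox's TextPosition, and the names GlueSketch and glue are assumptions, not Andrew's code.

```java
import java.util.ArrayList;
import java.util.List;

class GlueSketch {
    // Hypothetical stand-in for a PDFBox TextPosition: one glyph with its
    // text, left edge x, baseline y, and width, all in page units.
    static final class Glyph {
        final String text;
        final double x, y, width;

        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Merge glyphs into words. A new word starts when the baseline differs
    // by more than `threshold` or the gap to the previous glyph's right edge
    // exceeds `threshold`. Input is assumed to be in reading order.
    static List<String> glue(List<Glyph> glyphs, double threshold) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                double gap = g.x - (previous.x + previous.width);
                boolean sameLine = Math.abs(g.y - previous.y) <= threshold;
                if (!sameLine || gap > threshold) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

A single threshold for both axes keeps the sketch short; a real routine would probably scale the gap tolerance by font size.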
>
>
>
>
> Daniel Wilson <[email protected]>
>
> 03/01/2010 01:25 PM
>   To
> [email protected]
> cc
>   Subject
> Fwd: PDFTextStripper.processTextPosition
>
>
>
>
> Andrew,
>
> Does this answer your question?  It at least looks similar ... and Villu
> has a better handle on what was done & why in that area than do I.
>
> Daniel
>
> ---------- Forwarded message ----------
> From: <[email protected]>
> Date: Fri, Feb 26, 2010 at 9:08 AM
> Subject: Re: PDFTextStripper.processTextPosition
> To: Villu Ruusmann <[email protected]>
> Cc: [email protected]
>
>
> You are right, I am trying to parse that form. The reason I am trying to
> use processTextPosition is that we will be doing this programmatically;
> there will be no one selecting the region. Also, we will be extracting the
> data from forms generated by different providers, which do not look
> exactly the same. For example, the whole page may look somewhat squished.
> I tried PDFTextStripperByArea#extractRegions(PDPage), but since the
> positions are not exactly the same, I either lose data or pick up data
> from the next column.
>
> Is there a way to find the coordinates for
> PDFTextStripperByArea#extractRegions(PDPage) columns programmatically to
> be more accurate?
>
>
>
>
>
>
> From:
> Villu Ruusmann <[email protected]>
> To:
> [email protected]
> Cc:
> [email protected]
> Date:
> 02/26/2010 02:47 AM
> Subject:
> Re: PDFTextStripper.processTextPosition
>
>
>
> Hello there,
>
> >
> > I thought of continuing to use the 0.8 version for my purposes for now,
> > hoping there will be an easier way to achieve this in later versions of
> > PDFBox.
> >
> > The reason for this email is that I get different data depending on
> > whether I run PDFTextStripper.writeText() or extend
> > PDFTextStripper.processTextPosition().
> > For example, I have attached a one-page pdf I used for this.
>
> It is unclear to me why you insist on using
> PDFTextStripper#processTextPosition(TextPosition) to do the job when
> there are better alternatives available.
>
> The example document you sent to me is the second page of the Freddie
> Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
> has a fixed 3-column layout.
>
> In order to extract field values, you need to find out their bounding
> boxes. As long as there is no PDFBox GUI around, I suggest you
> use Foxit PDF Editor for that (select an element and open "Property
> List" from its context menu). Then, instantiate a
> PDFTextStripperByArea and populate it by invoking
> PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
> Then, process the page by invoking
> PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
> values by invoking PDFTextStripperByArea#getTextForRegion(String) for
> every field. Note that you do not need to override any methods in
> class PDFTextStripperByArea - the public API does just fine.
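
The attached FreddieMacForm70.java is not preserved in this archive. To illustrate the three-step workflow described above (register named regions, process the page, read back the text per region), here is a small model that uses only the JDK. It is not PDFBox: RegionSketch and its Glyph class are hypothetical stand-ins, with method names chosen to mirror the PDFTextStripperByArea calls named in the message.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of region-based extraction; NOT the PDFBox class.
class RegionSketch {
    // Hypothetical stand-in for a positioned glyph on a page.
    static final class Glyph {
        final String text;
        final double x, y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> text = new LinkedHashMap<>();

    // Step 1: register a named bounding box for every field.
    void addRegion(String name, Rectangle2D area) {
        regions.put(name, area);
    }

    // Step 2: process the page, appending each glyph to every region whose
    // rectangle contains the glyph's position.
    void extractRegions(List<Glyph> page) {
        text.clear();
        for (String name : regions.keySet()) {
            text.put(name, new StringBuilder());
        }
        for (Glyph g : page) {
            for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
                if (e.getValue().contains(g.x, g.y)) {
                    text.get(e.getKey()).append(g.text);
                }
            }
        }
    }

    // Step 3: read back the text collected for one field.
    String getTextForRegion(String name) {
        StringBuilder sb = text.get(name);
        return sb == null ? "" : sb.toString();
    }
}
```

The model also shows why fixed rectangles are fragile for forms that vary between providers: a glyph that shifts just outside its rectangle is silently dropped or lands in a neighbouring region.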
>
> I have attached a sample application (FreddieMacForm70.java) that
> extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
> Living Area" for all 3 comparable sales. You can add other fields as
> needed.
>
>
> VR
> [attachment "FreddieMacForm70.java" deleted by Rekha
> Hariramakrishnan/Flagstar_notes]
>
