Re: PDFTextStripper.processTextPosition

George Van Treeck Tue, 02 Mar 2010 10:14:14 -0800

I concur. I have not upgraded to 1.0 because I want words rather than characters.


I think the issue is one of API design philosophy. It appears to me that the 
API change maps to the underlying PDF specification which is known as a 
functional design (each function/method maps to some component/facet of the PDF 
specification). I would prefer an API based on use-cases. Examples of use-cases 
are "Give me a word", "Give me a line of words", "Give me a sentence", "Give me a 
paragraph", and "Give me a table". From a use-case point of view, I don't care 
and don't want to know about character spacing, positioning, etc.
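
As a rough illustration of the use-case style described above, a facade along these lines would hide character spacing and positioning from the caller. This is only a sketch: `TextDocument` and `PlainTextDocument` are hypothetical names that do not exist in PDFBox, and the toy implementation works on already-extracted plain text purely to show the shape of the calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical use-case-oriented facade; none of these names exist in PDFBox. */
interface TextDocument {
    List<String> words();      // "Give me a word"
    List<String> lines();      // "Give me a line of words"
    List<String> sentences();  // "Give me a sentence"
}

/** Toy implementation over plain text, only to show the call shape. */
final class PlainTextDocument implements TextDocument {
    private final String text;

    PlainTextDocument(String text) {
        this.text = text;
    }

    @Override
    public List<String> words() {
        return split("\\s+");
    }

    @Override
    public List<String> lines() {
        return split("\\R");
    }

    @Override
    public List<String> sentences() {
        // Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        return split("(?<=[.!?])\\s+");
    }

    /** Splits the text on the given regex and drops empty pieces. */
    private List<String> split(String regex) {
        List<String> out = new ArrayList<>();
        for (String part : text.trim().split(regex)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

The caller asks for words, lines, or sentences and never sees a glyph or a coordinate, so a change in how the underlying positions are computed would not have to change this interface.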

Another example of functional design that all "normal" people hate is the W3C 
DOM-based APIs. That is why there are Java APIs like dom4j and JDOM layered on 
top of the ugly DOM and SAX APIs. That is why PHP has the SimpleDOM API, etc. XML 
users think in terms of entities and attributes. And thus, they want APIs that 
operate on entities and attributes -- not nodes, leaves, etc.

An API based on use-cases will change very little with changes to PDF 
specifications, etc., which provides an easier upgrade path. An API based on 
use-cases also makes adoption by new users much easier, because new users don't 
have to learn about PDF formats to use the libraries. The 1.0 API change has 
moved further away from a user-based API toward a functional API, which is a very 
bad thing to do. And that is why there are a lot of complaints about the API now 
being "broken". From a use-case point of view, the API has suffered a very 
serious regression.

George Van Treeck


----- Original Message ----
From: "[email protected]" <[email protected]>
To: [email protected]; [email protected]; [email protected]
Cc: [email protected]
Sent: Tue, March 2, 2010 4:53:16 AM
Subject: Re: PDFTextStripper.processTextPosition

Yes, it looks like we both are trying to do the same thing. It would be 
helpful if PDFTextStripper#processTextPosition(TextPosition) worked as it 
did in 0.8, or at least if there were an easier way to make it work that way.




From: Daniel Wilson <[email protected]>
To: [email protected], [email protected]
Date: 03/01/2010 04:28 PM
Subject: Re: Fwd: PDFTextStripper.processTextPosition



Andrew, if you & Rekha have similar problems, perhaps public discussion here
would result in a good solution.  Villu is following this discussion closely
and did some of the related coding, I believe.

Daniel

On Mon, Mar 1, 2010 at 3:53 PM, <[email protected]> wrote:

>
> Thanks for the reply.  Unfortunately, Rekha and I seem to have very similar
> projects.  The PDFs I am trying to parse do vary visually, although not by
> much.  Currently my code looks for keywords, then selects text around the
> keywords based on graphical position.  I have attached an example below.
> I have a "glue" routine that combines nearby TextPositions that are within
> a threshold to recreate the words from individual characters.  When I don't
> have to use "glue" I get better results...
> Andrew
>
>
>             Zone z = new HorizontalOrder(new DirectRight(new
> TextValue("Design (Style)"), 5));
>             z.evaluate(regs);
>             style = z.getMatching().get(1).getValue();
>
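
The "glue" routine mentioned above was not posted, so the following is only a minimal sketch of the general idea: merge consecutive characters into a word while they sit on the same baseline and the horizontal gap stays under a threshold. `Glyph` and `WordGlue` are hypothetical names; `Glyph` is a simplified stand-in and not PDFBox's `TextPosition`. The sketch assumes the glyphs arrive in reading order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned character; not PDFBox's TextPosition. */
final class Glyph {
    final String text;
    final double x;      // left edge
    final double y;      // baseline
    final double width;  // advance width

    Glyph(String text, double x, double y, double width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class WordGlue {
    /**
     * Joins glyphs (assumed to be in reading order) into words. A glyph
     * continues the current word when it lies on the same baseline, within
     * yTolerance, and the gap to the previous glyph is at most maxGap.
     */
    static List<String> glue(List<Glyph> glyphs, double maxGap, double yTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean sameWord = prev != null
                    && Math.abs(g.y - prev.y) <= yTolerance
                    && g.x - (prev.x + prev.width) <= maxGap;
            if (!sameWord && current.length() > 0) {
                // Gap too large or new line: close the word built so far.
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

The threshold is the fragile part: too small and words split apart, too large and neighbouring columns merge, which may be why results are better when no gluing is needed.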
>
>
>
> From: Daniel Wilson <[email protected]>
> Date: 03/01/2010 01:25 PM
> To: [email protected]
> Cc:
> Subject: Fwd: PDFTextStripper.processTextPosition
>
>
>
>
> Andrew,
>
> Does this answer your question?  It at least looks similar ... and Villu
> has a better handle on what was done & why in that area than do I.
>
> Daniel
>
> ---------- Forwarded message ----------
> From: [email protected]
> Date: Fri, Feb 26, 2010 at 9:08 AM
> Subject: Re: PDFTextStripper.processTextPosition
> To: Villu Ruusmann <[email protected]>
> Cc: [email protected]
>
>
> You are right, I am trying to parse that form. The reason I am trying to
> use processTextPosition is that we will be doing this programmatically; there
> will be no one selecting the region. Also, we will be extracting the data
> from forms generated by different providers, which do not look exactly
> the same. For example, the whole page can look kind of squished. I tried
> PDFTextStripperByArea#extractRegions(PDPage), but since the positions are not
> exactly the same, it causes me to lose data or pick up data
> from the next column.
>
> Is there a way to find the coordinates for
> PDFTextStripperByArea#extractRegions(PDPage) columns programmatically to
> be more accurate?
>
>
>
>
>
>
> From: Villu Ruusmann <[email protected]>
> To: [email protected]
> Cc: [email protected]
> Date: 02/26/2010 02:47 AM
> Subject: Re: PDFTextStripper.processTextPosition
>
>
>
> Hello there,
>
> >
> > I thought of continuing to use the 0.8 version for my purpose for now,
> > hoping I will have an easier way to achieve it in later versions of
> > PDFBox.
> >
> > The reason for this email is that I am seeing a difference in the data I
> > receive if I run PDFTextStripper.writeText() and if I extend
> > PDFTextStripper.processTextPosition().
> > For example, I have attached a one-page PDF I used for this.
> > For example, I have attached a one-page pdf I used for this.
>
> It is unclear to me why you insist on using
> PDFTextStripper#processTextPosition(TextPosition) to do the job when
> there are better alternatives available.
>
> The example document you sent to me is the second page of the Freddie
> Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
> has a fixed 3-column layout.
>
> In order to extract field values, you need to find out their bounding
> boxes. As long as there is no PDFBox GUI around, I suggest you
> use Foxit PDF Editor for that (select an element and open "Property
> List" from its context menu). Then, instantiate a
> PDFTextStripperByArea and populate it by invoking
> PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
> Then, process the page by invoking
> PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
> values by invoking PDFTextStripperByArea#getTextForRegion(String) for
> every field. Note that you do not need to override any methods in
> class PDFTextStripperByArea - the public API does just fine.
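
Since the attached FreddieMacForm70.java is not preserved in this thread, the sequence described above (register named rectangles, process the page, read the text back per name) can only be modelled here. The sketch below is a toy stand-in and not PDFBox code: `RegionExtractor` and `PlacedChar` are hypothetical classes, the method names merely echo the ones cited above, and the coordinates are made up.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical positioned character, standing in for what a PDF parser reports. */
final class PlacedChar {
    final char ch;
    final double x;
    final double y;

    PlacedChar(char ch, double x, double y) {
        this.ch = ch;
        this.x = x;
        this.y = y;
    }
}

/** Toy model of the by-area workflow; it mirrors the call order only. */
final class RegionExtractor {
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> text = new LinkedHashMap<>();

    /** Step 1: register a named bounding box for each field. */
    void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
    }

    /** Step 2: assign every character on the page to the regions containing it. */
    void extractRegions(List<PlacedChar> page) {
        text.clear();
        for (String name : regions.keySet()) {
            text.put(name, new StringBuilder());
        }
        for (PlacedChar c : page) {
            for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
                if (e.getValue().contains(c.x, c.y)) {
                    text.get(e.getKey()).append(c.ch);
                }
            }
        }
    }

    /** Step 3: read back the text collected for one field. */
    String getTextForRegion(String name) {
        StringBuilder sb = text.get(name);
        return sb == null ? "" : sb.toString();
    }
}
```

The model also shows why fixed rectangles are sensitive to layout: a character that shifts outside its box on a "squished" page is silently dropped or lands in a neighbouring region.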
>
> I have attached a sample application (FreddieMacForm70.java) that
> extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
> Living Area" for all 3 comparable sales. You can add other fields as
> needed.
>
>
> VR
> [attachment "FreddieMacForm70.java" deleted by Rekha
> Hariramakrishnan/Flagstar_notes]
>
>
>
>
