Re: PDFTextStripper.processTextPosition

Adam Tue, 02 Mar 2010 10:24:58 -0800

+1

It's good to have the ability to drill down to the character level, but it 
should be in addition to (not in lieu of) being able to get words.

--Adam

From:
George Van Treeck <[email protected]>
To:
[email protected]
Date:
03/02/2010 10:14
Subject:
Re: PDFTextStripper.processTextPosition

I concur. I have not upgraded to 1.0 because I want words rather than 
characters.

I think the issue is one of API design philosophy. It appears to me that 
the API change maps to the underlying PDF specification, which is known as 
a functional design (each function/method maps to some component/facet of 
the PDF specification). I would prefer an API based on use-cases. Examples 
of use-cases are "Give me a word", "Give me a line of words", "Give me a 
sentence", "Give me a paragraph", and "Give me a table". From a use-case 
point of view, I don't care and don't want to know about character 
spacing, positioning, etc.

Another example of functional design that all "normal" people hate is the 
W3C DOM-based APIs. That is why there are Java APIs like dom4j and JDOM 
layered on top of the ugly DOM and SAX APIs. That is why PHP has the 
SimpleDOM API, etc. XML users think in terms of entities and attributes. 
And thus, they want APIs that operate on entities and attributes -- not 
nodes, leaves, etc.

An API based on use-cases will change very little with changes to PDF 
specifications, etc., which provides an easier upgrade path. An API based 
on use-cases also makes adoption by new users much easier, because new 
users don't have to learn about PDF formats to use the libraries. The 1.0 
API change has moved further away from a user-based API to a functional 
API, which is a very bad thing to do. And that is why there are a lot of 
complaints about the API now being "broken". From a use-case point of 
view, the API has suffered a very serious regression.

George Van Treeck

----- Original Message ----
From: "[email protected]" 
<[email protected]>
To: [email protected]; [email protected]; 
[email protected]
Cc: [email protected]
Sent: Tue, March 2, 2010 4:53:16 AM
Subject: Re: PDFTextStripper.processTextPosition

Yes, it looks like we are both trying to do the same thing. It would be 
helpful if PDFTextStripper#processTextPosition(TextPosition) worked as it 
did in 0.8, or at least if there were an easier way to make it work that 
way.

From:
Daniel Wilson <[email protected]>
To:
[email protected], [email protected]
Date:
03/01/2010 04:28 PM
Subject:
Re: Fwd: PDFTextStripper.processTextPosition

Andrew, if you & Rekha have similar problems perhaps public discussion here
would result in a good solution.  Villu is following this discussion closely
and did some of the related coding, I believe.

Daniel

On Mon, Mar 1, 2010 at 3:53 PM, <[email protected]> wrote:

>
> Thanks for the reply.  Unfortunately Rekha and I seem to have very similar
> projects.  The PDFs I am trying to parse do vary visually, although not by
> much.  Currently my code looks for keywords, then selects text around the
> keywords based on the graphical position.  I have attached an example below.
> I have a "glue" routine that combines nearby TextPositions that are within
> a threshold to recreate the words from individual characters.  When I don't
> have to use "glue" I get better results...
> Andrew
>
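A "glue" routine of the kind described above might look roughly like the following. This is only a sketch under stated assumptions, not the code from this message: Glyph is a hypothetical stand-in for PDFBox's TextPosition, and the names GlueSketch, gapThreshold and lineTolerance are made up for illustration. It also assumes the glyphs already arrive in reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a positioned character (not PDFBox's TextPosition).
final class Glyph {
    final String text;
    final double x;      // left edge of the glyph
    final double y;      // baseline of the glyph
    final double width;  // advance width of the glyph

    Glyph(String text, double x, double y, double width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class GlueSketch {
    // Joins glyphs into words. A new word starts when the baseline differs by more
    // than lineTolerance, or when the horizontal gap to the previous glyph exceeds
    // gapThreshold. Assumes the glyphs are already in reading order.
    static List<String> glue(List<Glyph> glyphs, double gapThreshold, double lineTolerance) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                boolean sameLine = Math.abs(g.y - previous.y) <= lineTolerance;
                double gap = g.x - (previous.x + previous.width);
                if (!sameLine || gap > gapThreshold) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 10, 100, 6), new Glyph("i", 16, 100, 3),
                new Glyph("y", 30, 100, 5), new Glyph("o", 35, 100, 5));
        System.out.println(glue(glyphs, 2.0, 1.0)); // prints [Hi, yo]
    }
}
```

Choosing the gap threshold is the fragile part: too small a value splits words, too large a value merges neighbours, which may be one reason results are better when no gluing is needed.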
>
>             Zone z = new HorizontalOrder(
>                     new DirectRight(new TextValue("Design (Style)"), 5));
>             z.evaluate(regs);
>             style = z.getMatching().get(1).getValue();
>
>
>
>
> From: Daniel Wilson <[email protected]>
> Date: 03/01/2010 01:25 PM
> To: [email protected]
> Subject: Fwd: PDFTextStripper.processTextPosition
>
>
>
>
> Andrew,
>
> Does this answer your question?  It at least looks similar ... and Villu
> has a better handle on what was done & why in that area than do I.
>
> Daniel
>
> ---------- Forwarded message ----------
> From: <[email protected]>
> Date: Fri, Feb 26, 2010 at 9:08 AM
> Subject: Re: PDFTextStripper.processTextPosition
> To: Villu Ruusmann <[email protected]>
> Cc: [email protected]
>
>
> You are right, I am trying to parse that form. The reason I am trying to
> use processTextPosition is that we will be doing this programmatically;
> there will be no one selecting the region. Also, we will be extracting the
> data from forms generated by different providers, which do not look exactly
> the same. For example, the whole page looks kind of squished. I tried
> PDFTextStripperByArea#extractRegions(PDPage), but since the positions are
> not exactly the same, it causes me to lose data or pick up the data
> from the next column.
>
> Is there a way to find the coordinates for
> PDFTextStripperByArea#extractRegions(PDPage) columns programmatically to
> be more accurate?
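One way to derive region coordinates at run time, along the lines of the keyword-based selection described elsewhere in this thread, is to locate a label first and build the rectangle relative to it. The sketch below assumes the label's bounding box has already been found (for instance from collected text positions). LabelRegionSketch, rightOf, width and tolerance are illustrative names, not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

final class LabelRegionSketch {
    // Returns a rectangle that starts at the right edge of the label's bounding box,
    // extends `width` units to the right, and is padded above and below by
    // `tolerance` to absorb small layout shifts between documents.
    static Rectangle2D rightOf(Rectangle2D label, double width, double tolerance) {
        return new Rectangle2D.Double(
                label.getMaxX(),
                label.getMinY() - tolerance,
                width,
                label.getHeight() + 2 * tolerance);
    }

    public static void main(String[] args) {
        // Hypothetical bounding box of a label found on the page.
        Rectangle2D label = new Rectangle2D.Double(100, 200, 50, 10);
        Rectangle2D valueArea = rightOf(label, 120, 3);
        System.out.println(valueArea);
    }
}
```

The resulting rectangle could then be handed to PDFTextStripperByArea#addRegion(String, Rectangle2D), so the region follows the label instead of relying on fixed page coordinates.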
>
>
>
>
>
>
> From:
> Villu Ruusmann <[email protected]>
> To:
> [email protected]
> Cc:
> [email protected]
> Date:
> 02/26/2010 02:47 AM
> Subject:
> Re: PDFTextStripper.processTextPosition
>
>
>
> Hello there,
>
> >
> > I thought of continuing to use the 0.8 version for my purpose for now,
> > hoping there will be an easier way to achieve this in later versions of
> > PDFBox.
> >
> > The reason for this email is that I am seeing a difference in the data I
> > receive when I run PDFTextStripper.writeText() and when I extend
> > PDFTextStripper.processTextPosition().
> > For example, I have attached a one-page PDF I used for this.
>
> It is unclear to me why you insist on using
> PDFTextStripper#processTextPosition(TextPosition) to do the job when
> there are better alternatives available.
>
> The example document you sent to me is the second page of the Freddie
> Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
> has a fixed 3-column layout.
>
> In order to extract field values, you need to find out their bounding
> boxes. For as long as there is no PDFBox GUI around, I suggest you
> use Foxit PDF Editor for that (select an element and open "Property
> List" from its context menu). Then, instantiate a
> PDFTextStripperByArea and populate it by invoking
> PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
> Then, process the page by invoking
> PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
> values by invoking PDFTextStripperByArea#getTextForRegion(String) for
> every field. Note that you do not need to override any methods in
> class PDFTextStripperByArea - the public API does just fine.
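The call sequence above (register named rectangles, process the page, read back the text per name) can be modelled in a few lines. The class below is a toy stand-in written for illustration, not PDFBox's PDFTextStripperByArea. Only the method names addRegion and getTextForRegion are borrowed from it; RegionModel and accept are hypothetical, with accept standing for whatever feeds positioned text in while a page is processed.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of named-region text collection (an assumption-laden sketch, not PDFBox code).
final class RegionModel {
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<String, Rectangle2D>();
    private final Map<String, StringBuilder> text = new LinkedHashMap<String, StringBuilder>();

    // Registers a named rectangle and an empty buffer for its text.
    void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
        text.put(name, new StringBuilder());
    }

    // Offers one positioned piece of text to every region; it is appended to each
    // region whose rectangle contains the point (x, y).
    void accept(String piece, double x, double y) {
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            if (e.getValue().contains(x, y)) {
                text.get(e.getKey()).append(piece);
            }
        }
    }

    // Returns the text collected for a region, or an empty string for unknown names.
    String getTextForRegion(String name) {
        StringBuilder sb = text.get(name);
        return sb == null ? "" : sb.toString();
    }

    public static void main(String[] args) {
        RegionModel model = new RegionModel();
        model.addRegion("Sale Price", new Rectangle2D.Double(0, 0, 100, 20));
        model.accept("4", 10, 5);
        model.accept("2", 16, 5);
        model.accept("X", 150, 5); // outside every region, so it is dropped
        System.out.println(model.getTextForRegion("Sale Price")); // prints 42
    }
}
```

In this model, membership is a plain point-in-rectangle test, so text that shifts a few units between providers can fall outside its region or into a neighbouring one. That would be consistent with the lost or misplaced data reported in this thread when fixed coordinates are used on forms that are not laid out identically.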
>
> I have attached a sample application (FreddieMacForm70.java) that
> extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
> Living Area" for all 3 comparable sales. You can add other fields as
> needed.
>
>
> VR
> [attachment "FreddieMacForm70.java" deleted by Rekha
> Hariramakrishnan/Flagstar_notes]
>
>
>
>
>

