I concur. I have not upgraded to 1.0 because I want words rather than characters.
I think the issue is one of API design philosophy. It appears to me that the API change maps to the underlying PDF specification, which is known as a functional design (each function/method maps to some component/facet of the PDF specification).

I would prefer an API based on use-cases. Examples of use-cases are "Give me a word", "Give me a line of words", "Give me a sentence", "Give me a paragraph", and "Give me a table". From a use-case point of view, I don't care and don't want to know about character spacing, positioning, etc.

Another example of functional design that all "normal" people hate is the W3C DOM-based APIs. That is why there are Java APIs like dom4j and JDOM layered on top of the ugly DOM and SAX APIs. That is why PHP has the SimpleDOM API, etc. XML users think in terms of elements and attributes. And thus, they want APIs that operate on elements and attributes -- not nodes, leaves, etc.

An API based on use-cases will change very little with changes to PDF specifications, etc., which provides an easier upgrade path. An API based on use-cases also makes adoption by new users much easier, because new users don't have to learn about PDF formats to use the libraries.

The 1.0 API change has moved further away from a use-case-based API toward a functional API, which is a very bad thing to do. And that is why there are a lot of complaints about the API now being "broken". From a use-case point of view, the API has suffered a very serious regression.

George Van Treeck

----- Original Message ----
From: "[email protected]" <[email protected]>
To: [email protected]; [email protected]; [email protected]
Cc: [email protected]
Sent: Tue, March 2, 2010 4:53:16 AM
Subject: Re: PDFTextStripper.processTextPosition

Yes, it looks like we both are trying to do the same thing. It would be helpful if PDFTextStripper#processTextPosition(TextPosition) worked as it did in 0.8, or at least an easier way to make it work that way would be good.

From: Daniel Wilson <[email protected]>
To: [email protected], [email protected]
Date: 03/01/2010 04:28 PM
Subject: Re: Fwd: PDFTextStripper.processTextPosition

Andrew, if you & Rekha have similar problems, perhaps public discussion here would result in a good solution. Villu is following this discussion closely and did some of the related coding, I believe.

Daniel

On Mon, Mar 1, 2010 at 3:53 PM, <[email protected]> wrote:
>
> Thanks for the reply. Unfortunately Rekha and I seem to have very similar projects. The pdfs I am trying to parse do vary visually, although not by much. Currently my code looks for keywords, then selects text around the keywords based on the graphical position. I have attached an example below. I have a "glue" routine that combines nearby TextPositions that are within a threshold to recreate the words from individual characters. When I don't have to use "glue" I get better results...
>
> Andrew
>
> Zone z = new HorizontalOrder(new DirectRight(new TextValue("Design (Style)"), 5));
> z.evaluate(regs);
> style = z.getMatching().get(1).getValue();
>
> Daniel Wilson <[email protected]>
> 03/01/2010 01:25 PM
> To: [email protected]
> cc:
> Subject: Fwd: PDFTextStripper.processTextPosition
>
> Andrew,
>
> Does this answer your question? It at least looks similar ... and Villu has a better handle on what was done & why in that area than do I.
>
> Daniel
>
> ---------- Forwarded message ----------
> From: <[email protected]>
> Date: Fri, Feb 26, 2010 at 9:08 AM
> Subject: Re: PDFTextStripper.processTextPosition
> To: Villu Ruusmann <[email protected]>
> Cc: [email protected]
>
> You are right, I am trying to parse that form. The reason I am trying to use processTextPosition is that we will be doing this programmatically; there will be no one selecting the region.
> Also we will be extracting the data from the form generated by different providers, which does not look exactly the same. For example, the whole page looks kind of squished. I tried PDFTextStripperByArea#extractRegions(PDPage); since the position will not be exactly the same, it is causing me to lose data or pick up the data from the next column.
>
> Is there a way to find the coordinates for PDFTextStripperByArea#extractRegions(PDPage) columns programmatically to be more accurate?
>
> From: Villu Ruusmann <[email protected]>
> To: [email protected]
> Cc: [email protected]
> Date: 02/26/2010 02:47 AM
> Subject: Re: PDFTextStripper.processTextPosition
>
> Hello there,
>
> > I thought of continuing to use 0.8 version for my purpose for now.
> > Hoping I will have the easier way to achieve it in the later versions of PDFBox.
> >
> > The reason for this email is, I am having a difference in the data I receive if I run
> > PDFTextStripper.writeText() and if I extend PDFTextStripper.processTextPosition( ).
> > For example, I have attached a one-page pdf I used for this.
>
> It is unclear to me why you insist on using PDFTextStripper#processTextPosition(TextPosition) to do the job when there are better alternatives available.
>
> The example document you sent to me is the second page of the Freddie Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which has a fixed 3-column layout.
>
> In order to extract field values, you need to find out their bounding boxes. For as long as there is no PDFBox GUI around, I suggest you use Foxit PDF Editor for that (select an element and open "Property List" from its context menu). Then, instantiate a PDFTextStripperByArea and populate it by invoking PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
> Then, process the page by invoking PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field values by invoking PDFTextStripperByArea#getTextForRegion(String) for every field. Note that you do not need to override any methods in class PDFTextStripperByArea - the public API does just fine.
>
> I have attached a sample application (FreddieMacForm70.java) that extracts the fields "Sale Price", "Date of Sale/Time", and "Gross Living Area" for all 3 comparable sales. You can add other fields as needed.
>
> VR
>
> [attachment "FreddieMacForm70.java" deleted by Rekha Hariramakrishnan/Flagstar_notes]
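
For readers following the thread, the "glue" routine Andrew mentions (combining nearby character positions within a threshold to recreate words) might look roughly like the sketch below. It is not Andrew's code and not part of PDFBox: the Glyph class is a hypothetical stand-in for whatever a processTextPosition override collects, and the two thresholds are assumed tuning parameters. It also assumes the glyphs already arrive in reading order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one positioned character; NOT a PDFBox class. */
final class Glyph {
    final String text;   // the character drawn
    final double x;      // left edge of the character
    final double y;      // baseline position
    final double width;  // horizontal extent of the character

    Glyph(String text, double x, double y, double width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

/** Illustrative word "glue": joins adjacent glyphs into words. */
final class WordGlue {
    private WordGlue() {}

    /**
     * Joins glyphs (assumed to be in reading order) into words.
     * A new word starts when the horizontal gap to the previous glyph
     * exceeds maxGap, when the baseline moves by more than
     * maxBaselineShift, or when a whitespace glyph is seen.
     */
    static List<String> glue(List<Glyph> glyphs, double maxGap, double maxBaselineShift) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                double gap = g.x - (previous.x + previous.width);
                boolean sameLine = Math.abs(g.y - previous.y) <= maxBaselineShift;
                if (!sameLine || gap > maxGap) {
                    flush(current, words);
                }
            }
            if (g.text.trim().isEmpty()) {
                flush(current, words);   // an explicit space ends the word
            } else {
                current.append(g.text);
            }
            previous = g;
        }
        flush(current, words);
        return words;
    }

    /** Moves the pending word, if any, into the result list. */
    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }
}
```

A caller would fill a list of Glyph objects from the per-character callbacks and then call WordGlue.glue(glyphs, maxGap, maxBaselineShift). Suitable threshold values depend on font size and on how the PDF was produced, which may be part of why the results are better when no glue is needed.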

