Yes, it looks like we are both trying to do the same thing. It would be helpful if PDFTextStripper#processTextPosition(TextPosition) worked as it did in 0.8, or at least if there were an easier way to make it behave that way.
From: Daniel Wilson <[email protected]>
To: [email protected], [email protected]
Date: 03/01/2010 04:28 PM
Subject: Re: Fwd: PDFTextStripper.processTextPosition

Andrew, if you & Rekha have similar problems, perhaps public discussion here would result in a good solution. Villu is following this discussion closely and did some of the related coding, I believe.

Daniel

On Mon, Mar 1, 2010 at 3:53 PM, <[email protected]> wrote:
>
> Thanks for the reply. Unfortunately Rekha and I seem to have very similar
> projects. The PDFs I am trying to parse do vary visually, although not by
> much. Currently my code looks for keywords, then selects text around the
> keywords based on the graphical position. I have attached an example below.
> I have a "glue" routine that combines nearby TextPositions that are within
> a threshold to recreate the words from individual characters. When I don't
> have to use "glue" I get better results...
> Andrew
>
> Zone z = new HorizontalOrder(new DirectRight(new TextValue("Design (Style)"), 5));
> z.evaluate(regs);
> style = z.getMatching().get(1).getValue();
>
> Daniel Wilson <[email protected]>
> 03/01/2010 01:25 PM
> To: [email protected]
> cc:
> Subject: Fwd: PDFTextStripper.processTextPosition
>
> Andrew,
>
> Does this answer your question? It at least looks similar ... and Villu
> has a better handle on what was done & why in that area than do I.
>
> Daniel
>
> ---------- Forwarded message ----------
> From: <[email protected]>
> Date: Fri, Feb 26, 2010 at 9:08 AM
> Subject: Re: PDFTextStripper.processTextPosition
> To: Villu Ruusmann <[email protected]>
> Cc: [email protected]
>
> You are right, I am trying to parse that form. The reason I am trying to
> use processTextPosition is that we will be doing this programmatically; there
> will be no one selecting the region.
> Also, we will be extracting the data
> from forms generated by different providers, which do not look exactly
> the same. For example, the whole page looks kind of squished. I tried
> PDFTextStripperByArea#extractRegions(PDPage), but since the position will not
> be exactly the same, it is causing me to lose data or pick up the data
> from the next column.
>
> Is there a way to find the coordinates for the
> PDFTextStripperByArea#extractRegions(PDPage) columns programmatically, to
> be more accurate?
>
> From: Villu Ruusmann <[email protected]>
> To: [email protected]
> Cc: [email protected]
> Date: 02/26/2010 02:47 AM
> Subject: Re: PDFTextStripper.processTextPosition
>
> Hello there,
>
> > I thought of continuing to use 0.8 version for my purpose for now.
> > Hoping I will have the easier way to achieve it in the later versions of
> > PDFBox.
> >
> > The reason for this email is, I am having a difference in the data I
> > receive if I run PDFTextStripper.writeText() and if I extend
> > PDFTextStripper.processTextPosition().
> > For example, I have attached a one-page pdf I used for this.
>
> It is unclear to me why you insist on using
> PDFTextStripper#processTextPosition(TextPosition) to do the job when
> there are better alternatives available.
>
> The example document you sent to me is the second page of the Freddie
> Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
> has a fixed 3-column layout.
>
> In order to extract field values, you need to find out their bounding
> boxes. For as long as there is no PDFBox GUI around, I suggest you
> use Foxit PDF Editor for that (select an element and open "Property
> List" from its context menu). Then, instantiate a
> PDFTextStripperByArea and populate it by invoking
> PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
> Then, process the page by invoking
> PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
> values by invoking PDFTextStripperByArea#getTextForRegion(String) for
> every field. Note that you do not need to override any methods in
> class PDFTextStripperByArea - the public API does just fine.
>
> I have attached a sample application (FreddieMacForm70.java) that
> extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
> Living Area" for all 3 comparable sales. You can add other fields as
> needed.
>
> VR
> [attachment "FreddieMacForm70.java" deleted by Rekha
> Hariramakrishnan/Flagstar_notes]
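
For reference, the "glue" step mentioned in the quoted thread (combining nearby TextPositions within a threshold to rebuild words from individual characters) could look roughly like the sketch below. This is not Andrew's code and it does not use the PDFBox API: the GlueSketch class and its Glyph type are hypothetical stand-ins so the example runs on its own, and a real version would read the character and its coordinates from each TextPosition instead. It also assumes the glyphs already arrive in reading order and that a single threshold is good enough for both the horizontal gap and the baseline difference.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a "glue" pass: merges neighbouring glyphs into words. */
final class GlueSketch {

    /** Hypothetical stand-in for one extracted character and its position. */
    static final class Glyph {
        final String text;
        final float x;      // left edge of the character
        final float y;      // baseline of the character
        final float width;  // advance width of the character

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Glues glyphs into words. The glyphs must already be in reading order.
     * A new word starts when the baseline moves by more than the threshold,
     * or when the horizontal gap to the previous glyph exceeds the threshold.
     */
    static List<String> glue(List<Glyph> glyphs, float threshold) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;

        for (Glyph g : glyphs) {
            if (previous != null) {
                float gap = g.x - (previous.x + previous.width);
                boolean sameLine = Math.abs(g.y - previous.y) <= threshold;
                if (!sameLine || gap > threshold) {
                    // Too far from the previous glyph: close the current word.
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.text);
            previous = g;
        }

        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

A small threshold keeps separate words and columns apart but may split words with wide letter spacing; a large one has the opposite problem, which may be why results are better when no gluing is needed at all.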

