Daniel Wilson was helpful in answering my previous question, but I thought 
I would post my reply to the group. My previous question was why version 
0.8 returned full words while 1.0 returns individual characters when using 
PDFTextStripper#processTextPosition(TextPosition).


Rekha and I seem to have very similar projects.  The PDFs I am trying to 
parse do vary visually, although not by much.  Currently my code looks for 
keywords, then selects text around the keywords based on its graphical 
position.  I have included an example below.  I have a "glue" routine that 
combines nearby TextPositions that are within a threshold of each other to 
recreate the words from individual characters.  My code works in versions 
0.8 and 1.0 of PDFBox, but I get better results when I don't have to use 
"glue", as gluing often over-glues, joining characters that belong to 
separate words.
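
To make the gluing idea concrete, here is a minimal sketch, not my actual 
routine. Glyph is a hypothetical stand-in for PDFBox's TextPosition (just 
the text plus x, y and width), and it assumes the glyphs arrive in reading 
order. Two neighbours are joined when they sit on the same line and the 
horizontal gap between them is at most maxGap.

```java
import java.util.ArrayList;
import java.util.List;

class GlueSketch {
    /** Hypothetical stand-in for a TextPosition: one glyph and its box. */
    static final class Glyph {
        final String text;
        final float x, y, width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Joins consecutive glyphs into words. A new word starts when the next
     * glyph is on a different line or further than maxGap from the right
     * edge of the previous glyph. Assumes reading order on input.
     */
    static List<String> glue(List<Glyph> glyphs, float maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                boolean sameLine = Math.abs(g.y - prev.y) <= maxGap;
                if (!sameLine || gap > maxGap) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

With a tight threshold the words stay separate; with a threshold wider than 
the real space between words, everything on the line is merged into one 
string, which is the over-gluing problem described above.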

The code below allows some flexibility when PDFs look the same but vary in 
the exact positioning of the text. 
Andrew 


            Zone z = new HorizontalOrder(
                new DirectRight(new TextValue("Design (Style)"), 5)); 
            z.evaluate(regs); 
            style = z.getMatching().get(1).getValue(); 

Daniel sent me the email below.



>>
>> I thought of continuing to use 0.8 version for my purpose for now.
>> Hoping I will have the easier way to achieve it in the later versions of
>> PDFBox.
>>
>> The reason for this email is, I am having a difference in the data I
>> receive if I run PDFTextStripper.writeText() and if I extend
>> PDFTextStripper.processTextPosition( ).
>> For example, I have attached a one-page pdf I used for this.
>
>It is unclear to me why you insist on using
>PDFTextStripper#processTextPosition(TextPosition) to do the job when
>there are better alternatives available.
>
>The example document you sent to me is the second page of the Freddie
>Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
>has a fixed 3-column layout.
>
>In order to extract field values, you need to find out their bounding
>boxes. As long as there is no PDFBox GUI around, I suggest you
>use Foxit PDF Editor for that (select an element and open "Property
>List" from its context menu). Then, instantiate a
>PDFTextStripperByArea and populate it by invoking
>PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
>Then, process the page by invoking
>PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
>values by invoking PDFTextStripperByArea#getTextForRegion(String) for
>every field. Note that you do not need to override any methods in
>class PDFTextStripperByArea - the public API does just fine.
>
>I have attached a sample application (FreddieMacForm70.java) that
>extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
>Living Area" for all 3 comparable sales. You can add other fields as
>needed.
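
For a fixed three-column layout like the one Daniel describes, the region 
rectangles can be generated rather than typed in one by one. This is a 
rough sketch only, not the attached FreddieMacForm70.java: the class name, 
the "#n" naming scheme and every coordinate are made-up assumptions, and 
real values would have to come from measuring the form. It uses only 
java.awt.geom so it runs without PDFBox; the comment at the end shows where 
the PDFTextStripperByArea calls named in Daniel's email would go.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

class ColumnRegions {
    /**
     * Builds one rectangle per column for a field that repeats across a
     * fixed multi-column layout. x and y are the corner of the field in the
     * first column; colWidth is the horizontal distance between columns.
     * Region names are "<field> #1", "<field> #2", ... (a made-up scheme).
     */
    static Map<String, Rectangle2D> forField(String field, double x, double y,
                                             double colWidth, double height,
                                             int columns) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        for (int i = 0; i < columns; i++) {
            Rectangle2D box =
                new Rectangle2D.Double(x + i * colWidth, y, colWidth, height);
            regions.put(field + " #" + (i + 1), box);
        }
        return regions;
    }

    public static void main(String[] args) {
        // Placeholder coordinates, not measured from the real form.
        Map<String, Rectangle2D> regions =
            forField("Sale Price", 100, 200, 150, 12, 3);
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // With PDFBox on the classpath, each entry would be passed to
        // PDFTextStripperByArea#addRegion(String, Rectangle2D), followed by
        // extractRegions(PDPage) and getTextForRegion(String) per name.
    }
}
```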
