Daniel Wilson helpfully answered my previous question, but I thought
I would post my reply to the group. The question was why version 0.8
returns full words while 1.0 returns individual characters when using
PDFTextStripper#processTextPosition(TextPosition).
Rekha and I seem to have very similar projects. The PDFs I am trying to
parse vary visually, although not by much. Currently my code looks for
keywords, then selects text around the keywords based on its graphical
position. I have included an example below. I have a "glue" routine that
combines nearby TextPositions that are within a threshold of each other
to recreate the words from individual characters. My code works in
versions 0.8 and 1.0 of PDFBox, but I get better results when I don't
have to use "glue", as gluing often over-glues.
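
For illustration, a glue step of this kind might look roughly like the
sketch below. It is a simplified stand-alone example, not the actual
routine: the Glyph class, the glue method, and the threshold values are
hypothetical, and it works on plain x/width numbers rather than on PDFBox
TextPosition objects. It also assumes the glyphs are already sorted left
to right on a single line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlueSketch {

    /** Hypothetical stand-in for a TextPosition: one character and its horizontal extent. */
    static final class Glyph {
        final String text;
        final float x;      // left edge of the character
        final float width;  // horizontal width of the character

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs into words. A new word starts whenever the gap between
     * the end of the previous glyph and the start of the next one is
     * larger than the threshold.
     */
    static List<String> glue(List<Glyph> glyphs, float threshold) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        float previousEnd = 0f;
        for (Glyph g : glyphs) {
            if (current.length() > 0 && g.x - previousEnd > threshold) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previousEnd = g.x + g.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("H", 0f, 5f),
                new Glyph("i", 5.5f, 3f),
                new Glyph("Y", 14f, 5f),
                new Glyph("o", 19.2f, 5f));
        System.out.println(glue(line, 1.5f)); // small threshold keeps the words apart
        System.out.println(glue(line, 6f));   // large threshold glues them together
    }
}
```

The second call shows the over-gluing problem: once the threshold is
larger than the gap between two words, they are merged into one.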
The code below allows some flexibility when PDFs look the same but vary in
the exact positioning of the text.
Andrew
Zone z = new HorizontalOrder(
    new DirectRight(new TextValue("Design (Style)"), 5));
z.evaluate(regs);
style = z.getMatching().get(1).getValue();
Daniel sent me the email below.
>>
>> I thought of continuing to use 0.8 version for my purpose for now.
>> Hoping I will have the easier way to achieve it in the later versions
>> of PDFBox.
>>
>> The reason for this email is, I am having a difference in the data I
>> receive if I run PDFTextStripper.writeText() and if I extend
>> PDFTextStripper.processTextPosition().
>> For example, I have attached a one-page pdf I used for this.
>
>It is unclear to me why you insist on using
>PDFTextStripper#processTextPosition(TextPosition) to do the job when
>there are better alternatives available.
>
>The example document you sent to me is the second page of the Freddie
>Mac Form 70 (http://www.freddiemac.com/sell/forms/pdf/70.pdf), which
>has a fixed 3-column layout.
>
>In order to extract field values, you need to find out their bounding
>boxes. As long as there is no PDFBox GUI around, I suggest you
>use Foxit PDF Editor for that (select an element and open "Property
>List" from its context menu). Then, instantiate a
>PDFTextStripperByArea and populate it by invoking
>PDFTextStripperByArea#addRegion(String, Rectangle2D) for every field.
>Then, process the page by invoking
>PDFTextStripperByArea#extractRegions(PDPage). Finally, retrieve field
>values by invoking PDFTextStripperByArea#getTextForRegion(String) for
>every field. Note that you do not need to override any methods in
>class PDFTextStripperByArea - the public API does just fine.
>
>I have attached a sample application (FreddieMacForm70.java) that
>extracts the fields "Sale Price", "Date of Sale/Time", and "Gross
>Living Area" for all 3 comparable sales. You can add other fields as
>needed.