Hello there,

>
> And about your example, you are saying that "Hello World" would result
> in two invocations.
> But 1.0 results in 10 or 11 invocations - once for each character.
>

Your PDF document contains a "character spacing" instruction, which
tells the renderer to leave extra space after every character.
Like this -
"H"(0.01)"e"(0.01)"l"(0.01)"l"(0.01)"o"(10.0)"W"(0.01)"o"(0.01)"r"(0.01)"l"(0.01)"d".
PDFBox 0.8.0 did not honour this instruction, but PDFBox 1.0.X does, so
every character is now processed separately. I must admit that this is
annoying when dealing with small "character spacing" values (< 0.1).

> Anyway, it is not that I should be able use processTextPosition method
> to do my job.
> What I am trying to say is - if you understood my goal is - I should be
> able to say what the "quality of Construction" was for "comparable sale
> #1" in the image I sent you before, then may be you could tell me if
> there is a way to do that with PDFBox.
>

I looked it up from the image - the bounding box of that cell is
[x=610, y=520, width=180, height=30].

You can use class PDFTextStripperByArea instead of PDFTextStripper:

import java.awt.geom.Rectangle2D;
import org.apache.pdfbox.util.PDFTextStripperByArea;

PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
// Define the symbolic name and the bounding box of the field
textStripper.addRegion("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30));
// ... add more fields as needed
// pdfPage is the PDPage that contains the table
textStripper.extractRegions(pdfPage);
// Retrieve the value of the field by its symbolic name
String qualityOfConstrForCompSale1 = textStripper.getTextForRegion("CS1-QoC");
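
In case it helps to see the idea behind region-based extraction, here is
a rough, self-contained sketch that does not use PDFBox at all. It
assumes each piece of text has an (x, y) position and keeps it only if
that position falls inside a named rectangle. The class
RegionFilterSketch, the Glyph type and the sample text "Good" and "Avg"
are all made up for illustration; they are not PDFBox classes and not
values from your document.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionFilterSketch {

    /** Hypothetical stand-in for a positioned piece of text (not a PDFBox type). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Collects, per named region, the text of every glyph positioned inside it. */
    static Map<String, String> extract(Map<String, Rectangle2D> regions, List<Glyph> glyphs) {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        for (String name : regions.keySet()) {
            buffers.put(name, new StringBuilder());
        }
        for (Glyph g : glyphs) {
            for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
                if (e.getValue().contains(g.x, g.y)) {
                    buffers.get(e.getKey()).append(g.text);
                }
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30));

        List<Glyph> glyphs = new ArrayList<>();
        // Invented sample data: "Good" lies inside the cell, "Avg" lies to the left of it.
        float x = 615f;
        for (String c : new String[] {"G", "o", "o", "d"}) {
            glyphs.add(new Glyph(c, x, 530f));
            x += 7f;
        }
        x = 400f;
        for (String c : new String[] {"A", "v", "g"}) {
            glyphs.add(new Glyph(c, x, 530f));
            x += 7f;
        }

        System.out.println(extract(regions, glyphs).get("CS1-QoC")); // prints "Good"
    }
}
```

The point is that per-character processing stops mattering once you
select text by position: each character is tested against the rectangle
on its own, and the region buffer puts them back together.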

>
> I was able to do that with version 0.8. Is there a way to set a
> particular value to Tc, Tw, Tj etc so that It would behave the way it
> did before. Just like I have the option to set the "setWordSeparator",
> "setLineSeparator" and "setPageSeparator" to "" - effectively ignoring
> word separation, lineseparation and pageseparation respectively for
> PDFTextStripper.writeText.
>

You could modify class org.apache.pdfbox.util.PDFStreamEngine to suit
your needs. If I'm not mistaken, the logic that controls the
processing of characters is located on lines 481-484 (as of SVN
revision 908338). If you want to disable "character spacing", delete
the equality expression "spacingText == 0". If you want to make it
less sensitive, substitute "0" with something greater, such as "0.1".
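
To illustrate what a less sensitive threshold would buy you, here is a
small standalone model of the idea. It is only a sketch and not the
actual PDFStreamEngine code: the class SpacingThresholdSketch and its
chunk method are made up, and the assumption is simply that a run of
characters is handed over as one chunk until the spacing after a
character exceeds the threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class SpacingThresholdSketch {

    /**
     * Groups glyphs into chunks. gaps[i] is the spacing between glyph i and
     * glyph i + 1, so gaps must hold glyphs.length - 1 values. A chunk ends
     * after the last glyph or wherever the following gap exceeds the threshold.
     */
    static List<String> chunk(String[] glyphs, float[] gaps, float threshold) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            current.append(glyphs[i]);
            boolean last = (i == glyphs.length - 1);
            if (last || gaps[i] > threshold) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] glyphs = {"H", "e", "l", "l", "o", "W", "o", "r", "l", "d"};
        float[] gaps = {0.01f, 0.01f, 0.01f, 0.01f, 10.0f, 0.01f, 0.01f, 0.01f, 0.01f};

        // Threshold 0: any non-zero spacing ends the chunk, one chunk per character.
        System.out.println(chunk(glyphs, gaps, 0f).size());   // prints 10

        // Threshold 0.1: the tiny gaps are ignored, only the 10.0 gap splits the text.
        System.out.println(chunk(glyphs, gaps, 0.1f));        // prints [Hello, World]
    }
}
```

With a threshold of 0 the model gives the one-invocation-per-character
behaviour you are seeing, and with 0.1 it gives the two chunks you
expected for "Hello World".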


VR
