Hello there,
>
> And about your example, you are saying that "Hello World" would result in two
> invocations.
> But 1.0 results in 10 or 11 invocations - once for each character.
>
Your PDF document contains a "character spacing" instruction, which
states that extra space should be inserted between consecutive characters.
Like this:
"H"(0.01)"e"(0.01)"l"(0.01)"l"(0.01)"o"(10.0)"W"(0.01)"o"(0.01)"r"(0.01)"l"(0.01)"d".
PDFBox 0.8.0 did not honour this instruction, but PDFBox 1.0.X does,
which is why each character is now reported separately. I must admit
that this is annoying when dealing with small "character spacing"
values (< 0.1).
> Anyway, it is not that I should be able use processTextPosition method to do
> my job.
> What I am trying to say is - if you understood my goal is - I should be able
> to say what the
> "quality of Construction" was for "comparable sale #1" in the image I sent you
> before,
> then maybe you could tell me if there is a way to do that with PDFBox.
>
I looked it up from the image - the bounding box of that cell is
[x=610, y=520, width=180, height=30].
You can use class PDFTextStripperByArea instead of PDFTextStripper:

    import java.awt.geom.Rectangle2D;
    import org.apache.pdfbox.util.PDFTextStripperByArea;

    PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
    // Define the symbolic name and the bounding box of the field
    textStripper.addRegion("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30));
    // ... add more fields as needed
    textStripper.extractRegions(pdfPage);
    // Retrieve the value of the field by the symbolic name
    String qualityOfConstrForCompSale1 = textStripper.getTextForRegion("CS1-QoC");
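
For intuition about why the bounding box is all you need, here is a
minimal sketch of region-based extraction using only the JDK. It is not
PDFBox's implementation: the class RegionSketch, the method offerGlyph
and the glyph coordinates are hypothetical and exist only for
illustration. The idea is simply that a glyph is kept for a region when
its position falls inside that region's rectangle.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT how PDFBox is implemented.
final class RegionSketch {
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> text = new LinkedHashMap<>();

    // Register a named bounding box.
    void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
        text.put(name, new StringBuilder());
    }

    // Offer one glyph at position (x, y); it is appended to every
    // region whose rectangle contains that point.
    void offerGlyph(String glyph, double x, double y) {
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            if (e.getValue().contains(x, y)) {
                text.get(e.getKey()).append(glyph);
            }
        }
    }

    // Text collected so far for a region, or "" for an unknown name.
    String getTextForRegion(String name) {
        StringBuilder sb = text.get(name);
        return sb == null ? "" : sb.toString();
    }

    public static void main(String[] args) {
        RegionSketch sketch = new RegionSketch();
        sketch.addRegion("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30));
        sketch.offerGlyph("Q", 615, 530);  // inside the cell (made-up position)
        sketch.offerGlyph("4", 625, 530);  // inside the cell (made-up position)
        sketch.offerGlyph("X", 100, 100);  // elsewhere on the page, ignored
        System.out.println(sketch.getTextForRegion("CS1-QoC"));
    }
}
```

Because the selection depends only on where each glyph lands, it does
not matter in this sketch whether the glyphs arrive one at a time or in
larger chunks.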
>
> I was able to do that with version 0.8. Is there a way to set a particular
> value to Tc, Tw, Tj etc.
> so that it would behave the way it did before? Just like I have the option to
> set the
> "setWordSeparator", "setLineSeparator" and "setPageSeparator" to "" -
> effectively ignoring word
> separation, line separation and page separation respectively for
> PDFTextStripper.writeText.
>
You could modify class org.apache.pdfbox.util.PDFStreamEngine to suit
your needs. If I'm not mistaken, the logic which controls the
processing of characters is located on lines 481-484 (as of SVN
revision 908338). If you want to disable "character spacing", delete
the equality expression "spacingText == 0". If you want to make it
less sensitive, substitute "0" with something greater, such as "0.1".
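
To show what a tolerance would change, here is a rough standalone
sketch. It is not the PDFStreamEngine code: the class SpacingSketch,
the method groupRuns and the "gap > tolerance" comparison are my own
assumptions for illustration. Glyphs are merged into one run until the
spacing after a glyph exceeds the tolerance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the PDFStreamEngine code.
final class SpacingSketch {

    // gapAfter[i] is the spacing applied after glyphs[i]. A gap larger
    // than the tolerance ends the current run (assumed comparison).
    static List<String> groupRuns(String[] glyphs, float[] gapAfter, float tolerance) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            current.append(glyphs[i]);
            boolean last = (i == glyphs.length - 1);
            if (last || gapAfter[i] > tolerance) {
                runs.add(current.toString());
                current.setLength(0);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] glyphs = {"H", "e", "l", "l", "o", "W", "o", "r", "l", "d"};
        float[] gapAfter = {0.01f, 0.01f, 0.01f, 0.01f, 10.0f,
                            0.01f, 0.01f, 0.01f, 0.01f, 0.0f};

        // Tolerance 0: every non-zero gap splits, so one run per glyph.
        System.out.println(groupRuns(glyphs, gapAfter, 0.0f).size());
        // Tolerance 0.1: only the 10.0 gap splits, leaving two runs.
        System.out.println(groupRuns(glyphs, gapAfter, 0.1f));
    }
}
```

With a tolerance of 0 the sample yields ten single-character runs; with
0.1 it yields "Hello" and "World", which is the behaviour you were
expecting.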
VR