Hi Villu Ruusmann,

Do you think disabling "character spacing" will be made a little easier, such as by setting a property or passing a value to a method, in later versions of PDFBox? Since the method you suggested changing does a lot of things, I am hesitant to override it.
Please let me know.

Thank you.

Regards,
Rekha

From: Villu Ruusmann <[email protected]>
To: [email protected]
Cc: [email protected]
Date: 02/19/2010 01:18 PM
Subject: Re: PDFTextStripper.processTextPosition

Hello there,

> And about your example, you are saying that "Hello World" would result in two invocations.
> But 1.0 results in 10 or 11 invocations - once for each character.

Your PDF document contains a "character spacing" instruction, which states that all characters should be painted away from each other. Like this:

    "H"(0.01)"e"(0.01)"l"(0.01)"l"(0.01)"o"(10.0)"W"(0.01)"o"(0.01)"r"(0.01)"l"(0.01)"d"

PDFBox 0.8.0 did not honour this instruction, but PDFBox 1.0.X does. I must admit that this is annoying when dealing with small "character spacing" values (< 0.1).

> Anyway, it is not that I should be able to use the processTextPosition method to do my job.
> What I am trying to say is - if you understood my goal - I should be able to say what the
> "quality of Construction" was for "comparable sale #1" in the image I sent you before,
> then maybe you could tell me if there is a way to do that with PDFBox.

I looked it up from the image - the bounding box of that cell is [x=610, y=520, width=180, height=30]. You can use class PDFTextStripperByArea instead of PDFTextStripper:

    import java.awt.geom.Rectangle2D;
    import org.apache.pdfbox.util.PDFTextStripperByArea;

    PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
    textStripper.addRegion("CS1-QoC", new Rectangle2D.Float(610, 520, 180, 30)); // Define the symbolic name and the bounding box of the field
    // .. Add more fields as needed
    textStripper.extractRegions(pdfPage);
    String qualityOfConstrForCompSale1 = textStripper.getTextForRegion("CS1-QoC"); // Retrieve the value of the field by the symbolic name

> I was able to do that with version 0.8. Is there a way to set a particular value to Tc, Tw, Tj etc
> so that it would behave the way it did before?
> Just like I have the option to set the
> "setWordSeparator", "setLineSeparator" and "setPageSeparator" to "" - effectively ignoring word
> separation, line separation and page separation respectively for PDFTextStripper.writeText.

You could modify class org.apache.pdfbox.util.PDFStreamEngine to suit your needs. If I'm not mistaken, the logic which controls the processing of characters is located on lines 481-484 (as of SVN revision 908338). If you want to disable "character spacing", delete the equality expression "spacingText == 0". If you want to make it less sensitive, substitute "0" with something greater such as "0.1".

VR
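
To make the threshold idea concrete, here is a minimal, self-contained sketch of how a spacing threshold changes the number of text runs for the "HelloWorld" example discussed in this thread. It does not use or reproduce any PDFBox code: the class SpacingDemo and the method groupRuns are hypothetical names, and the rule "a gap larger than the threshold starts a new run" is an assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox.
class SpacingDemo {

    /**
     * Groups glyphs into runs. gaps[i] is the spacing that follows glyphs[i],
     * so gaps has one element fewer than glyphs. A gap greater than the
     * threshold ends the current run (assumed rule, for illustration).
     */
    static List<String> groupRuns(String[] glyphs, double[] gaps, double threshold) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            current.append(glyphs[i]);
            boolean last = (i == glyphs.length - 1);
            // The last glyph always closes the run; otherwise compare the gap.
            if (last || gaps[i] > threshold) {
                runs.add(current.toString());
                current.setLength(0);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] glyphs = {"H", "e", "l", "l", "o", "W", "o", "r", "l", "d"};
        double[] gaps = {0.01, 0.01, 0.01, 0.01, 10.0, 0.01, 0.01, 0.01, 0.01};

        // Threshold 0: every non-zero gap splits, one run per character.
        System.out.println(groupRuns(glyphs, gaps, 0.0).size()); // prints 10

        // Threshold 0.1: only the large gap splits, giving two runs.
        System.out.println(groupRuns(glyphs, gaps, 0.1)); // prints [Hello, World]
    }
}
```

With a threshold of 0 the sketch yields one run per character, and with 0.1 it yields two runs, which matches the per-character versus per-word behaviour described above.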

