Hello VR,

I agree with you that if we had control over the way we store/exchange data, it should be XML. But in our case we are forced to accept PDF.
And about your example, you are saying that "Hello World" would result in two invocations. But with 1.0 it results in 10 or 11 invocations, one for each character.

Anyway, it is not that I must be able to use the processTextPosition method to do my job. What I am trying to say is this: my goal is to be able to say what the "quality of Construction" was for "comparable sale #1" in the image I sent you before. If you understand that goal, then maybe you could tell me whether there is a way to do that with PDFBox. I was able to do that with version 0.8.

Is there a way to set a particular value for Tc, Tw, Tj etc. so that it would behave the way it did before? Just like I have the option to set "setWordSeparator", "setLineSeparator" and "setPageSeparator" to "", effectively ignoring word separation, line separation and page separation respectively for PDFTextStripper.writeText.

Appreciate your help.
Rekha

From: Villu Ruusmann <[email protected]>
To: [email protected]
Cc: [email protected]
Date: 02/19/2010 11:21 AM
Subject: Re: PDFTextStripper.processTextPosition

Hello there,

> I read the link you have sent me. It is above my understanding of the PDFs and PDFBoxTextStripper.
> I am trying to parse this content from the PDF. With 0.8, the PDFTextStripper.processTextPosition()
> was called for every column value (e.g. "Mt. Pleasant, SC 29466-8583").

First of all, your assumption that every "field" should result in exactly one invocation of PDFTextStripper#processTextPosition(TextPosition) is too naive when it comes to real-world PDF documents. Maybe it helps if you consider that there is no such thing as a "white space" literal in PDF.

Imagine a PDF document that prints "Hello World". When this document is rendered by conforming PDF software (for example, Acrobat Reader), the software first draws the string "Hello", leaves some horizontal space, and then draws the string "World".
When this document is processed with PDFBox's utilities such as PDFTextStripper, there would be two invocations of PDFTextStripper#processTextPosition(TextPosition): the first for the string "Hello" and the second for the string "World". It is the responsibility of the application that consumes those TextPositions to figure out (by comparing their relative positions on screen) that they should be combined to yield "Hello World".

> So I thought I will use the getYDirAdj and getXDirAdj methods to sort them and take the values.
> Now I do not know where each of those column values ends. E.g., how will I know "Mt. Pleasant,
> SC 29466-8583" is from one "field" if I get one character at a time, and setSortByPosition(true) also
> doesn't work with processTextPosition(). Could you please tell me if there is a better way of doing that?

The sample you sent to me revealed a rather complex table structure. Assuming this is a fixed layout, you can obtain "fields" if you define the bounding box of each cell (x, y, width, height), collect all the TextPositions that fall into that region, and finally join the collected TextPositions into the result string. You are correct that you must use TextPosition#getXDirAdj, #getYDirAdj and #getWidthDirAdj to do the job.

PDF really isn't a good choice for data storage or exchange. You would be better off if you could obtain this data in some structured format such as XML.

VR
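The bounding-box approach VR describes could be sketched roughly as follows. This is only an illustration, not PDFBox code: `Glyph` is a hypothetical stand-in for the values one would copy out of each TextPosition (its text plus getXDirAdj, getYDirAdj and getWidthDirAdj), and `CellText`, `textInCell` and the `gapTolerance` parameter are made-up names. It also assumes a single-line cell and that a horizontal gap wider than the tolerance means a space.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for values copied from one TextPosition. */
final class Glyph {
    final String text;  // the character(s) carried by the TextPosition
    final float x;      // from getXDirAdj()
    final float y;      // from getYDirAdj()
    final float width;  // from getWidthDirAdj()

    Glyph(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class CellText {
    /**
     * Collects the glyphs whose origin lies inside the cell and joins them
     * left to right, inserting a space where the horizontal gap between two
     * neighbours exceeds gapTolerance (an assumed heuristic).
     */
    static String textInCell(List<Glyph> glyphs, float cellX, float cellY,
                             float cellW, float cellH, float gapTolerance) {
        List<Glyph> inside = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean inX = g.x >= cellX && g.x < cellX + cellW;
            boolean inY = g.y >= cellY && g.y < cellY + cellH;
            if (inX && inY) {
                inside.add(g);
            }
        }
        // Single-line cell assumed, so ordering by x is enough.
        inside.sort((a, b) -> Float.compare(a.x, b.x));

        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : inside) {
            if (prev != null && g.x - (prev.x + prev.width) > gapTolerance) {
                sb.append(' ');
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

In a real PDFTextStripper subclass, the glyph list would be filled from an overridden processTextPosition, and textInCell would be called once per cell after the page has been processed. Multi-line cells would additionally need grouping by y before sorting by x.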

