Dear Eliot,

I do appreciate your comprehensive response. It was really informative for me. Thank you.

Amir
________________________________
From: Eliot Kimber <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, August 5, 2014 7:43 AM
Subject: Re: How to find the position of a specific paragraph in the input PDF?

Detecting paragraphs is a "hard problem": there is nothing inherent in the PDF data that will reliably tell you where paragraph boundaries are. Some PDF documents may have more reliable indicators than others, but unless you're working with a very specific set of documents you can't depend on it.

The only reasonably-complete solution is to analyze the x/y location of each line of text and use a heuristic to guess at paragraph boundaries, where the heuristic will depend on how the paragraphs are indicated in the document at hand: extra vertical space, first line indent, etc.

In the easy case, the characters of the text line will be contiguous in the PDF data stream. In the hard case, the characters will not be contiguous and you will need to use each string's x/y position to build up a single line (PDFBox may have utilities for this, I don't know). You'll need to again use heuristics to determine that a given character is or is not within a line (for example, superscripts and subscripts will not have the same Y origin as other characters in the same line, but they are definitely part of the line).

Even then, if you have a multi-column document you have the further challenge of detecting the column boundary--if you must use the x/y positions of the characters to detect lines horizontally, you then have to have some way of distinguishing a normal interword space from the gap between columns. This may require configuring your tool with the boundaries of each column ("zoning"). Likewise, you may need to define zones to distinguish the headers and footers from the main body content.
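To make the line-position heuristic above concrete, here is a minimal sketch in plain Java. It is an illustration only, not anything PDFBox provides: the `ParagraphGrouper` class, its `Line` type, and all thresholds are hypothetical, and it assumes the lines have already been assembled, are in reading order, and use a y coordinate that grows down the page. With PDFBox, one way to collect such positions is probably to subclass `PDFTextStripper` and override `writeString(String, List<TextPosition>)`, but that part is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: groups already-assembled text lines into paragraphs. */
class ParagraphGrouper {

    /** One text line: x of its left edge, y of its baseline (assumed to grow down the page). */
    static final class Line {
        final double x;
        final double y;
        final String text;

        Line(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Starts a new paragraph when the vertical gap to the previous line exceeds
     * leading * gapFactor, or when the line is indented more than minIndent
     * from leftMargin. All four parameters must be tuned per document.
     */
    static List<String> group(List<Line> lines, double leading, double gapFactor,
                              double leftMargin, double minIndent) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean bigGap = previous != null && (line.y - previous.y) > leading * gapFactor;
            boolean indented = (line.x - leftMargin) > minIndent;
            if (previous != null && (bigGap || indented)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

The sketch deliberately ignores columns, headers, footers, and page breaks; those would need the zoning described above before the lines are passed in.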
If you just need to reproduce the visual look of the page, say in HTML, then it's not so hard: you just treat each separately-placed sequence of characters as an absolutely-positioned <div> with appropriate styling applied (which you can get from the PDF data).

But if you need to try to reconstitute the logical structure of the document, that is much harder. If your pages are regular pages of simple text, the problem isn't too hard. But if you have things like figures and tables then the problem becomes harder. If you need to detect paragraphs that span page boundaries, then you have the challenge of distinguishing a paragraph that happens to end at the bottom of a page from one that does not.

So there cannot be a general "get all the paragraphs in PDF" function--even if you have general code it must be tuned with the details of a given document or set of documents.

I know this from work I did more than 10 years ago to convert PDFs of published books into the format used by the Sony EReader product (it used a proprietary XML language as input). I'm sure PDFBox has improved since then (we used it as the basis for our tool), but PDF itself has not changed materially and certainly the tools that produce it are not necessarily any better now than they were then.

We did pretty well with simple fiction books that had little or no content except paragraphs, but it still required zoning and so forth. In one document everything was coming out correctly except the first character of the first paragraph in a chapter, which always ended up at the end of the page. I finally realized that that character was a drop capital and it happened to be the last character in the data stream for the page, but its X/Y position put it first in the reading order--the typesetting system (probably Quark at that time) put the drop cap last in the data because it was placed by the operator after all the other content was placed.
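The drop-cap case above is an argument for ordering text fragments by position rather than by their order in the data stream. The following is a rough sketch of that idea in plain Java; the `ReadingOrder` class, the `Fragment` type, and the tolerance value are hypothetical, and it assumes each fragment's y is its top edge, growing down the page, so that a drop cap shares its y with the first line. PDFBox's `PDFTextStripper` has a `setSortByPosition(boolean)` setting aimed at the same problem, which may be enough for simple documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: rebuilds lines from fragments by position, ignoring stream order. */
class ReadingOrder {

    /** A separately placed run of characters; y is assumed to be its top edge. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Sorts fragments top to bottom, puts fragments whose y lies within
     * yTolerance of a row's first fragment into that row, then orders each
     * row left to right and concatenates its text.
     */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : row) {
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

The tolerance also gives superscripts and subscripts a chance to land in the right row, but a single-column page is assumed; with multiple columns the fragments would first have to be split by zone.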
While it's unlikely, you could have perverse PDFs where each character is separately drawn and the characters occur in some random order different from the reading order.

Cheers,

Eliot
--
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com

On 8/4/14, 10:20 PM, "Qingchao Kong" <[email protected]> wrote:

>Amir,
>Paragraphs are separated by "\n", so it sounds feasible to split the
>text by "\n". But the text extracted from the PDF seems to contain
>many "\n"s and would make it impossible to extract paragraphs. I even
>don't think there is a way to do this using PDFBox.
>
>One possible solution would be constructing classifiers to
>discriminate the boundary between different paragraphs.
>
>I also suggest you get to know the subject "topic boundary detection".
>
>Regards,
>
>On Mon, Aug 4, 2014 at 9:53 AM, Amir H. Jadidinejad
><[email protected]> wrote:
>>
>>
>> I'm going to extract the content of a PDF file using PDFBox library.
>>The content should be processed paragraph-by-paragraph and for each
>>paragraph, I need its position for follow-up processing. Using the
>>following code, I can extract the whole content of an input PDF:
>>
>> PDDocument doc = PDDocument.load(file);
>> PDFTextStripper stripper = new PDFTextStripper();
>> String txt = stripper.getText(doc);
>> doc.close();
>>
>> I have two problems:
>>
>> 1. I don't know how to extract the content paragraph by paragraph.
>> 2. I don't know how to store the position of a paragraph for
>>follow-up processing (for example highlighting and etc.)
>>
>> Thanks.
>
>
>
>--
>Qingchao Kong
>
>Ph.D. Candidate
>State Key Laboratory of Management and Control for Complex Systems
>Institute of Automation, Chinese Academy of Sciences
>
>No. 95 Zhongguancun East Road
>Haidian District, Beijing 100190 China

