Amir,

Paragraphs are normally separated by "\n", so splitting the text on "\n" sounds feasible. However, the text extracted from the PDF seems to contain many extra "\n"s, not just one per paragraph, which makes it impossible to recover paragraphs that way. I don't even think there is a way to do this using PDFBox.
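To make the difficulty concrete, here is a minimal Java sketch of a rule-based guess at paragraph boundaries. It is only a crude heuristic under stated assumptions, not a PDFBox feature: it assumes the input is the plain string returned by stripper.getText(doc), and it starts a new paragraph at a blank line, or when a line ends with sentence punctuation and the next line starts with an uppercase letter. The names ParagraphSplitter, Paragraph, and split are hypothetical. The recorded positions are character offsets into the extracted string, not coordinates on the PDF page.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSplitter {

    /** A guessed paragraph plus its character offsets in the extracted text. */
    static final class Paragraph {
        final int start;   // offset of the paragraph's first line (inclusive)
        final int end;     // offset just past its last line (exclusive)
        final String text; // the paragraph's lines, trimmed and joined by spaces

        Paragraph(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Splits extracted text into guessed paragraphs (heuristic, may be wrong). */
    static List<Paragraph> split(String txt) {
        List<Paragraph> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        int paraStart = -1;
        int paraEnd = -1;
        String prev = null; // previous non-blank line, or null after a blank line
        int pos = 0;
        while (pos <= txt.length()) {
            int nl = txt.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? txt.length() : nl;
            String line = txt.substring(pos, lineEnd).trim();

            // Assumed boundary rule: blank line, or "sentence end + capital letter".
            boolean boundary = line.isEmpty()
                    || (prev != null && endsSentence(prev) && startsUpper(line));
            if (boundary && buf.length() > 0) {
                out.add(new Paragraph(paraStart, paraEnd, buf.toString()));
                buf.setLength(0);
            }
            if (line.isEmpty()) {
                prev = null;
            } else {
                if (buf.length() == 0) {
                    paraStart = pos;  // first line of a new paragraph
                } else {
                    buf.append(' ');  // re-join a wrapped line
                }
                buf.append(line);
                paraEnd = lineEnd;
                prev = line;
            }
            pos = lineEnd + 1;
        }
        if (buf.length() > 0) {
            out.add(new Paragraph(paraStart, paraEnd, buf.toString()));
        }
        return out;
    }

    private static boolean endsSentence(String s) {
        char last = s.charAt(s.length() - 1);
        return last == '.' || last == '?' || last == '!';
    }

    private static boolean startsUpper(String s) {
        return Character.isUpperCase(s.charAt(0));
    }

    public static void main(String[] args) {
        String txt = "First paragraph starts here and\nwraps onto a second line.\n"
                + "Second paragraph begins.\n\nThird one\n";
        for (Paragraph p : split(txt)) {
            System.out.println("[" + p.start + ", " + p.end + ") " + p.text);
        }
    }
}
```

A rule like this misfires whenever a sentence happens to end exactly at a line break inside a paragraph, which is why a trained boundary classifier is likely to do better on real documents.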
One possible solution would be to build a classifier that detects the boundaries between paragraphs. I also suggest reading up on the subject of "topic boundary detection".

Regards,

On Mon, Aug 4, 2014 at 9:53 AM, Amir H. Jadidinejad <[email protected]> wrote:
>
> I'm going to extract the content of a PDF file using the PDFBox library.
> The content should be processed paragraph by paragraph, and for each
> paragraph I need its position for follow-up processing. Using the
> following code, I can extract the whole content of an input PDF:
>
> PDDocument doc = PDDocument.load(file);
> PDFTextStripper stripper = new PDFTextStripper();
> String txt = stripper.getText(doc);
> doc.close();
>
> I have two problems:
>
> 1. I don't know how to extract the content paragraph by paragraph.
> 2. I don't know how to store the position of a paragraph for follow-up
>    processing (for example, highlighting).
>
> Thanks.

--
Qingchao Kong
Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences
No. 95 Zhongguancun East Road, Haidian District, Beijing 100190, China

