Re: How to find the position of a specific paragraph in the input PDF?

Qingchao Kong Mon, 04 Aug 2014 19:21:29 -0700

Amir,
Paragraphs are nominally separated by "\n", so splitting the text on
"\n" sounds feasible. In practice, though, the text extracted from the
PDF contains many extra "\n"s (typically one at the end of every
visual line), which makes it impractical to recover paragraphs that
way. I don't even think there is a way to do this using PDFBox.
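
To make the "\n" problem concrete, here is a rough sketch of one
heuristic, not something PDFBox provides. It assumes that a blank line
ends a paragraph and that a single "\n" is only a line wrap; whether
that holds depends on the PDF. ParagraphSplitter and Paragraph are
made-up names, and the recorded position is a character offset into the
extracted string, not a location on the page.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphSplitter {

    /** A paragraph and the character offset where it starts in the extracted text. */
    public static final class Paragraph {
        public final int start;
        public final String text;

        Paragraph(int start, String text) {
            this.start = start;
            this.text = text;
        }
    }

    /**
     * Heuristic (an assumption, not PDFBox behaviour): a blank line ends a
     * paragraph, while a single "\n" is treated as a line wrap inside it.
     */
    public static List<Paragraph> split(String txt) {
        List<Paragraph> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int start = -1;
        int offset = 0; // offset of the current line within txt
        for (String line : txt.split("\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Blank line: close the paragraph collected so far, if any.
                if (current.length() > 0) {
                    result.add(new Paragraph(start, current.toString()));
                    current.setLength(0);
                }
            } else {
                if (current.length() == 0) {
                    // First line of a new paragraph: remember where it begins.
                    start = offset + line.indexOf(trimmed);
                } else {
                    // Wrapped line: join it to the previous one with a space.
                    current.append(' ');
                }
                current.append(trimmed);
            }
            offset += line.length() + 1; // +1 for the "\n" removed by split
        }
        if (current.length() > 0) {
            result.add(new Paragraph(start, current.toString()));
        }
        return result;
    }
}
```

Calling ParagraphSplitter.split(txt) on the string returned by
stripper.getText(doc) gives the paragraphs in reading order. If the PDF
has no blank lines between paragraphs, everything collapses into one
paragraph, which is exactly the difficulty described above.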

One possible solution would be constructing classifiers to
discriminate the boundary between different paragraphs.
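
As a very rough illustration of what such a classifier might look at,
the sketch below scores the gap between two consecutive lines. The
features, weights and threshold are hand-picked guesses for
illustration only; a real classifier would learn them from labelled
examples. BoundaryScorer is a made-up name.

```java
public class BoundaryScorer {

    /**
     * Scores how likely it is that a paragraph ends after prev and a new one
     * starts at next. Features and weights are illustrative guesses, not
     * trained values.
     */
    public static double score(String prev, String next, int maxLineLength) {
        String p = prev.trim();
        String n = next.trim();
        if (p.isEmpty() || n.isEmpty()) {
            return 1.0; // a blank line is treated as a certain boundary
        }
        double s = 0.0;
        char last = p.charAt(p.length() - 1);
        if (last == '.' || last == '?' || last == '!' || last == ':') {
            s += 0.4; // previous line ends with sentence-final punctuation
        }
        if (p.length() < 0.7 * maxLineLength) {
            s += 0.4; // previous line stops well short of the right margin
        }
        if (Character.isUpperCase(n.charAt(0))) {
            s += 0.2; // next line starts with a capital letter
        }
        return s;
    }

    /** Applies an arbitrary threshold to the score. */
    public static boolean isBoundary(String prev, String next, int maxLineLength) {
        return score(prev, next, maxLineLength) >= 0.7;
    }
}
```

Here maxLineLength would be the length of the longest line in the
extracted text, so that a short final line of a paragraph stands out.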

I also suggest reading up on the subject of "topic boundary detection".

Regards,

On Mon, Aug 4, 2014 at 9:53 AM, Amir H. Jadidinejad
<[email protected]> wrote:
>
>
> I'm going to extract the content of a PDF file using the PDFBox library. The
> content should be processed paragraph by paragraph, and for each paragraph I
> need its position for follow-up processing. Using the following code, I can
> extract the whole content of an input PDF:
>
> PDDocument doc = PDDocument.load(file);
> PDFTextStripper stripper = new PDFTextStripper();
> String txt = stripper.getText(doc);
> doc.close();
>
> I have two problems:
>
>     1. I don't know how to extract the content paragraph by paragraph.
>     2. I don't know how to store the position of a paragraph for follow-up
> processing (for example, highlighting).
>
> Thanks.

-- 
Qingchao Kong

Ph.D. Candidate
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation, Chinese Academy of Sciences

No. 95 Zhongguancun East Road
Haidian District, Beijing 100190 China
