Re: Parsing Paragraphs from PDF.

Michael Howard Wed, 23 Mar 2011 14:34:48 -0700

On Wed, Mar 23, 2011 at 4:58 PM, Jeremy Arnold
<[email protected]> wrote:
[snip]
> Otherwise can anyone
> recommend another way to go about grabbing specific paragraphs from a
> PDF? I have a few thousand PDFs with a paragraph that has a header of
> 'Summary'. I'd like to pull out the paragraphs associated with the


I am not sure how regular your documents are but ...

My first attempt would not involve using PDFBox.

The first thing I would try is the pdftotext command line tool that
is part of poppler-utils. It will not give you any font information.
However, it does let you specify the region of the page from which to
extract text (the -x, -y, -W and -H options). For example, you can use
this to eliminate headers, footers and sidebars.

You may also find the -layout option useful for retaining the
approximate spacing on the page. If paragraphs are separated by
vertical space then you will get a blank line between paragraphs.
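
As a rough sketch of the pdftotext step, here is how the call could be
scripted over many files. It is written in Python purely for
illustration. The function names and the crop numbers in the usage
note are made up; the right region depends entirely on your page
layout, so check the pdftotext man page for how the coordinates are
measured.

```python
import subprocess

def build_pdftotext_cmd(pdf_path, x, y, width, height, layout=True):
    """Build a pdftotext command that crops to a region and writes to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep approximate spacing on the page
    # -x/-y: top-left corner of the crop area; -W/-H: its width and height
    cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    cmd += [pdf_path, "-"]  # "-" as the output file sends the text to stdout
    return cmd

def extract_region(pdf_path, x, y, width, height):
    """Run pdftotext (poppler-utils must be installed) and return the text."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, x, y, width, height),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like extract_region("report.pdf", 50, 80, 500, 650) would
then return the text of that region only, with headers and footers
outside it left out.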

I would then take the output text and run it through Perl regular
expressions. If your target text begins with a header that always
says 'Summary' and is one paragraph long then it might be pretty easy
to identify the target text as lying between 'Summary' and the first
blank line after it.
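
A minimal sketch of that regular expression step, shown in Python
rather than Perl (the pattern itself should carry over to Perl with
little change). It assumes 'Summary' sits alone on its own line and
that the paragraph ends at the first blank line or at the end of the
text; extract_summary is just an illustrative name.

```python
import re

# A line holding only "Summary", then everything up to the next blank
# line (or the end of the text if no blank line follows).
SUMMARY_RE = re.compile(
    r"^[ \t]*Summary[ \t]*\n(.*?)(?=\n[ \t]*\n|\Z)",
    re.MULTILINE | re.DOTALL,
)

def extract_summary(text):
    """Return the paragraph under the 'Summary' header, or None if absent."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    # Collapse line breaks and layout padding into single spaces.
    return " ".join(match.group(1).split())
```

If your documents are less regular than that (a 'Summary:' with a
colon, a header on the same line as the text, a summary that runs
over a page break), the pattern would need adjusting.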

Good luck.
