Parsing Paragraphs from PDF.

Jeremy Arnold Wed, 23 Mar 2011 13:59:25 -0700

I am trying to parse some specific paragraphs from PDFs. I first tried
converting the PDF to HTML, but that produced a lot of <p> tags that
seemed to have no correlation to the actual paragraphs in my PDF.


Each paragraph has a header that is bold and set in a different font
from the body text.

Is there any way to grab text based on the font used? I was thinking I
could grab all the text between two lines that carry the header's
font/weight information. Is this possible? If not, can anyone
recommend another way to grab specific paragraphs from a PDF? I have a
few thousand PDFs, each with a paragraph under the header 'Summary'.
I'd like to pull out the paragraphs that belong to that summary and
display them on the web.
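
To make the idea concrete, here is a rough sketch of the logic I have in
mind. It assumes some extraction step has already turned each PDF into a
list of text lines with font information attached. The Line type, the
font names, and the extract_section helper are all made up for
illustration, not taken from any particular PDF library.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    """One extracted line of text plus its (assumed) font information."""
    text: str
    font: str   # font name as reported by the extractor (hypothetical)
    bold: bool  # True if the line is set in a bold weight


def is_header(line: Line, header_font: str) -> bool:
    # A header is any bold line set in the header font.
    return line.bold and line.font == header_font


def extract_section(lines: List[Line], title: str, header_font: str) -> str:
    """Return the body text between the header `title` and the next header."""
    collected = []
    in_section = False
    for line in lines:
        if is_header(line, header_font):
            if in_section:
                break  # reached the next header, so the section is over
            in_section = line.text.strip().lower() == title.lower()
            continue  # never include header text in the result
        if in_section:
            collected.append(line.text.strip())
    return " ".join(collected)


# Made-up sample data standing in for one parsed PDF.
sample = [
    Line("Introduction", "Helvetica-Bold", True),
    Line("Some intro text.", "Times-Roman", False),
    Line("Summary", "Helvetica-Bold", True),
    Line("First line of the summary", "Times-Roman", False),
    Line("continues here.", "Times-Roman", False),
    Line("Details", "Helvetica-Bold", True),
    Line("Other text.", "Times-Roman", False),
]

print(extract_section(sample, "Summary", "Helvetica-Bold"))
```

The missing piece is the extraction step that would produce those Line
records from a real PDF, which is what I am asking about.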


Thanks!
Jeremy
