RE: How to keep PDF format when extracting text

Eric Douglas Thu, 26 May 2011 07:29:29 -0700

This sounds a bit vague.  "PDF format" sounds like you're creating a PDF, but 
your description sounds more like you're getting text from a PDF and trying to 
make it look like it does in the PDF.  Are you trying to modify a PDF, or are 
you just losing font information on extracted text?
Is the font information embedded?
Do you have any samples of your text extraction code or a PDF you're extracting?

-----Original Message-----
From: Jack Bush [mailto:[email protected]] 
Sent: Thursday, May 26, 2011 10:12 AM
To: [email protected]
Subject: How to keep PDF format when extracting text

Hi All,

I have no problem extracting text from a PDF document using 
pdfbox-app-1.5.0.jar, but I found that the formatting has been lost. I also 
downloaded fontbox-1.5.0.jar and jempbox-1.5.0.jar, but I am not sure how to 
use them to make the extracted text file as close to the original PDF file as 
possible.

Is there any good documentation on this topic that uses the recent jars? I 
found some material through Google, but it either uses a much earlier version 
(0.8) of PDFBox or the explanation is insufficient to follow. It is not in the 
PDFBox FAQ.

Do you have an archived mailing list I could look through?

Many thanks,

Jack
