On 07/10/2012 10:10 PM, Jeremias Maerki wrote:
On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
My first question is: how is text stored in a PDF? I think there are 2 ways
to store text in a PDF:
a) vector PDF: the PDF contains a line telling it to print a word in a
specific font on a specific location
There are actually two cases here:

(1) PDF text operators (BT, ET, Tj, etc.), which show (string) operands as text using a font; or
(2) Vector line drawing using Bezier curves etc. to represent glyphs.

The former can be extracted by FOP. The latter, which is common in desktop publishing, needs OCR or special vector-to-font matching analysis and AFAIK cannot be processed by FOP.
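
To make case (1) concrete, here is a rough Python sketch of what "text stored via text operators" looks like and why it is easy to pull out. The content-stream fragment and the function name extract_tj_strings are made up for illustration. It assumes an uncompressed stream with plain literal strings; real streams are usually compressed, may use hex strings or TJ arrays, and the string bytes are character codes that still have to be mapped through the font's encoding.

```python
import re

# Hypothetical, uncompressed content-stream fragment (illustration only).
# The BT ... ET block is case (1); the last line is case (2), a path made of
# a move (m) and a Bezier curve (c) that carries no text at all.
SAMPLE_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
( world) Tj
ET
0 0 m 10 10 20 10 30 0 c S
"""

# A literal string operand followed by the Tj operator.
# Simplification: no nested parentheses, escapes, hex strings or TJ arrays.
TJ_PATTERN = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def extract_tj_strings(stream):
    """Return the literal strings shown by Tj inside BT ... ET blocks."""
    results = []
    for block in re.findall(rb"BT(.*?)ET", stream, flags=re.DOTALL):
        for raw in TJ_PATTERN.findall(block):
            # Assumption: bytes map 1:1 to characters, which only holds
            # for simple encodings.
            results.append(raw.decode("latin-1"))
    return results


print(extract_tj_strings(SAMPLE_STREAM))
```

The curve on the last line of the sample yields nothing, which is the point of case (2): if glyphs are drawn as outlines there are no strings to find.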

There is another location where a PDF can carry text, but that's not supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can contain the text of artifacts on a page (e.g. an image). That's used to enable visually impaired people to read certain documents.
It's also generally an unmangled, linebreak-free, column-free version of the text, which can be a real bonus. That is, when it's there and when it's correct, because of course there are tools out there that generate ActualText entries that are empty or full of invalid garbage.

--
Craig Ringer
