Re: How does PDFBox extract text from a PDF?

Jochen Hebbrecht Wed, 11 Jul 2012 02:49:55 -0700

Hi Jeremias,

I just used the PDFDebugger and it actually makes a lot of sense now! :-)
Thanks!


Jochen



On 10 Jul 2012, at 20:32, Jeremias Maerki wrote:

> Hi Jochen,
> 
> There is no "extra text layer", not even a "text layer" in PDF. Text
> painting operators are just operators like those for painting lines,
> curves and bitmaps.
> 
> When I wrote that OCR programs write white-on-white text behind the
> scanned bitmap, I meant that the text operators are usually placed
> before the operator that paints the bitmap. The bitmap thus lies over
> the text simply because it was painted after the text. But there is
> probably no actual "layer". The so-called "optional content
> groups" (OCG, since PDF 1.5) are sometimes used to create something like
> a "layer" which can be disabled and enabled etc. Good OCR programs
> probably create an OCG if the text
> 
> If you want to know if a PDF is OCRed, just run PDFBox's text extraction.
> If you get no text, you can probably try to run the OCR process. The
> result of running OCR on an OCRed PDF is application-specific. There's
> no single answer for that.
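
A minimal sketch of that check in Java, for illustration only. It assumes the text has already been extracted, for example with PDFBox's PDFTextStripper.getText(document); the class name OcrCheck and the method probablyNeedsOcr are made up for this example and are not part of PDFBox.

```java
// Sketch only: decides whether a PDF probably still needs OCR, given the
// text that a text extractor returned for it. The names are hypothetical.
final class OcrCheck {
    private OcrCheck() {}

    /** Returns true if the extracted text contains no visible characters. */
    static boolean probablyNeedsOcr(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        // Extractors may emit spaces and line breaks even for pages
        // without any text operators, so ignore surrounding whitespace.
        return extractedText.trim().isEmpty();
    }
}
```

A page that contains only vector graphics and no scan would also yield empty text, so an empty result is a hint rather than proof that OCR is needed.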
> 
> Here's an extract (with my comments) of a scanned page that I've run
> through Readiris Pro 12 (not the best OCR tool BTW):
> 
> BT                         % begin text object
> 3 Tr                       % text rendering mode 3: invisible (neither fill nor stroke)
> 1 0 0 1 0 846 Tm           % text matrix (position, scale...)
> 138.48 -58.56 Td           % move text position
> /F00 23 Tf                 % select font /F00, size 23 (internally mapped to TimesNewRoman)
> (FS) Tj                    % write "FS"
> 34.8 0 Td                  % move text position
> (Hotel-) Tj                % write "Hotel-" etc. etc.
> 75.12 0 Td
> (Stuttgart) Tj
> 93.6 0 Td
> (-) Tj
> 13.44 0 Td
> (Böblingen) Tj
> 56.4 -65.28 Td
> /F10 9 Tf
> (Wolf) Tj
> 23.04 0 Td
> (-) Tj
> 8.88 0 Td
> (Hirth) Tj
> 23.28 0 Td
> (-) Tj
> 8.64 0 Td
> (Straße) Tj
> 
> [.....]
> 
> 6 0 Td                    % move text position
> (Stuttgart) Tj            % write "Stuttgart"
> ET                        % end text object
> q                         % save graphics state
> 601.92 0 0 846 0 0 cm    % concatenate transformation matrix (position, scale etc.)
> /img0 Do                 % Paint bitmap /img0 (the scanned page)
> Q                        % restore graphics state
> 
> So, just a bitmap painted over the recognized text. No layers, they
> didn't even bother to paint the text in white.
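
To make the point concrete, here is a rough Java sketch that walks a content stream fragment like the one above and collects the strings shown with Tj. It is a toy scanner, not how PDFBox does it: TjScanner and shownStrings are hypothetical names, and it ignores TJ arrays, hex strings, octal escapes and font encodings (string bytes are simply taken as characters).

```java
import java.util.ArrayList;
import java.util.List;

// Toy scanner (hypothetical, not PDFBox API): collects the literal strings
// that are immediately followed by the Tj operator in a content stream.
final class TjScanner {
    private TjScanner() {}

    static List<String> shownStrings(String content) {
        List<String> result = new ArrayList<>();
        String pending = null;          // last string operand seen, if any
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '%') {
                // Comment: skip to the end of the line.
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') {
                    i++;
                }
            } else if (c == '(') {
                // Literal string: parentheses may nest, backslash escapes
                // the next character (simplified, no octal escapes).
                StringBuilder sb = new StringBuilder();
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char d = content.charAt(i++);
                    if (d == '\\' && i < n) {
                        sb.append(content.charAt(i++));
                    } else if (d == '(') {
                        depth++;
                        sb.append(d);
                    } else if (d == ')') {
                        depth--;
                        if (depth > 0) {
                            sb.append(d);
                        }
                    } else {
                        sb.append(d);
                    }
                }
                pending = sb.toString();
            } else {
                // Any other token: a number, a name or an operator.
                int start = i;
                while (i < n && !Character.isWhitespace(content.charAt(i))
                        && content.charAt(i) != '(' && content.charAt(i) != '%') {
                    i++;
                }
                String token = content.substring(start, i);
                if (token.equals("Tj") && pending != null) {
                    result.add(pending);
                }
                pending = null;
            }
        }
        return result;
    }
}
```

The scanner finds the strings regardless of the /img0 Do that follows them: painting order affects what is visible on the page, not what the text operators contain.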
> 
> Jochen, fire up PDFBox's PDFDebugger [1] and load a few PDFs and browse
> through the object tree. Look around. That'll give you a feeling of
> what's in a PDF. Then download the PDF specification. It's not written
> in Hieroglyphs or Klingon. ;-)
> 
> [1] http://pdfbox.apache.org/commandlineutilities/PDFDebugger.html
> 
> Jeremias Maerki
> 
> 
> On 10.07.2012 19:41:24 Jochen Hebbrecht wrote:
>> Hi Jeremias,
>> 
>> No, I'm not having any trouble at all :-). I'm just curious about how
>> PDFBox works and how Adobe designed its PDF format.
>> On this page
>> (http://en.wikipedia.org/wiki/Portable_Document_Format#Adobe.27s_versions),
>> you can see all previous (and current) versions of the PDF format. Can any
>> of these versions support the text layer? What does Adobe call this "extra
>> text layer"? Wikipedia has no information about the technical details of
>> this "text layer".
>> 
>> Can we use PDFBox to detect whether an image has been OCR'd? Or do we just
>> try to get the contents and, if the contents are null, run the PDF through
>> some kind of OCR engine?
>> 
>> And what happens if we try to OCR a PDF which was already OCR'd? Do we
>> then have an extra "text layer"? So one image, one layer from the first
>> OCR and one layer from the second OCR?
>> 
>> Jochen
>> 
>> 
>> -----Original Message-----
>> From: Jeremias Maerki [mailto:[email protected]] 
>> Sent: Tuesday 10 July 2012 16:11
>> To: [email protected]
>> Subject: Re: How does PDFBox extract text from a PDF?
>> 
>> 
>> On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
>>> My first question is: how is text stored in a PDF? I think there are 2 
>>> ways to store text in a PDF:
>>> a) vector PDF: the PDF contains a line telling it to print a word in a 
>>> specific font on a specific location
>> 
>> That's the usual case, yes.
>> 
>>> b) OCR text has been added to the image as an extra layer (I think 
>>> this is called the XMP metadata)
>> 
>> No, actually OCR software usually just adds white-on-white text behind
>> the bitmap. This would technically be like your a).
>> 
>> XMP Metadata is really just for metadata, not actual text content.
>> 
>>> Is this information correct?
>>> 
>>> So, if PDFBox wants to extract text from a PDF, how does it extract 
>>> the data? Is it looking at the XMP metadata? Or the vector details?
>>> Any developer wanting to help me on this issue?
>> 
>> PDFBox interprets the text painting operators (as if it were painting the
>> PDF), looks up the actual character for a code point (character "a"
>> might be at code point 7 (or whatever) when a subset CID font is used, for
>> example) and emits that as Unicode text. Well, that's simplified.
>> There are some additional heuristics for things like placement and order of
>> text but that doesn't really affect the actual process of extracting text.
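
The lookup step could be sketched like this in Java. The table is hypothetical and stands in for whatever the font provides (its encoding or a ToUnicode CMap); CodeToUnicode, map and decode are illustrative names, not PDFBox API, and real fonts need multi-byte codes and further fallbacks.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: maps character codes from a text-showing operator to
// Unicode via a lookup table. Names and table contents are hypothetical.
final class CodeToUnicode {
    private final Map<Integer, String> table = new HashMap<>();

    /** Registers that the given character code stands for the given text. */
    void map(int code, String unicode) {
        table.put(code, unicode);
    }

    /** Decodes a sequence of character codes; unknown codes become U+FFFD. */
    String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String s = table.get(code);
            sb.append(s != null ? s : "\uFFFD");
        }
        return sb.toString();
    }
}
```

With a subset font where code 7 happens to stand for "a", one would call map(7, "a") and then decode(new int[] {7}) to get "a" back; the code value itself carries no meaning without the table.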
>> 
>> There is another location where a PDF can carry text but that's not
>> supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can
>> contain text of artifacts on a page (e.g. an image). That's used for enabling
>> visually impaired people to read certain documents.
>> 
>> I guess the question is: what are you trying to do? Do you have a problem
>> you're trying to solve?
>> 
>> If you want to learn about how text is put into a PDF, run PDFBox's
>> PDFDebugger and open a random PDF. That allows you to explore all the
>> details of a PDF. Quite enlightening if you don't know the PDF specification
>> by heart.
>> 
>> Jeremias Maerki
>> 
>
